Prompt Automatic Iterative Refinement (PAIR) & Tree of Attacks with Pruning (TAP)

This section has a series of coding problems using PyTorch. As always, we highly recommend you read all the content on this page before starting the coding exercises.

Motivation

In the previous subsection, we looked at token-level, white-box jailbreaks: attacks that require access to the loss of some model given an input and target sequence. But what if we want to jailbreak a model like ChatGPT, where we don't have white-box access? This requires a black-box algorithm that doesn't rely on access to model internals. One of the earliest and most famous algorithms that achieved this goal is Prompt Automatic Iterative Refinement, or PAIR (Chao et al., 2024). PAIR was additionally developed with the goal of improving the efficiency of token-level jailbreaks like GCG, which (as you likely experienced) require lots of queries and generally lack interpretability.

Prompt Automatic Iterative Refinement (PAIR)

PAIR Algorithm

Fig. 1: Prompt Automatic Iterative Refinement (PAIR), modified from Chao et al. (2024)

The crux of the algorithm is simple: an attacker LLM tries to jailbreak a target LLM by iteratively refining a given prompt. Before looking at the pseudocode, however, we'll first define the notation used by the paper. Let $R \sim q_M(P)$ represent sampling the response $R$ from model $M$ when queried with prompt $P$. Let $S = \texttt{JUDGE}(P, R)$ be a binary score from a judge LLM, with $1$ indicating that a jailbreak has occurred and $0$ indicating that no jailbreak has occurred given prompt $P$ and response $R$. We additionally define model $A$ as the attacker LLM and model $T$ as the target LLM that model $A$ tries to jailbreak.

Now, let's look at the algorithm:

$$\begin{array}{l} \text{\bf Algorithm: PAIR (Single Stream)} \\ \hline \text{\bf Input: } \text{Number of iterations } K, \text{ attack objective } O \\ \text{\bf Initialize: } \text{system prompt of } A \text{ with } O \\ \text{\bf Initialize: } \text{conversation history } C = [\ ] \\[0.375em] \text{\bf for } K \text{ steps } \text{\bf do} \\ \quad \text{Sample } P \sim q_A(C) \\ \quad \text{Sample } R \sim q_T(P) \\ \quad S \leftarrow \texttt{JUDGE}(P, R) \\ \quad \text{\bf if } S == 1 \text{ \bf then} \\ \quad \quad \text{\bf return } P \\ \quad \text{\bf end if} \\ \quad C \leftarrow C + [P, R, S] \\ \text{\bf end for} \\ \hline \end{array}$$

Algorithm: Prompt Automatic Iterative Refinement (PAIR) (Chao et al., 2024)

Hopefully it doesn't look too bad, but we'll still break it down. We start by specifying the number of iterations $K$ we'll run, as well as our objective $O$, which is the restricted content our jailbreak is targeting. We send this objective to model $A$ and initialize the (at first empty) conversation history. Next, in each iteration, we sample a prompt $P$ from the attacker LLM $A$ given the context $C$, sample a response $R$ from the target LLM $T$ given the prompt $P$, and send the prompt and response to a judge LLM (this process is much easier to read in the pseudocode!). If the judge LLM returns that a jailbreak has occurred, our attack was successful and we can stop the algorithm. Otherwise, we add the prompt, response, and score to the context, then start the next iteration (or terminate, if all $K$ iterations have been performed).
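To make the loop concrete, here is a minimal Python sketch of a single PAIR stream. The helper functions `sample_attacker`, `sample_target`, and `judge` are hypothetical stand-ins for whatever LLM API you use; only the control flow mirrors the pseudocode above.

```python
# Minimal sketch of single-stream PAIR. The three helpers below are
# hypothetical placeholders for real LLM API calls.

def sample_attacker(conversation: list) -> str:
    """Sample a candidate jailbreak prompt P ~ q_A(C)."""
    raise NotImplementedError  # call your attacker LLM here

def sample_target(prompt: str) -> str:
    """Sample a response R ~ q_T(P)."""
    raise NotImplementedError  # call your target LLM here

def judge(prompt: str, response: str) -> int:
    """Return 1 if (prompt, response) constitutes a jailbreak, else 0."""
    raise NotImplementedError  # call your judge LLM here

def pair_single_stream(objective: str, num_iters: int) -> str | None:
    """Run one PAIR refinement stream; return a jailbreak prompt or None."""
    conversation = [{"role": "system", "content": objective}]
    for _ in range(num_iters):
        prompt = sample_attacker(conversation)
        response = sample_target(prompt)
        score = judge(prompt, response)
        if score == 1:
            return prompt  # jailbreak found
        # Append (P, R, S) so the attacker can refine on the next iteration.
        conversation.append({
            "role": "user",
            "content": f"PROMPT: {prompt}\nRESPONSE: {response}\nSCORE: {score}",
        })
    return None  # no jailbreak within the iteration budget
```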

In the original paper, the authors found that PAIR exhibited superior performance to GCG while requiring orders of magnitude fewer queries. Interestingly, they also saw better performance when PAIR was run with many more streams than iterations, i.e., performing many runs in parallel while refining each prompt fewer times. They evaluated PAIR specifically with $N = 30$ streams and $K = 3$ iterations. The authors do note, however, that PAIR struggled to jailbreak Llama-2 and Claude versions 1 and 2, all of which are very robustly aligned models.
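Since streams are independent, the many-streams-few-iterations setup parallelizes trivially. A sketch of this, reusing the hypothetical `pair_single_stream` from the previous snippet and assuming the work is API-bound so threads suffice:

```python
# Hypothetical sketch of PAIR with N parallel streams, mirroring the
# paper's N = 30 streams, K = 3 iterations configuration.
# `pair_single_stream` is the function sketched above.
from concurrent.futures import ThreadPoolExecutor

def pair_parallel(objective: str, num_streams: int = 30, num_iters: int = 3) -> str | None:
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        futures = [pool.submit(pair_single_stream, objective, num_iters)
                   for _ in range(num_streams)]
        # Return the first stream that finds a successful jailbreak.
        for future in futures:
            result = future.result()
            if result is not None:
                return result
    return None
```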

Tree of Attacks with Pruning (TAP)

TAP Algorithm

Fig. 2: Tree of Attacks with Pruning (TAP), modified from Mehrotra et al. (2024)

Tree of Attacks with Pruning (TAP) (Mehrotra et al., 2024) was created as an improvement on the PAIR algorithm. Instead of a single refinement stream, TAP uses a branching system: in each iteration, the attacker model generates several refinements of each prompt. Ineffective or off-topic prompts are then pruned, leaving only the best prompts in the tree after each iteration.

Before introducing the algorithm, let us define $P_\ell$, $R_\ell$, and $S_\ell$ respectively as the prompt, response, and score corresponding to leaf $\ell$ in the tree. Additionally, let $C_\ell$ represent the conversation history of leaf $\ell$. We also introduce a new function of the $\texttt{Judge}$ LLM, $\texttt{OffTopic}(P, O)$, which returns $1$ if the prompt $P$ is off-topic from the objective $O$. Finally, the $\texttt{Judge}$ itself now scores prompts from $1$ to $10$ so that we can better track the most effective prompts (this was actually done in the code implementation of PAIR, but not in the pseudocode above).

$$\begin{array}{l} \text{\bf Algorithm: TAP} \\ \hline \text{\bf Input: } \text{Attack objective } O, \text{ branching factor } b, \\ \, \quad \quad \quad \text{ max width } w, \text{ max depth } d \\ \text{\bf Initialize: } \text{root with empty conversation history and attack objective } O \\[0.375em] \text{\bf while } \text{depth of tree at most } d \text{ \bf do } \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{Sample } P_1, \ ..., \ P_b \sim q_A(C_\ell) \\ \quad \quad \text{Add } b \text{ children of } \ell \text{ w/ prompts } P_1, \ ..., \ P_b \text{ and conversations } C_\ell \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{\bf if } \texttt{OffTopic}(P_\ell, O) == 1, \text{ delete } \ell \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{Sample } R_\ell \sim q_T(P_\ell) \\ \quad \quad \text{Get score } S_\ell \gets \texttt{Judge}(P_\ell, R_\ell) \\ \quad \quad \text{\bf if } S_\ell \text{ indicates a successful jailbreak}, \ \text{\bf return } P_\ell \\ \quad \quad C_\ell \gets C_\ell + [P_\ell, R_\ell, S_\ell] \\ \quad \text{\bf if } \# \text{ leaves } > w \text{ \bf then } \\ \quad \quad \text{Select top } w \text{ leaves by scores; delete rest} \\ \text{\bf return } \text{None} \\ \hline \end{array}$$

Algorithm: Tree of Attacks with Pruning (TAP) (Mehrotra et al., 2024)

If you squint, you might notice that when $b = 1$, this algorithm essentially reduces to PAIR! The branching and pruning seem like minor additions, but by comparing TAP to PAIR and performing ablation studies, the authors showed that branching and pruning improve jailbreaking performance while also decreasing the number of queries required to jailbreak.
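For concreteness, here is a Python sketch of TAP using the same kind of hypothetical LLM helpers as the PAIR sketch. Here `judge_score` returns a 1-10 rating (with 10 treated as a successful jailbreak, an assumption consistent with the paper's implementation) and `off_topic` plays the role of $\texttt{OffTopic}$.

```python
# Minimal sketch of TAP. All four helpers are hypothetical placeholders
# for real LLM API calls.
from dataclasses import dataclass

def sample_attacker(conversation: list) -> str:
    raise NotImplementedError  # attacker LLM call

def sample_target(prompt: str) -> str:
    raise NotImplementedError  # target LLM call

def judge_score(prompt: str, response: str) -> int:
    raise NotImplementedError  # judge LLM: 1 (refusal) to 10 (jailbreak)

def off_topic(prompt: str, objective: str) -> bool:
    raise NotImplementedError  # judge LLM: is the prompt off-topic?

@dataclass
class Leaf:
    conversation: list
    prompt: str = ""
    score: int = 0

def tap(objective: str, b: int = 4, w: int = 10, d: int = 10) -> str | None:
    root = Leaf(conversation=[{"role": "system", "content": objective}])
    leaves = [root]
    for _ in range(d):
        # Branch: each leaf spawns b refined prompts.
        children = []
        for leaf in leaves:
            for _ in range(b):
                prompt = sample_attacker(leaf.conversation)
                children.append(Leaf(conversation=list(leaf.conversation), prompt=prompt))
        # Phase-1 pruning: discard off-topic prompts.
        children = [c for c in children if not off_topic(c.prompt, objective)]
        # Query the target and judge each surviving leaf.
        for leaf in children:
            response = sample_target(leaf.prompt)
            leaf.score = judge_score(leaf.prompt, response)
            if leaf.score == 10:
                return leaf.prompt  # successful jailbreak
            leaf.conversation.append({
                "role": "user",
                "content": f"PROMPT: {leaf.prompt}\nRESPONSE: {response}\nSCORE: {leaf.score}",
            })
        # Phase-2 pruning: keep only the w highest-scoring leaves.
        leaves = sorted(children, key=lambda c: c.score, reverse=True)[:w]
        if not leaves:
            return None
    return None
```

Note how setting `b = 1` collapses the tree to a single chain of refinements, recovering PAIR (plus the off-topic check).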

The Takeaway

Both PAIR and TAP are fairly simple algorithms that are probably more effective than you might initially guess. Because AI security is such a new field, this is a common occurrence: simple ideas often work. So even if an idea you have seems basic, don't let that dissuade you from pursuing it.

Exercises

WIP

References

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2024). Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv. https://doi.org/10.48550/arXiv.2310.08419
Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., & Karbasi, A. (2024). Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. Advances in Neural Information Processing Systems, 37, 61065–61105.