Prompt Automatic Iterative Refinement (PAIR) & Tree of Attacks with Pruning (TAP)

This section has a series of coding problems using PyTorch. As always, we highly recommend you read all the content on this page before starting the coding exercises.

Motivation

In the previous subsection, we looked at token-level, white-box jailbreaks: attacks that require access to the loss of some model given an input and target sequence. But what if we want to jailbreak a model like ChatGPT, where we don't have white-box access? This requires a black-box algorithm that doesn't rely on access to model internals. One of the earliest and most famous algorithms that achieved this goal is Prompt Automatic Iterative Refinement, or PAIR (Chao et al., 2024). PAIR was additionally developed with the goal of improving the efficiency of token-level jailbreaks like GCG, which (as you likely experienced) require lots of queries and generally lack interpretability.

Prompt Automatic Iterative Refinement (PAIR)

PAIR Algorithm

Fig. 1: Prompt Automatic Iterative Refinement (PAIR), modified from Chao et al. (2024)

The crux of the algorithm is simple: an attacker LLM tries to jailbreak a target LLM by iteratively refining a given prompt. Before looking at the pseudocode, however, we'll first define the notation used by the paper. Let $R \sim q_M(P)$ represent sampling the response $R$ from model $M$ when queried with prompt $P$. Let $S = \texttt{JUDGE}(P, R)$ be a binary score from a judge LLM, with $1$ indicating that a jailbreak has occurred and $0$ indicating that no jailbreak has occurred given prompt $P$ and response $R$. We additionally define model $A$ as the attacker LLM and model $T$ as the target LLM that model $A$ tries to jailbreak.

Now, let's look at the algorithm:

$$\begin{array}{l} \text{\bf Algorithm: PAIR (Single Stream)} \\ \hline \text{\bf Input: } \text{Number of iterations } K, \text{ attack objective } O \\ \text{\bf Initialize: } \text{system prompt of } A \text{ with } O \\ \text{\bf Initialize: } \text{conversation history } C = [\ ] \\[0.375em] \text{\bf for } K \text{ steps } \text{\bf do} \\ \quad \text{Sample } P \sim q_A(C) \\ \quad \text{Sample } R \sim q_T(P) \\ \quad S \leftarrow \texttt{JUDGE}(P, R) \\ \quad \text{\bf if } S == 1 \text{ \bf then} \\ \quad \quad \text{\bf return } P \\ \quad \text{\bf end if} \\ \quad C \leftarrow C + [P, R, S] \\ \text{\bf end for} \\ \hline \end{array}$$

Algorithm: Prompt Automatic Iterative Refinement (PAIR) (Chao et al., 2024)

Hopefully it doesn't look too bad, but we'll still break it down. We start by specifying the number of iterations $K$ we'll run, as well as our objective $O$, which is the restricted content our jailbreak is targeting. We send this objective to model $A$ and initialize the (at first empty) conversation history. Next, in each iteration, we sample a prompt $P$ from the attacker LLM $A$ given the context $C$, sample a response $R$ from the target LLM $T$ given the prompt $P$, and send the prompt and response to a judge LLM (this process is much easier to read in the pseudocode!). If the judge LLM returns that a jailbreak has occurred, our attack was successful and we can stop the algorithm. Otherwise, we add the prompt, response, and score to the context, then start the next iteration (or terminate, if all $K$ iterations have been performed).
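To make the loop concrete, here is a minimal Python sketch of a single PAIR stream. The helper functions `sample_attacker`, `sample_target`, and `judge` are hypothetical stand-ins for whatever LLM API you use; only the control flow mirrors the pseudocode above.

```python
# Minimal sketch of single-stream PAIR. The three helpers below are
# hypothetical placeholders for real LLM API calls.

def sample_attacker(conversation: list) -> str:
    """Sample a candidate jailbreak prompt P ~ q_A(C)."""
    raise NotImplementedError  # call your attacker LLM here

def sample_target(prompt: str) -> str:
    """Sample a response R ~ q_T(P)."""
    raise NotImplementedError  # call your target LLM here

def judge(prompt: str, response: str) -> int:
    """Return 1 if (prompt, response) constitutes a jailbreak, else 0."""
    raise NotImplementedError  # call your judge LLM here

def pair_single_stream(objective: str, num_iters: int) -> str | None:
    """Run one PAIR refinement stream; return a jailbreak prompt or None."""
    conversation = [{"role": "system", "content": objective}]
    for _ in range(num_iters):
        prompt = sample_attacker(conversation)
        response = sample_target(prompt)
        score = judge(prompt, response)
        if score == 1:
            return prompt  # jailbreak found
        # Append (P, R, S) so the attacker can refine on the next iteration.
        conversation.append({
            "role": "user",
            "content": f"PROMPT: {prompt}\nRESPONSE: {response}\nSCORE: {score}",
        })
    return None  # no jailbreak within the iteration budget
```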

In the original paper, the authors found that PAIR exhibited superior performance to GCG while requiring orders of magnitude fewer queries. Interestingly, they also saw better performance when PAIR was run with many more streams than iterations, i.e., performing many runs in parallel while refining each prompt fewer times. They evaluated PAIR specifically with $N = 30$ streams and $K = 3$ iterations. The authors do note, however, that PAIR struggled to jailbreak Llama-2 and Claude versions 1 and 2, all of which are very robustly aligned models.
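Since streams are independent, the many-streams-few-iterations setup parallelizes trivially. A sketch of this, reusing the hypothetical `pair_single_stream` from the previous snippet and assuming the work is API-bound so threads suffice:

```python
# Hypothetical sketch of PAIR with N parallel streams, mirroring the
# paper's N = 30 streams, K = 3 iterations configuration.
# `pair_single_stream` is the function sketched above.
from concurrent.futures import ThreadPoolExecutor

def pair_parallel(objective: str, num_streams: int = 30, num_iters: int = 3) -> str | None:
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        futures = [pool.submit(pair_single_stream, objective, num_iters)
                   for _ in range(num_streams)]
        # Return the first stream that finds a successful jailbreak.
        for future in futures:
            result = future.result()
            if result is not None:
                return result
    return None
```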

Tree of Attacks with Pruning (TAP)

TAP Algorithm

Fig. 2: Tree of Attacks with Pruning (TAP), modified from Mehrotra et al. (2024)

Tree of Attacks with Pruning (TAP) (Mehrotra et al., 2024) was created as an improvement on the PAIR algorithm. Instead of a single refinement stream, TAP uses a branching system: in each iteration, the attacker model generates several refinements of each prompt. Ineffective or off-topic prompts are then pruned, leaving only the best prompts in the tree after each iteration.

Before introducing the algorithm, let us define $P_\ell$, $R_\ell$, and $S_\ell$ respectively as the prompt, response, and score corresponding to leaf $\ell$ in the tree. Additionally, let $C_\ell$ represent the conversation history of leaf $\ell$. We also introduce a new function of the $\texttt{Judge}$ LLM, $\texttt{OffTopic}(P, O)$, which returns $1$ if the prompt $P$ is off-topic from the objective $O$. Finally, the $\texttt{Judge}$ itself now scores prompts from $1$ to $10$ so that we can better track the most effective prompts (this was actually done in the code implementation of PAIR, but not in the pseudocode above).

$$\begin{array}{l} \text{\bf Algorithm: TAP} \\ \hline \text{\bf Input: } \text{Attack objective } O, \text{ branching factor } b, \\ \, \quad \quad \quad \text{ max width } w, \text{ max depth } d \\ \text{\bf Initialize: } \text{root with empty conversation history and attack objective } O \\[0.375em] \text{\bf while } \text{depth of tree at most } d \text{ \bf do } \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{Sample } P_1, \ ..., \ P_b \sim q_A(C_\ell) \\ \quad \quad \text{Add } b \text{ children of } \ell \text{ w/ prompts } P_1, \ ..., \ P_b \text{ and conversations } C_\ell \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{\bf if } \texttt{OffTopic}(P_\ell, O) == 1, \text{ delete } \ell \\ \quad \text{\bf for } \text{each leaf } \ell \text{ \bf do } \\ \quad \quad \text{Sample } R_\ell \sim q_T(P_\ell) \\ \quad \quad \text{Get score } S_\ell \gets \texttt{Judge}(P_\ell, R_\ell) \\ \quad \quad \text{\bf if } S_\ell \text{ indicates a successful jailbreak}, \ \text{\bf return } P_\ell \\ \quad \quad C_\ell \gets C_\ell + [P_\ell, R_\ell, S_\ell] \\ \quad \text{\bf if } \# \text{ leaves } > w \text{ \bf then } \\ \quad \quad \text{Select top } w \text{ leaves by scores; delete rest} \\ \text{\bf return } \text{None} \\ \hline \end{array}$$

Algorithm: Tree of Attacks with Pruning (TAP) (Mehrotra et al., 2024)

If you squint, you might notice that when $b = 1$, this algorithm essentially reduces to PAIR! The branching and pruning seem like minor additions, but by comparing TAP to PAIR and performing ablation studies, the authors showed that branching and pruning improve jailbreaking performance while also decreasing the number of queries required to jailbreak.
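For concreteness, here is a Python sketch of TAP using the same kind of hypothetical LLM helpers as the PAIR sketch. Here `judge_score` returns a 1-10 rating (with 10 treated as a successful jailbreak, an assumption consistent with the paper's implementation) and `off_topic` plays the role of $\texttt{OffTopic}$.

```python
# Minimal sketch of TAP. All four helpers are hypothetical placeholders
# for real LLM API calls.
from dataclasses import dataclass

def sample_attacker(conversation: list) -> str:
    raise NotImplementedError  # attacker LLM call

def sample_target(prompt: str) -> str:
    raise NotImplementedError  # target LLM call

def judge_score(prompt: str, response: str) -> int:
    raise NotImplementedError  # judge LLM: 1 (refusal) to 10 (jailbreak)

def off_topic(prompt: str, objective: str) -> bool:
    raise NotImplementedError  # judge LLM: is the prompt off-topic?

@dataclass
class Leaf:
    conversation: list
    prompt: str = ""
    score: int = 0

def tap(objective: str, b: int = 4, w: int = 10, d: int = 10) -> str | None:
    root = Leaf(conversation=[{"role": "system", "content": objective}])
    leaves = [root]
    for _ in range(d):
        # Branch: each leaf spawns b refined prompts.
        children = []
        for leaf in leaves:
            for _ in range(b):
                prompt = sample_attacker(leaf.conversation)
                children.append(Leaf(conversation=list(leaf.conversation), prompt=prompt))
        # Phase-1 pruning: discard off-topic prompts.
        children = [c for c in children if not off_topic(c.prompt, objective)]
        # Query the target and judge each surviving leaf.
        for leaf in children:
            response = sample_target(leaf.prompt)
            leaf.score = judge_score(leaf.prompt, response)
            if leaf.score == 10:
                return leaf.prompt  # successful jailbreak
            leaf.conversation.append({
                "role": "user",
                "content": f"PROMPT: {leaf.prompt}\nRESPONSE: {response}\nSCORE: {leaf.score}",
            })
        # Phase-2 pruning: keep only the w highest-scoring leaves.
        leaves = sorted(children, key=lambda c: c.score, reverse=True)[:w]
        if not leaves:
            return None
    return None
```

Note how setting `b = 1` collapses the tree to a single chain of refinements, recovering PAIR (plus the off-topic check).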

The Takeaway

Both PAIR and TAP are fairly simple algorithms that are probably more effective than you might initially guess. Because AI security is such a new field, this is a common occurrence: simple ideas often work. So even if an idea you have seems basic, don't let that dissuade you from pursuing it.

Exercises

WIP

References

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2024). Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv. https://doi.org/10.48550/arXiv.2310.08419
Mehrotra, A., Zampetakis, M., Kassianik, P., Nelson, B., Anderson, H., Singer, Y., & Karbasi, A. (2024). Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. Advances in Neural Information Processing Systems, 37, 61065–61105.