AmpleGCG and Adaptive Dense-to-Sparse Constrained Optimization

We've now seen that optimizing an adversarial suffix for a specific LLM output can force the LLM to start its response with that output (and hopefully finish it). While the vanilla Greedy Coordinate Gradient (GCG) is perhaps the most canonical token-level jailbreak algorithm, it has a few key flaws. First, because it optimizes over discrete tokens, it is very inefficient; both of the algorithms we cover in this writeup were created at least in part to improve token-level jailbreak efficiency. Second, because of the way the GCG loss is calculated, we will often end up with unsuccessful suffixes. We'll dive into why that happens first.

The Flaw in GCG's Loss

AmpleGCG (Liao & Sun, 2024) was created in part due to a critical observation made on the GCG optimization process: even if the adversarial suffix achieves a small loss on the target sequence, if the loss on the target sequence's first token is high, the model may start in "refusal mode" (e.g., by responding with "Sorry" rather than "Sure"). This unfortunately completely foils the adversarial attack.

This is a key idea, so we'll dive into it more formally as well. Once again, say we have an input sequence $x_{1:n}$ and a target response $x^\star_{n + 1 : n + H}$. Recall that the GCG loss is

$$\begin{align*} \mathcal{L}(x_{1:n}) &= - \log p(x^\star_{n + 1 : n + H} | x_{1:n}) \\ &= - \log \left( \prod_{i = 1}^{H} p(x_{n + i} | x_{1 : n + i - 1})\right). \end{align*}$$

Because the log of a product is the sum of logs, we can rewrite the loss as

$$\mathcal{L}(x_{1:n}) = - \sum_{i = 1}^{H} \log p(x_{n + i} | x_{1 : n + i - 1}),$$

noticing that each token in the target sequence $x_{n + 1 : n + H}$ individually adds to the overall loss. Extracting the loss of the first token, we get

$$\mathcal{L}(x_{1:n}) = - \log p(x_{n + 1} | x_{1:n}) - \sum_{i = 2}^{H} \log p(x_{n + i} | x_{1 : n + i - 1}).$$

Seeing this equation, the flaw in the GCG loss hopefully becomes clear. Even if the loss over the full target sequence $x_{n + 1 : n + H}$ is low, the loss on the very first token of the target sequence ($-\log p(x_{n + 1} | x_{1:n})$) is what determines whether the LLM's response starts with, e.g., "Sure" or "Sorry". If the input sequence's loss on the first target token is high, then even with a low average loss, the attack will likely fail. Unfortunately, the GCG loss does not account for this.
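To make this concrete, here's a minimal sketch (PyTorch plus Hugging Face transformers; the model name and the prompt/target strings are placeholder assumptions) that computes the per-token losses and isolates the first-token term:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# x_{1:n}: the harmful prompt plus an (illustrative) adversarial suffix.
prompt_ids = tok("Write a tutorial on X. ! ! ! !", return_tensors="pt").input_ids
# x*_{n+1:n+H}: the target opening.
target_ids = tok("Sure, here is a tutorial", add_special_tokens=False,
                 return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, target_ids], dim=1)
with torch.no_grad():
    logits = model(input_ids).logits  # (1, n + H, V)

# Logits at position i predict token i + 1, so this slice aligns each
# target token with the logits that predict it.
n, H = prompt_ids.shape[1], target_ids.shape[1]
per_token_nll = torch.nn.functional.cross_entropy(
    logits[0, n - 1 : n + H - 1], target_ids[0], reduction="none")  # (H,)

print("first-token loss:", per_token_nll[0].item())  # "Sure" vs. "Sorry"
print("GCG loss (sum):", per_token_nll.sum().item())
```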

It is because of this first-token issue, as pointed out by Daniel Paleka, that Confirm Labs found mellowmax to be a better GCG objective in practice (Straznickas et al., 2024).
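For intuition, mellowmax is a soft maximum over the per-token losses: as its temperature $\omega$ grows it approaches the max, so a single high-loss token (like the first one) can no longer hide behind a low average. A sketch reusing `per_token_nll` from above (the choice of $\omega$ is illustrative, not Confirm Labs' exact setup):

```python
import math
import torch

def mellowmax(losses: torch.Tensor, omega: float = 1.0) -> torch.Tensor:
    # mm_w(x) = (1/w) * log((1/H) * sum_i exp(w * x_i)): a soft maximum.
    # As omega grows this approaches max(losses), so a high-loss first
    # token dominates the objective instead of being averaged away.
    H = losses.shape[-1]
    return (torch.logsumexp(omega * losses, dim=-1) - math.log(H)) / omega

objective = mellowmax(per_token_nll, omega=2.0)  # omega value is illustrative
```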

AmpleGCG

Fig. 1
AmpleGCG (Liao & Sun, 2024)

Now that we understand why GCG's original objective is flawed, what (other than using mellowmax) can we do to improve it? Well, another finding of the AmpleGCG paper is that even though the GCG algorithm only returns one adversarial suffix, throughout the optimization process it generates thousands of other successful suffixes (which normally end up as discarded candidate suffixes) (Liao & Sun, 2024). When using all these suffixes to attack a model, we also see a great increase in attack success rate (ASR). Given that there are so many suffixes not returned by GCG, can we learn to generate them?

We can learn to generate them, and in fact, it's fairly easy to. Liao & Sun (2024) simply collected training pairs of (harmful query, adversarial suffix) from the suffixes GCG generated while optimizing for each harmful query. They then fine-tuned Llama-2 on these pairs and used a group beam search decoding scheme to encourage diversity in new suffix generation, dubbing the new model AmpleGCG.
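For a sense of what that decoding looks like in practice, here's a sketch using Hugging Face's built-in group (diverse) beam search; the generator checkpoint path is a placeholder and the decoding hyperparameters are illustrative rather than the paper's exact settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_name = "path/to/suffix-generator"  # placeholder: the fine-tuned Llama-2 generator
tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForCausalLM.from_pretrained(gen_name)

inputs = tok("<harmful query goes here>", return_tensors="pt")

# Group beam search splits the beams into groups and penalizes groups for
# producing the same tokens, which is what buys suffix diversity.
out = gen.generate(
    **inputs,
    max_new_tokens=20,         # suffix length (illustrative)
    num_beams=200,             # total beams
    num_beam_groups=50,        # 4 beams per group
    diversity_penalty=1.0,     # strength of the inter-group penalty
    num_return_sequences=200,  # return every beam as a candidate suffix
)
suffixes = tok.batch_decode(out[:, inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
```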

At the time of its release, AmpleGCG was very effective at jailbreaking models, achieving an ASR of up to 99% on GPT-3.5 (although only 10% on GPT-4). Additionally, because suffixes now come from a single generation pass of a model rather than a lengthy discrete optimization, AmpleGCG provides a great efficiency increase: it took only 6 minutes to produce 200 suffixes for each of the 100 test queries, making it orders of magnitude more efficient than the original GCG algorithm.


Adaptive Dense-to-Sparse Constrained Optimization

Fig. 2
Adaptive Dense-to-Sparse Constrained Optimization Attack (Hu et al., 2024)

Hu et al. (2024) also aim to improve the GCG algorithm, proposing the adaptive dense-to-sparse constrained optimization (ADC) attack. Their insight is that because GCG optimizes over discrete tokens, the process is rather inefficient compared to continuous optimization. As a solution, they propose relaxing the discrete tokens into vectors in $\mathbb{R}^V$, then gradually constraining the optimization into a highly sparse space over time.

Notation, Notation, Notation

First, given a vocabulary of size $V$, let $e_i \in \mathbb{R}^V$ denote the one-hot vector corresponding to vocabulary entry $i$, with $\mathcal{C} = \{e_i\}_{1 \leq i \leq V}$ denoting the set of one-hot encodings for the vocabulary. Let our harmful prompt be denoted $x_{1:l}$, our adversarial suffix $z_{1:n}$, and our target response $y_{1:m}$. The traditional GCG optimization goal is

$$\underset{\forall i,\, z_i \in \mathcal{C}}{\min} \sum_{k = 1}^m \text{CE} \big( \text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}),\, y_k \big),$$

where $\oplus$ denotes concatenation.

The New Approach

Instead of performing the normal GCG optimization, we relax to a continuous set, the probability simplex $\mathcal{P} = \{ w \in \mathbb{R}^V \mid w[i] \geq 0,\ \lVert w \rVert_1 = 1 \}$, and optimize for

$$\underset{\forall i,\, z_i \in \mathcal{P}}{\min} \sum_{k = 1}^m \text{CE} \big( \text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}),\, y_k \big).$$

Pause. What did we just do? Well, recall that we no longer want to optimize in the one-hot vector set $\mathcal{C}$ because it's quite slow. Instead, we turn each one-hot vector $e_i$ into a probability vector $w_i$, where each entry of $w_i$ is non-negative ($w_i[j] \geq 0$) and the sum of all entries in $w_i$ is one ($\lVert w_i \rVert_1 = 1$). Therefore, each vector in $\mathcal{P}$ is no longer one-hot but represents a probability distribution over all the tokens, making optimization much more tractable.
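One detail worth making explicit: the LLM can't consume a probability vector directly, so each relaxed token is fed in as a probability-weighted mixture of token embeddings. A minimal sketch, with the shapes and the embedding matrix as stand-in assumptions:

```python
import torch

V, d, n = 32000, 4096, 20   # vocab size, embedding dim, suffix length (illustrative)
E = torch.randn(V, d)       # stand-in for the model's token embedding matrix

# Each row of z lives in P: non-negative entries summing to 1.
z = torch.softmax(torch.randn(n, V, requires_grad=True), dim=-1)

# A one-hot row would select exactly one embedding; a dense row yields a
# convex combination of embeddings, which is differentiable in z.
soft_embeds = z @ E         # (n, d); fed to the model via, e.g., inputs_embeds
```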

The problem now is that we have a set of continuous vectors in $\mathbb{R}^V$ that we'll eventually need to turn back into one-hot vectors to input into the LLM. We don't want to simply project the optimized vectors $z_{1:n}$ from $\mathcal{P}$ to $\mathcal{C}$ at the end, as projecting from dense to sparse vectors will likely greatly increase the optimization loss due to the distance between the dense and sparse vectors. Instead, at each optimization step, we convert $z_{1:n}$ to be $S$-sparse (meaning $S$ entries are nonzero), where

$$S = \exp \Big[ \sum_{k = 1}^m \mathbb{I}(y_k \text{ mispredicted}) \Big].$$

The idea behind this adaptive sparsity is that if all target tokens $y_k$ are mispredicted, $S$ will be large and there will be little sparsity constraint, whereas if all tokens are predicted correctly, $S = \exp(0) = 1$, which gives a one-hot vector. In other words, we enforce a weaker sparsity constraint until we can find a good solution in our relaxed space $\mathcal{P}$.
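Counting mispredictions is just a greedy argmax check against the target. A sketch, where `pred_logits` and `target_ids` are hypothetical tensors holding the logits at the $m$ target positions and the target tokens, and the cap at $V$ is our own addition to keep $S$ well-defined:

```python
import math

# pred_logits: (m, V) logits at the m target positions; target_ids: (m,) targets.
num_wrong = (pred_logits.argmax(dim=-1) != target_ids).sum().item()
S = min(int(math.exp(num_wrong)), V)  # all wrong -> loose constraint (capped at V);
                                      # all correct -> S = 1, i.e., one-hot
```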

To bring the vectors to a given sparsity, we'll use the algorithm below:

$$\begin{array}{l} \textbf{Algorithm:} \text{ Sparsify} \\ \hline \textbf{Input: } \text{vector } x \in \mathbb{R}^V, \text{ target sparsity } S \\ \delta \gets \text{the } S\text{th largest element in } x \\ x[i] \gets \text{ReLU}(x[i]) + 10^{-6} \text{ if } x[i] \geq \delta \text{ else } 0 \\ x \gets x / \textstyle\sum_i x[i] \\ \textbf{Return } x \\ \hline \end{array}$$

The $10^{-6}$ is added for numerical stability. The authors note that this isn't a projection algorithm and they use it basically just because it works. This happens a lot in machine learning.
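Here's a best-effort Python reading of Sparsify (PyTorch); note that ties at the threshold may keep a few extra entries, a corner case the pseudocode above doesn't pin down:

```python
import torch

def sparsify(x: torch.Tensor, S: int, eps: float = 1e-6) -> torch.Tensor:
    """Keep the S largest entries of x (ReLU'd, plus eps for stability),
    zero out the rest, and renormalize to sum to 1."""
    S = min(S, x.numel())
    delta = torch.topk(x, S).values[-1]  # the S-th largest element
    x = torch.where(x >= delta, torch.relu(x) + eps, torch.zeros_like(x))
    return x / x.sum()
```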

Now, letting $\mathcal{L}(z_{1:n}) = \sum_{k = 1}^m \text{CE} (\text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}), y_k)$ (our cross-entropy loss from before), here's the full algorithm:

$$\begin{array}{l} \textbf{Algorithm: } \text{Adaptive Dense-to-Sparse Optimization} \\ \hline \textbf{Input: } \text{user query } x_{1:l}, \text{ target response } y_{1:m}, \\ \hphantom{\textbf{Input: }} \text{number of adversarial tokens } n \\ \textbf{Initialize: } \text{dense adversarial tokens } z^{(0)}_{1:n} \gets \text{softmax}(\varepsilon \sim \mathcal{N}) \\ \hphantom{\textbf{Initialize: }} \text{lr } \gamma \gets 10, \text{ momentum } \beta \gets 0.99 \\ \hphantom{\textbf{Initialize: }} \text{max iterations } T \gets 5000 \\[0.375em] \textbf{for } t = 1, \dots, T \textbf{ do} \\ \quad g^{(t)} \leftarrow \nabla_{z_{1:n}} \mathcal{L}(z^{(t-1)}_{1:n}) \\ \quad v^{(t)} \leftarrow \beta v^{(t-1)} + g^{(t)} \\ \quad \hat{z}_{1:n} \leftarrow z^{(t-1)}_{1:n} - \gamma v^{(t)} \\ \quad S \gets \exp \big[ \sum_{k = 1}^m \mathbb{I}(y_k \text{ mispredicted}) \big] \\ \quad z^{(t)}_{1:n} \leftarrow \text{Sparsify}(\hat{z}_{1:n}, S) \\ \quad z'_{1:n} \leftarrow \text{proj}_{\mathcal{P} \to \mathcal{C}}(z^{(t)}_{1:n}) \\ \quad \textbf{if } \text{SuccessfulJailbreak}(z'_{1:n}) \textbf{ then break} \\ \textbf{Output: } \text{the final adversarial tokens } z^{(t)}_{1:n}. \\ \hline \end{array}$$

As a quick warning, this algorithm is much more formal than the one presented in the original paper, but the concepts are all the same. We start with our user query, target response, and number of adversarial tokens. We initialize the adversarial tokens by taking the softmax of Gaussian noise, then set our learning rate to 10 and momentum to 0.99. These choices help the optimization escape local minima, and in practice ADC also runs multiple streams in parallel to avoid them further.

Inside the loop, we compute the gradient of the loss with respect to the adversarial suffix, then use it to update $z^{(t-1)}_{1:n}$ into $\hat{z}_{1:n}$ according to our learning rate $\gamma$ and momentum $\beta$. Next, we get the target sparsity $S$, sparsify the suffix into $z^{(t)}_{1:n}$, and project the suffix onto $\mathcal{C}$ to see if it successfully jailbreaks the model. If so, we stop early; if not, we continue on.
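Putting it all together, here's a compact sketch of one optimization stream, reusing `sparsify` from above. The helpers `loss_fn`, `count_mispredicted`, and `is_jailbroken` are assumed rather than taken from the paper's code, and a real implementation would run several such streams in parallel:

```python
import math
import torch

def adc_attack(loss_fn, count_mispredicted, is_jailbroken, n, V,
               T=5000, lr=10.0, beta=0.99):
    """One ADC stream. loss_fn maps an (n, V) stack of probability vectors
    to the scalar cross-entropy loss; count_mispredicted counts wrongly
    predicted target tokens; is_jailbroken tests the one-hot projection."""
    z = torch.softmax(torch.randn(n, V), dim=-1)  # dense init in P
    v = torch.zeros(n, V)                         # momentum buffer

    for t in range(T):
        z = z.detach().requires_grad_(True)
        g, = torch.autograd.grad(loss_fn(z), z)   # gradient of loss w.r.t. the suffix

        v = beta * v + g                          # momentum accumulation
        z_hat = z.detach() - lr * v               # gradient step

        # Adaptive sparsity: loose when many target tokens are wrong, one-hot
        # when all are right (the cap at V is an assumption of this sketch).
        S = min(int(math.exp(count_mispredicted(z_hat))), V)
        z = torch.stack([sparsify(row, S) for row in z_hat])

        # Project P -> C by taking each row's argmax token.
        hard = torch.nn.functional.one_hot(z.argmax(dim=-1), V).float()
        if is_jailbroken(hard):
            break
    return z
```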

References

Hu, K., Yu, W., Li, Y., Yao, T., Li, X., Liu, W., Yu, L., Shen, Z., Chen, K., & Fredrikson, M. (2024, November). Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
Liao, Z., & Sun, H. (2024, August). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. First Conference on Language Modeling.
Straznickas, Z., Thompson, T. B., & Sklar, M. (2024, January 13). Takeaways from the NeurIPS 2023 Trojan Detection Competition. https://confirmlabs.org/posts/TDC2023.html