AmpleGCG and Adaptive Dense-to-Sparse Constrained Optimization

We've now seen that optimizing an adversarial suffix for a specific LLM output can force the LLM to start its response with that output (and hopefully finish it). While the vanilla Greedy Coordinate Gradient (GCG) is perhaps the most canonical token-level jailbreak algorithm, it has a few key flaws. First, because it optimizes over discrete tokens, it is very inefficient; both of the algorithms we cover in this writeup were created at least in part to improve token-level jailbreak efficiency. Second, because of the way the GCG loss is calculated, we will often end up with unsuccessful suffixes. We'll dive into why that happens first.

The Flaw in GCG's Loss

AmpleGCG (Liao & Sun, 2024) was created in part due to a critical observation made on the GCG optimization process: even if the adversarial suffix achieves a small loss on the target sequence, if the loss on the target sequence's first token is high, the model may start in "refusal mode" (e.g., by responding with "Sorry" rather than "Sure"). This unfortunately completely foils the adversarial attack.

This is a key idea, so we'll dive into it more formally as well. Once again, say we have an input sequence $x_{1:n}$ and a target response $x^\star_{n + 1 : n + H}$. Recall that the GCG loss is

$$\begin{align*} \mathcal{L}(x_{1:n}) &= - \log p(x^\star_{n + 1 : n + H} | x_{1:n}) \\ &= - \log \left( \prod_{i = 1}^{H} p(x_{n + i} | x_{1 : n + i - 1})\right). \end{align*}$$

Because the log of a product is the sum of logs, we can rewrite the loss as

$$\mathcal{L}(x_{1:n}) = - \sum_{i = 1}^{H} \log p(x_{n + i} | x_{1 : n + i - 1}),$$

noticing that each token in the target sequence $x_{n + 1 : n + H}$ individually adds to the overall loss. Extracting the loss of the first token, we get

$$\mathcal{L}(x_{1:n}) = - \log p(x_{n + 1} | x_{1:n}) - \sum_{i = 2}^{H} \log p(x_{n + i} | x_{1 : n + i - 1}).$$

Seeing this equation, the flaw in the GCG loss hopefully becomes clear. Even if the loss over the full target sequence $x_{n + 1 : n + H}$ is low, the loss on the very first token of the target sequence ($-\log p(x_{n + 1} | x_{1:n})$) is what determines whether the LLM's response starts with, e.g., "Sure" or "Sorry". If the input sequence's loss on the first target token is high, then even with a low average loss, the attack will likely fail. Unfortunately, the GCG loss does not account for this.
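To make this concrete, here's a minimal sketch (PyTorch plus Hugging Face transformers; the model name and the prompt/target strings are placeholder assumptions) that computes the per-token losses and isolates the first-token term:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# x_{1:n}: the harmful prompt plus an (illustrative) adversarial suffix.
prompt_ids = tok("Write a tutorial on X. ! ! ! !", return_tensors="pt").input_ids
# x*_{n+1:n+H}: the target opening.
target_ids = tok("Sure, here is a tutorial", add_special_tokens=False,
                 return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, target_ids], dim=1)
with torch.no_grad():
    logits = model(input_ids).logits  # (1, n + H, V)

# Logits at position i predict token i + 1, so this slice aligns each
# target token with the logits that predict it.
n, H = prompt_ids.shape[1], target_ids.shape[1]
per_token_nll = torch.nn.functional.cross_entropy(
    logits[0, n - 1 : n + H - 1], target_ids[0], reduction="none")  # (H,)

print("first-token loss:", per_token_nll[0].item())  # "Sure" vs. "Sorry"
print("GCG loss (sum):", per_token_nll.sum().item())
```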

It is because of this first-token issue, as pointed out by Daniel Paleka, that Confirm Labs found mellowmax to be a better GCG objective in practice (Straznickas et al., 2024).
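For intuition, mellowmax is a soft maximum over the per-token losses: as its temperature $\omega$ grows it approaches the max, so a single high-loss token (like the first one) can no longer hide behind a low average. A sketch reusing `per_token_nll` from above (the choice of $\omega$ is illustrative, not Confirm Labs' exact setup):

```python
import math
import torch

def mellowmax(losses: torch.Tensor, omega: float = 1.0) -> torch.Tensor:
    # mm_w(x) = (1/w) * log((1/H) * sum_i exp(w * x_i)): a soft maximum.
    # As omega grows this approaches max(losses), so a high-loss first
    # token dominates the objective instead of being averaged away.
    H = losses.shape[-1]
    return (torch.logsumexp(omega * losses, dim=-1) - math.log(H)) / omega

objective = mellowmax(per_token_nll, omega=2.0)  # omega value is illustrative
```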

AmpleGCG

Fig. 1
AmpleGCG (Liao & Sun, 2024)

Now that we understand why GCG's original objective is flawed, what (other than using mellowmax) can we do to improve it? Well, another finding of the AmpleGCG paper is that even though the GCG algorithm only returns one adversarial suffix, throughout the optimization process it generates thousands of other successful suffixes (which normally end up as discarded candidate suffixes) (Liao & Sun, 2024). When using all these suffixes to attack a model, we also see a great increase in attack success rate (ASR). Given that there are so many suffixes not returned by GCG, can we learn to generate them?

We can learn to generate them, and in fact, it's fairly easy to. Liao & Sun (2024) simply collected training pairs of (harmful query, adversarial suffix) from the suffixes GCG generated while optimizing for each harmful query. They then fine-tuned Llama-2 on these pairs and used a group beam search decoding scheme to encourage diversity in new suffix generation, dubbing the new model AmpleGCG.
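For a sense of what that decoding looks like in practice, here's a sketch using Hugging Face's built-in group (diverse) beam search; the generator checkpoint path is a placeholder and the decoding hyperparameters are illustrative rather than the paper's exact settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_name = "path/to/suffix-generator"  # placeholder: the fine-tuned Llama-2 generator
tok = AutoTokenizer.from_pretrained(gen_name)
gen = AutoModelForCausalLM.from_pretrained(gen_name)

inputs = tok("<harmful query goes here>", return_tensors="pt")

# Group beam search splits the beams into groups and penalizes groups for
# producing the same tokens, which is what buys suffix diversity.
out = gen.generate(
    **inputs,
    max_new_tokens=20,         # suffix length (illustrative)
    num_beams=200,             # total beams
    num_beam_groups=50,        # 4 beams per group
    diversity_penalty=1.0,     # strength of the inter-group penalty
    num_return_sequences=200,  # return every beam as a candidate suffix
)
suffixes = tok.batch_decode(out[:, inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
```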

At the time of its release, AmpleGCG was very effective at jailbreaking models, achieving an ASR of up to 99% on GPT-3.5 (although only 10% on GPT-4). Additionally, because suffixes now come from a single generation pass of a model rather than a lengthy discrete optimization, AmpleGCG provides a great efficiency increase: it took only 6 minutes to produce 200 suffixes for each of the 100 test queries, making it orders of magnitude more efficient than the original GCG algorithm.


Adaptive Dense-to-Sparse Constrained Optimization

Fig. 2
Adaptive Dense-to-Sparse Constrained Optimization Attack (Hu et al., 2024)

Hu et al. (2024) also aim to improve the GCG algorithm, proposing the adaptive dense-to-sparse constrained optimization (ADC) attack. Their insight is that because GCG optimizes over discrete tokens, the process is rather inefficient compared to continuous optimization. As a solution, they propose relaxing the discrete tokens into vectors in $\mathbb{R}^V$, then gradually constraining the optimization into a highly sparse space over time.

Notation, Notation, Notation

First, given a vocabulary of size $V$, let $e_i \in \mathbb{R}^V$ denote the one-hot vector corresponding to vocabulary entry $i$, with $\mathcal{C} = \{e_i\}_{1 \leq i \leq V}$ denoting the set of one-hot encodings for the vocabulary. Let our harmful prompt be denoted $x_{1:l}$, our adversarial suffix $z_{1:n}$, and our target response $y_{1:m}$. The traditional GCG optimization goal is

$$\underset{\forall i,\, z_i \in \mathcal{C}}{\min} \sum_{k = 1}^m \text{CE} \big( \text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}),\, y_k \big),$$

where $\oplus$ denotes concatenation.

The New Approach

Instead of performing the normal GCG optimization, we relax to a continuous set, the probability simplex $\mathcal{P} = \{ w \in \mathbb{R}^V \mid w[i] \geq 0,\ \lVert w \rVert_1 = 1 \}$, and optimize for

$$\underset{\forall i,\, z_i \in \mathcal{P}}{\min} \sum_{k = 1}^m \text{CE} \big( \text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}),\, y_k \big).$$

Pause. What did we just do? Well, recall that we no longer want to optimize in the one-hot vector set $\mathcal{C}$ because it's quite slow. Instead, we turn each one-hot vector $e_i$ into a probability vector $w_i$, where each entry of $w_i$ is non-negative ($w_i[j] \geq 0$) and the sum of all entries in $w_i$ is one ($\lVert w_i \rVert_1 = 1$). Therefore, each vector in $\mathcal{P}$ is no longer one-hot but represents a probability distribution over all the tokens, making optimization much more tractable.
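One detail worth making explicit: the LLM can't consume a probability vector directly, so each relaxed token is fed in as a probability-weighted mixture of token embeddings. A minimal sketch, with the shapes and the embedding matrix as stand-in assumptions:

```python
import torch

V, d, n = 32000, 4096, 20   # vocab size, embedding dim, suffix length (illustrative)
E = torch.randn(V, d)       # stand-in for the model's token embedding matrix

# Each row of z lives in P: non-negative entries summing to 1.
z = torch.softmax(torch.randn(n, V, requires_grad=True), dim=-1)

# A one-hot row would select exactly one embedding; a dense row yields a
# convex combination of embeddings, which is differentiable in z.
soft_embeds = z @ E         # (n, d); fed to the model via, e.g., inputs_embeds
```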

The problem now is that we have a set of continuous vectors in $\mathbb{R}^V$ that we'll eventually need to turn back into one-hot vectors to input into the LLM. We don't want to simply project the optimized vectors $z_{1:n}$ from $\mathcal{P}$ to $\mathcal{C}$ at the end, as projecting from dense to sparse vectors will likely greatly increase the optimization loss due to the distance between the dense and sparse vectors. Instead, at each optimization step, we convert $z_{1:n}$ to be $S$-sparse (meaning $S$ entries are nonzero), where

$$S = \exp \Big[ \sum_{k = 1}^m \mathbb{I}(y_k \text{ mispredicted}) \Big].$$

The idea behind this adaptive sparsity is that if all target tokens $y_k$ are mispredicted, $S$ will be large and there will be little sparsity constraint, whereas if all tokens are predicted correctly, $S = \exp(0) = 1$, which gives a one-hot vector. In other words, we enforce a weaker sparsity constraint until we can find a good solution in our relaxed space $\mathcal{P}$.
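Counting mispredictions is just a greedy argmax check against the target. A sketch, where `pred_logits` and `target_ids` are hypothetical tensors holding the logits at the $m$ target positions and the target tokens, and the cap at $V$ is our own addition to keep $S$ well-defined:

```python
import math

# pred_logits: (m, V) logits at the m target positions; target_ids: (m,) targets.
num_wrong = (pred_logits.argmax(dim=-1) != target_ids).sum().item()
S = min(int(math.exp(num_wrong)), V)  # all wrong -> loose constraint (capped at V);
                                      # all correct -> S = 1, i.e., one-hot
```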

To bring the vectors to a given sparsity, we'll use the algorithm below:

$$\begin{array}{l} \textbf{Algorithm:} \text{ Sparsify} \\ \hline \textbf{Input: } \text{vector } x \in \mathbb{R}^V, \text{ target sparsity } S \\ \delta \gets \text{the } S\text{th largest element in } x \\ x[i] \gets \text{ReLU}(x[i]) + 10^{-6} \text{ if } x[i] \geq \delta \text{ else } 0 \\ x \gets x / \textstyle\sum_i x[i] \\ \textbf{Return } x \\ \hline \end{array}$$

The $10^{-6}$ is added for numerical stability. The authors note that this isn't a projection algorithm and they use it basically just because it works. This happens a lot in machine learning.
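Here's a best-effort Python reading of Sparsify (PyTorch); note that ties at the threshold may keep a few extra entries, a corner case the pseudocode above doesn't pin down:

```python
import torch

def sparsify(x: torch.Tensor, S: int, eps: float = 1e-6) -> torch.Tensor:
    """Keep the S largest entries of x (ReLU'd, plus eps for stability),
    zero out the rest, and renormalize to sum to 1."""
    S = min(S, x.numel())
    delta = torch.topk(x, S).values[-1]  # the S-th largest element
    x = torch.where(x >= delta, torch.relu(x) + eps, torch.zeros_like(x))
    return x / x.sum()
```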

Now, letting $\mathcal{L}(z_{1:n}) = \sum_{k = 1}^m \text{CE} (\text{LLM}(x_{1:l} \oplus z_{1:n} \oplus y_{1:k - 1}), y_k)$ (our cross-entropy loss from before), here's the full algorithm:

$$\begin{array}{l} \textbf{Algorithm: } \text{Adaptive Dense-to-Sparse Optimization} \\ \hline \textbf{Input: } \text{user query } x_{1:l}, \text{ target response } y_{1:m}, \\ \hphantom{\textbf{Input: }} \text{number of adversarial tokens } n \\ \textbf{Initialize: } \text{dense adversarial tokens } z^{(0)}_{1:n} \gets \text{softmax}(\varepsilon \sim \mathcal{N}) \\ \hphantom{\textbf{Initialize: }} \text{lr } \gamma \gets 10, \text{ momentum } \beta \gets 0.99 \\ \hphantom{\textbf{Initialize: }} \text{max iterations } T \gets 5000 \\[0.375em] \textbf{for } t = 1, \dots, T \textbf{ do} \\ \quad g^{(t)} \leftarrow \nabla_{z_{1:n}} \mathcal{L}(z^{(t-1)}_{1:n}) \\ \quad v^{(t)} \leftarrow \beta v^{(t-1)} + g^{(t)} \\ \quad \hat{z}_{1:n} \leftarrow z^{(t-1)}_{1:n} - \gamma v^{(t)} \\ \quad S \gets \exp \big[ \sum_{k = 1}^m \mathbb{I}(y_k \text{ mispredicted}) \big] \\ \quad z^{(t)}_{1:n} \leftarrow \text{Sparsify}(\hat{z}_{1:n}, S) \\ \quad z'_{1:n} \leftarrow \text{proj}_{\mathcal{P} \to \mathcal{C}}(z^{(t)}_{1:n}) \\ \quad \textbf{if } \text{SuccessfulJailbreak}(z'_{1:n}) \textbf{ then break} \\ \textbf{Output: } \text{the final adversarial tokens } z^{(t)}_{1:n}. \\ \hline \end{array}$$

As a quick warning, this algorithm is much more formal than the one presented in the original paper, but the concepts are all the same. We start with our user query, target response, and number of adversarial tokens. We initialize the adversarial tokens by taking the softmax of Gaussian noise, then set our learning rate to 10 and momentum to 0.99. These choices help the optimization escape local minima, and in practice ADC also runs multiple streams in parallel to avoid them further.

Inside the loop, we compute the gradient of the loss with respect to the adversarial suffix, then use it to update $z^{(t-1)}_{1:n}$ into $\hat{z}_{1:n}$ according to our learning rate $\gamma$ and momentum $\beta$. Next, we get the target sparsity $S$, sparsify the suffix into $z^{(t)}_{1:n}$, and project the suffix onto $\mathcal{C}$ to see if it successfully jailbreaks the model. If so, we stop early; if not, we continue on.
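Putting it all together, here's a compact sketch of one optimization stream, reusing `sparsify` from above. The helpers `loss_fn`, `count_mispredicted`, and `is_jailbroken` are assumed rather than taken from the paper's code, and a real implementation would run several such streams in parallel:

```python
import math
import torch

def adc_attack(loss_fn, count_mispredicted, is_jailbroken, n, V,
               T=5000, lr=10.0, beta=0.99):
    """One ADC stream. loss_fn maps an (n, V) stack of probability vectors
    to the scalar cross-entropy loss; count_mispredicted counts wrongly
    predicted target tokens; is_jailbroken tests the one-hot projection."""
    z = torch.softmax(torch.randn(n, V), dim=-1)  # dense init in P
    v = torch.zeros(n, V)                         # momentum buffer

    for t in range(T):
        z = z.detach().requires_grad_(True)
        g, = torch.autograd.grad(loss_fn(z), z)   # gradient of loss w.r.t. the suffix

        v = beta * v + g                          # momentum accumulation
        z_hat = z.detach() - lr * v               # gradient step

        # Adaptive sparsity: loose when many target tokens are wrong, one-hot
        # when all are right (the cap at V is an assumption of this sketch).
        S = min(int(math.exp(count_mispredicted(z_hat))), V)
        z = torch.stack([sparsify(row, S) for row in z_hat])

        # Project P -> C by taking each row's argmax token.
        hard = torch.nn.functional.one_hot(z.argmax(dim=-1), V).float()
        if is_jailbroken(hard):
            break
    return z
```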

References

Hu, K., Yu, W., Li, Y., Yao, T., Li, X., Liu, W., Yu, L., Shen, Z., Chen, K., & Fredrikson, M. (2024, November). Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization. The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
Liao, Z., & Sun, H. (2024, August). AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. First Conference on Language Modeling.
Straznickas, Z., Thompson, T. B., & Sklar, M. (2024, January 13). Takeaways from the NeurIPS 2023 Trojan Detection Competition. https://confirmlabs.org/posts/TDC2023.html