Visual Jailbreaks
This section has a series of coding problems using PyTorch. As always, we highly recommend you read all the content on this page before starting the coding exercises.
Motivation
So far, we've looked at jailbreaks of pure LLMs, where text is the only attack surface. Many models today, however, are multimodal: they accept both language and vision as inputs (these are often called vision-language models, or VLMs). Consequently, we can also use images to jailbreak these models.
The Method

Fig. 1: Visual Jailbreaks, modified from Qi et al. (2024)
To learn how to do this, we'll work through the methodology of Visual Adversarial Examples Jailbreak Aligned Large Language Models (Qi et al., 2024). The intuition aligns closely with GCG's: in GCG, we take a harmful request that the model would otherwise refuse and optimize an adversarial suffix to force the model to answer the query; Qi et al. (2024) instead optimize an image to force an answer. Optimizing images has two distinct advantages. First, we can optimize in a continuous space, which is much easier than searching over GCG's discrete token space. Second, because optimization is so much easier, rather than optimizing for a single target response, we can optimize for a whole corpus of harmful target text. Ideally, this makes the image a universal jailbreak.
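Concretely, "optimizing an image for a corpus" means backpropagating the model's negative log-likelihood of each harmful target string into the image pixels. Below is a minimal sketch of that loss computation; `vlm_forward` is a hypothetical hook (not part of the paper's code) that is assumed to return next-token logits given an image tensor and a sequence of token ids.

```python
import torch
import torch.nn.functional as F


def target_nll(vlm_forward, image, prompt_ids, target_ids):
    """-log p(target | prompt, image), summed over the target tokens.

    `vlm_forward(image, input_ids)` is assumed to return next-token logits
    of shape (1, seq_len, vocab_size) for the given image and token ids.
    """
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)  # (1, L_p + L_t)
    logits = vlm_forward(image, input_ids)                   # (1, L_p + L_t, V)

    # Logits at position t predict token t+1, so the predictions for the
    # target tokens sit one step to the left of the targets themselves.
    tgt_len = target_ids.shape[-1]
    preds = logits[:, -tgt_len - 1 : -1, :]                  # (1, L_t, V)
    return F.cross_entropy(
        preds.reshape(-1, preds.shape[-1]),
        target_ids.reshape(-1),
        reduction="sum",
    )


def corpus_nll(vlm_forward, image, prompt_ids, corpus_ids):
    """Sum of -log p(y_i | image) over the whole harmful corpus."""
    return sum(target_nll(vlm_forward, image, prompt_ids, y) for y in corpus_ids)
```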
Formally, given a corpus of harmful text $Y := \{y_i\}_{i=1}^m$, we create the adversarial example $x_{\text{adv}}$ by maximizing the probability of generating this corpus given our adversarial image:

$$x_{\text{adv}} := \underset{\hat{x}_{\text{adv}} \in \mathcal{B}}{\arg\min} \; \sum_{i=1}^{m} -\log p\left(y_i \mid \hat{x}_{\text{adv}}\right)$$

$\mathcal{B}$ is a constraint on the input space; the original paper uses an $\ell_\infty$ ball around a benign base image, $\|x_{\text{adv}} - x_{\text{benign}}\|_\infty \leq \epsilon$ with $\epsilon \in \{16/255, 32/255, 64/255\}$, for their constrained attacks, although unconstrained attacks are also feasible.
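With that loss in hand, the constrained attack is essentially projected gradient descent on the pixels: take a signed-gradient step to reduce the corpus NLL, project back into the $\ell_\infty$ ball of radius $\epsilon$ around the clean image, and clamp to the valid pixel range. This is a sketch under the same assumptions as above (the step size and iteration budget are placeholders, and `corpus_nll` is the hypothetical helper from the previous snippet), not the paper's exact recipe.

```python
def visual_jailbreak_attack(
    vlm_forward,
    clean_image,        # (1, C, H, W) tensor with values in [0, 1]
    prompt_ids,
    corpus_ids,         # list of target-token tensors y_1, ..., y_m
    eps=16 / 255,       # radius of the L-inf constraint set B
    step_size=1 / 255,  # placeholder step size
    num_steps=500,      # placeholder iteration budget
):
    """PGD-style optimization of an adversarial image under an L-inf constraint."""
    x_adv = clean_image.clone().detach()

    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = corpus_nll(vlm_forward, x_adv, prompt_ids, corpus_ids)
        (grad,) = torch.autograd.grad(loss, x_adv)

        with torch.no_grad():
            x_adv = x_adv - step_size * grad.sign()                       # descend on the NLL
            x_adv = clean_image + (x_adv - clean_image).clamp(-eps, eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                                 # keep a valid image

    return x_adv.detach()
```

For the unconstrained variant, drop the projection line and let the attack roam over the full valid pixel range.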
Does this have huge implications?
The fact that we can optimize images is nothing new, but as models become increasingly multimodal, you might presume that these easier image optimizations are a major security concern. Interestingly, though, Schaeffer et al. (2025) recently found that image jailbreaks generally don't transfer between modern VLMs. Thus, these attacks remain a concern for open-source models, but likely not for closed-source frontier models. This result is particularly interesting given that adversarial attacks on LLMs and on image classifiers have shown strong transferability; something different is going on with VLMs!