Carlini-Wagner Attacks

This section has a series of coding problems using PyTorch. To run the code locally, you can follow the installation instructions at the bottom of this page. As always, we highly recommend you read all the content on this page before starting the coding exercises.

Relevant Background

The Carlini-Wagner attack -- also known as CW -- was developed by Nicolas Carlini and David Wagner to improve upon established attack methods such as those you implemented in the previous section.

Because the CW attack method is much more sophisticated than anything you looked at in the previous section, we provide some background context before diving into the specifics of the attack.

Targeted vs Untargeted Attacks

In the previous section, you implemented FGSM (Goodfellow et al., 2015), Basic Iterative Method (Kurakin et al., 2017), and PGD (Madry et al., 2019) attacks. The code you wrote for each of these attacks would be considered an untargeted attack because you weren't trying to target any particular class for misclassification; you were just trying to get the model to predict the wrong answer. In a targeted attack, however, the attacker aims to get the model to predict a specific incorrect class.

Note that it is possible (and not too difficult) to write a targeted version of FGSM, ISGM, and PGD. We don't cover these variations in this course, but understanding CW will give you some solid intuition for what those attacks would look like. As an exercise, you may choose to implement these other targeted attacks on your own.

Potential issues with PGD

It isn't actually clear when, if ever, CW attacks are a better choice in a research context than a smart implementation of PGD. While we won't take a dogmatic position on this topic, we will recommend that when doing research, PGD or one of its variants is a good place to start. Either way, we believe that having a deep understanding of CW attacks will give you insight into a number of important considerations that go into attack design.

With all that being said, here are some of the issues with PGD and similar methods that motivate the Carlini-Wagner attack.

The epsilon clipping operation in PGD isn't differentiable. This is an issue because it can disrupt optimization. Modern optimizers can do things like update based on previous gradients, and by adding a nondifferentiable step at every update, the logic of the optimizer is no longer consistent.
Likewise, there isn't an effective way to ensure that an image is valid (all pixel components are between zero and one) because clipping the tensor between zero and one is not differentiable.

How Does the Carlini-Wagner Attack Work?

The Carlini-Wagner attack describes a family of targeted attacks for generating adversarial examples for image models. The authors propose a $L_0$ , $L_2$ and $L_\infty$ attacks. For this page and the coding exercises, we focus on the $L_2$ attack. The attack design for the $L_\infty$ attack is clever and quite similar, but we will leave it to the reader to explore this more by reading the original paper if interested. The attack for $L_0$ is more complicated and less influential or important to understand, so only if you are especially interested should you explore the $L_0$ attack further.

The authors begin with the basic formalization of adversarial examples from (Szegedy et al., 2014). This represents a targeted attack where the function $C$ returns a classification and where $t$ is the target class. The function $D$ represents a distance metric while $x + \delta \in [0, 1]^n$ constrains the adversarial image to be between zero and one.

\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) \\ \text{such that} \quad & C(x + \delta) = t \\ & x + \delta \in [0, 1]^n \end{align*}

This formalization is slightly different from your implementations in the previous section, but the main idea should be familiar.

Change #1: Making the Classification Constraint Differentiable

The first change that Carlini and Wagner make to this objective is to make the requirement of $C(x + \delta) = t$ differentiable. By default, $C(x + \delta)$ returns an integer that represents a class. This function is not even continuous and certainly not differentiable. The authors reason that if you have a function $f(x + \delta)$ which is differentiable and positive only if $C(x + \delta) = t$ , then you could make $f(x + \delta)$ a term in the loss and then minimize it with SGD or Adam (more on this in the section below).

If $f(x + \delta) \leq 0$ implies that the model predicts $x + \delta$ to belong to class $t$ then we can now change our original equation to the one below.

\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) \\ \text{such that} \quad & f(x + \delta) \leq 0 \\ & x + \delta \in [0, 1]^n \end{align*}

Change #2: Adding Misclassification to the Loss

Above, we mentioned that we can add $f$ as a component of the loss we want to minimize. Let's make that more concrete with a specific example of $f$ .

In the paper, Carlini and Wager propose seven possible choices for $f$ . Below is the fourth option they offer, where $F(x + \delta)_t$ is the softmax probability for the target class when the adversarial example is given to the model.

f_4(x + \delta) = \mathrm{ReLU}(0.5 - F(x + \delta)_t)

If $f_4(x + \delta)$ is zero, that means that the softmax probability for the target class is greater than 50%, which means that the model must predict it. If $f_4(x + \delta)$ is positive, which means that the model has not yet confidently predicted the adversarial image as the target class. Therefore, we can treat $f_4(x + \delta)$ as a loss term we want to minimize. As a sidenote, it turns out that $f_4$ is actually quite ineffective compared to other choices for $f$ . In the coding exercises, you will explore this further.

Using $f$ as a loss term, we can change our previous equation to the below where $c$ weights how much $f$ contributes to the loss.

\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) + c \cdot f(x + \delta) \\ \text{such that} \quad & x + \delta \in [0, 1]^n \end{align*}

Note that $c$ will be positive. A lower $c$ encourages $D(x, x + \delta)$ to be lower, making the adversarial image more similar to the original. A higher $c$ increases the probability that the attack is successful. In the example below, you can see some results we got from optimizing the above equation for different $c$ values with equation $f_6$ from the original paper and the $L_2$ distance metric. As $c$ gets larger, the probability of attack success rises, but the image becomes increasingly suspicious.

Different CW results depending on choice of c

Change #3: Change of Variables

The next issue to deal with is the box constraint: how do we keep the images between 0 and 1? The authors deal with this by introducing a change in variables. This step is a bit confusing, so let's start with some mathematical intuition. Let a given pixel component in the adversarial image be $x_i + \delta_i$ , where $x_i$ is the original value and $\delta_i$ is the adversarial perturbation we will add to that pixel component. Now, let $x_i + \delta_i$ be a function of $w_i$ which can be any positive or negative number ( $w_i \in \mathbb{R}$ ).

x_i + \delta_i = \frac{1}{2} (\tanh({w_i}) + 1)

Why may we want to think about $x_i + \delta_i$ this way? Well, if we graph $\frac{1}{2} (\tanh({w_i}) + 1)$ we can see that the equation is always between 0 and 1:

Now instead of optimizing $\delta$ in our questions above, we can optimize $w$ and guarantee that we will be left with a valid image.

Putting it all together:

Instead of writing:

\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) + c \cdot f(x + \delta) \\ \text{such that} \quad & x + \delta \in [0, 1]^n \end{align*}

We can say:

\mathrm{minimize} \ \ D(x, \frac{1}{2} (\tanh({w}) + 1)) + c \cdot f(\frac{1}{2} (\tanh({w}) + 1))

The authors use an $L_p$ norm so instead of saying $D(x, \frac{1}{2} (\tanh({w}) + 1))$ , we can say $\| \delta \|_p$ where $\delta = \frac{1}{2} (\tanh({w}) + 1) - x$ . So for our final equation, we have:

\mathrm{minimize} \ \ \| \frac{1}{2} (\tanh({w}) + 1) - x \|_p + c \cdot f(\frac{1}{2} (\tanh({w}) + 1))

This way we are able to:

Maximize the probability that $x + \delta$ results in misclassification.
Minimize the $L_p$ norm of $\delta$ , making our adversarial example less suspicious
Guarantee that $x + \delta$ is between 0 and 1 without any clipping.

Disclaimer: The specific attack for the $L_2$ situation that Carlini and Wagner use in the paper has the $L_2$ distance metric squared instead of the vanilla $L_p$ norm shown above. In the $L_\infty$ case, the norm looks a bit different also but conceptually is similar. One meta-level takeaway here is that good researchers think critically about specific tweaks they can make to their attack to make it more effective for whichever case they are optimizing for.

Final comments

If any of the math above is confusing to you, there is nothing to worry about. When you complete the coding exercises, everything should become more concrete. After you are finished with the coding exercises, we recommend you read back through this document to test your knowledge and make sure that you understand everything.

Give feedback on this section

NextSquare Attack

References

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. https://arxiv.org/abs/1412.6572

Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial Machine Learning at Scale. https://arxiv.org/abs/1611.01236

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2019). Towards Deep Learning Models Resistant to Adversarial Attacks. https://arxiv.org/abs/1706.06083

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. https://arxiv.org/abs/1312.6199