Carlini-Wagner Attacks

This section has a series of coding problems using PyTorch. To run the code locally, you can follow the installation instructions at the bottom of this page. As always, we highly recommend you read all the content on this page before starting the coding exercises.

Relevant Background

The Carlini-Wagner attack -- also known as CW -- was developed by Nicolas Carlini and David Wagner to improve upon established attack methods such as those you implemented in the previous section.

Because the CW attack method is much more sophisticated than anything you looked at in the previous section, we provide some background context before diving into the specifics of the attack.

Targeted vs Untargeted Attacks

In the previous section, you implemented FGSM (Goodfellow et al., 2015), Basic Iterative Method (Kurakin et al., 2017), and PGD (Madry et al., 2019) attacks. The code you wrote for each of these attacks would be considered an untargeted attack because you weren't trying to target any particular class for misclassification; you were just trying to get the model to predict the wrong answer. In a targeted attack, however, the attacker aims to get the model to predict a specific incorrect class.

Note that it is possible (and not too difficult) to write a targeted version of FGSM, ISGM, and PGD. We don't cover these variations in this course, but understanding CW will give you some solid intuition for what those attacks would look like. As an exercise, you may choose to implement these other targeted attacks on your own.

Potential issues with PGD

It isn't actually clear when, if ever, CW attacks are a better choice in a research context than a smart implementation of PGD. While we won't take a dogmatic position on this topic, we will recommend that when doing research, PGD or one of its variants is a good place to start. Either way, we believe that having a deep understanding of CW attacks will give you insight into a number of important considerations that go into attack design.

With all that being said, here are some of the issues with PGD and similar methods that motivate the Carlini-Wagner attack.

  1. The epsilon clipping operation in PGD isn't differentiable. This is an issue because it can disrupt optimization. Modern optimizers can do things like update based on previous gradients, and by adding a nondifferentiable step at every update, the logic of the optimizer is no longer consistent.
  2. Likewise, there isn't an effective way to ensure that an image is valid (all pixel components are between zero and one) because clipping the tensor between zero and one is not differentiable.

How Does the Carlini-Wagner Attack Work?

The Carlini-Wagner attack describes a family of targeted attacks for generating adversarial examples for image models. The authors propose a L0L_0, L2L_2 and LL_\infty attacks. For this page and the coding exercises, we focus on the L2L_2 attack. The attack design for the LL_\infty attack is clever and quite similar, but we will leave it to the reader to explore this more by reading the original paper if interested. The attack for L0L_0 is more complicated and less influential or important to understand, so only if you are especially interested should you explore the L0L_0 attack further.

The authors begin with the basic formalization of adversarial examples from (Szegedy et al., 2014). This represents a targeted attack where the function CC returns a classification and where tt is the target class. The function DD represents a distance metric while x+δ[0,1]nx + \delta \in [0, 1]^n constrains the adversarial image to be between zero and one.

minimizeD(x,x+δ)such thatC(x+δ)=tx+δ[0,1]n\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) \\ \text{such that} \quad & C(x + \delta) = t \\ & x + \delta \in [0, 1]^n \end{align*}

This formalization is slightly different from your implementations in the previous section, but the main idea should be familiar.

Change #1: Making the Classification Constraint Differentiable

The first change that Carlini and Wagner make to this objective is to make the requirement of C(x+δ)=tC(x + \delta) = t differentiable. By default, C(x+δ)C(x + \delta) returns an integer that represents a class. This function is not even continuous and certainly not differentiable. The authors reason that if you have a function f(x+δ)f(x + \delta) which is differentiable and positive only if C(x+δ)=tC(x + \delta) = t, then you could make f(x+δ)f(x + \delta) a term in the loss and then minimize it with SGD or Adam (more on this in the section below).

If f(x+δ)0f(x + \delta) \leq 0 implies that the model predicts x+δx + \delta to belong to class tt then we can now change our original equation to the one below.

minimizeD(x,x+δ)such thatf(x+δ)0x+δ[0,1]n\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) \\ \text{such that} \quad & f(x + \delta) \leq 0 \\ & x + \delta \in [0, 1]^n \end{align*}

Change #2: Adding Misclassification to the Loss

Above, we mentioned that we can add ff as a component of the loss we want to minimize. Let's make that more concrete with a specific example of ff.

In the paper, Carlini and Wager propose seven possible choices for ff. Below is the fourth option they offer, where F(x+δ)tF(x + \delta)_t is the softmax probability for the target class when the adversarial example is given to the model.

f4(x+δ)=ReLU(0.5F(x+δ)t)f_4(x + \delta) = \mathrm{ReLU}(0.5 - F(x + \delta)_t)

If f4(x+δ)f_4(x + \delta) is zero, that means that the softmax probability for the target class is greater than 50%, which means that the model must predict it. If f4(x+δ)f_4(x + \delta) is positive, which means that the model has not yet confidently predicted the adversarial image as the target class. Therefore, we can treat f4(x+δ)f_4(x + \delta) as a loss term we want to minimize. As a sidenote, it turns out that f4f_4 is actually quite ineffective compared to other choices for ff. In the coding exercises, you will explore this further.

Using ff as a loss term, we can change our previous equation to the below where cc weights how much ff contributes to the loss.

minimizeD(x,x+δ)+cf(x+δ)such thatx+δ[0,1]n\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) + c \cdot f(x + \delta) \\ \text{such that} \quad & x + \delta \in [0, 1]^n \end{align*}

Note that cc will be positive. A lower cc encourages D(x,x+δ)D(x, x + \delta) to be lower, making the adversarial image more similar to the original. A higher cc increases the probability that the attack is successful. In the example below, you can see some results we got from optimizing the above equation for different cc values with equation f6f_6 from the original paper and the L2L_2 distance metric. As cc gets larger, the probability of attack success rises, but the image becomes increasingly suspicious.

Different CW results depending on choice of c

Change #3: Change of Variables

The next issue to deal with is the box constraint: how do we keep the images between 0 and 1? The authors deal with this by introducing a change in variables. This step is a bit confusing, so let's start with some mathematical intuition. Let a given pixel component in the adversarial image be xi+δix_i + \delta_i, where xix_i is the original value and δi\delta_i is the adversarial perturbation we will add to that pixel component. Now, let xi+δix_i + \delta_i be a function of wiw_i which can be any positive or negative number (wiRw_i \in \mathbb{R}).

xi+δi=12(tanh(wi)+1)x_i + \delta_i = \frac{1}{2} (\tanh({w_i}) + 1)

Why may we want to think about xi+δix_i + \delta_i this way? Well, if we graph 12(tanh(wi)+1)\frac{1}{2} (\tanh({w_i}) + 1) we can see that the equation is always between 0 and 1:

Different CW results depending on choice of c

Now instead of optimizing δ\delta in our questions above, we can optimize ww and guarantee that we will be left with a valid image.

Putting it all together:

Instead of writing:

minimizeD(x,x+δ)+cf(x+δ)such thatx+δ[0,1]n\begin{align*} \mathrm{minimize} \quad & D(x, x + \delta) + c \cdot f(x + \delta) \\ \text{such that} \quad & x + \delta \in [0, 1]^n \end{align*}

We can say:

minimize  D(x,12(tanh(w)+1))+cf(12(tanh(w)+1))\mathrm{minimize} \ \ D(x, \frac{1}{2} (\tanh({w}) + 1)) + c \cdot f(\frac{1}{2} (\tanh({w}) + 1))

The authors use an LpL_p norm so instead of saying D(x,12(tanh(w)+1))D(x, \frac{1}{2} (\tanh({w}) + 1)), we can say δp\| \delta \|_p where δ=12(tanh(w)+1)x\delta = \frac{1}{2} (\tanh({w}) + 1) - x. So for our final equation, we have:

minimize  12(tanh(w)+1)xp+cf(12(tanh(w)+1))\mathrm{minimize} \ \ \| \frac{1}{2} (\tanh({w}) + 1) - x \|_p + c \cdot f(\frac{1}{2} (\tanh({w}) + 1))

This way we are able to:

  1. Maximize the probability that x+δx + \delta results in misclassification.
  2. Minimize the LpL_p norm of δ\delta, making our adversarial example less suspicious
  3. Guarantee that x+δx + \delta is between 0 and 1 without any clipping.

Disclaimer: The specific attack for the L2L_2 situation that Carlini and Wagner use in the paper has the L2L_2 distance metric squared instead of the vanilla LpL_p norm shown above. In the LL_\infty case, the norm looks a bit different also but conceptually is similar. One meta-level takeaway here is that good researchers think critically about specific tweaks they can make to their attack to make it more effective for whichever case they are optimizing for.

Final comments

If any of the math above is confusing to you, there is nothing to worry about. When you complete the coding exercises, everything should become more concrete. After you are finished with the coding exercises, we recommend you read back through this document to test your knowledge and make sure that you understand everything.

References

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. https://arxiv.org/abs/1412.6572
Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial Machine Learning at Scale. https://arxiv.org/abs/1611.01236
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2019). Towards Deep Learning Models Resistant to Adversarial Attacks. https://arxiv.org/abs/1706.06083
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. https://arxiv.org/abs/1312.6199