FGSM & PGD

This section has a series of coding problems using PyTorch. As always, we highly recommend you read all the content on this page before starting the coding exercises.

Creating Adversarial Images

This segment of the course focuses on generating adversarial samples to fool a convolutional neural network trained on the CIFAR-10 dataset. We will cover the Fast Gradient Sign Method (FGSM), its iterative variant (BIM), and Projected Gradient Descent (PGD). The Carlini-Wagner attack (Carlini & Wagner, 2017) and black box methods such as the Square Attack (Andriushchenko et al., 2020) will be covered in later pages as well.

Distances

There are a few metrics for calculating the 'distance' between the original image and the perturbed one, the most notable being $L_0$, $L_2$, and $L_\infty$. $L_0$ counts the number of non-matching pixels and is easy to calculate. $L_2$ is the typical norm used in linear algebra: the Euclidean distance between the two images viewed as vectors. $L_\infty$ measures the maximum perturbation applied to any single pixel of the original image.

For our attacks we will use $L_\infty$ because it is simple, cheap to calculate, and historically conventional for the kinds of attacks we are performing. It is also intuitive: a single dramatically changed pixel (for example, green to pink) would be easy to spot. Bounding the $L_\infty$ metric, for example by keeping every change within $8/255$, is typically enough to prevent our adversarial images from becoming suspicious.
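As a concrete sketch (the function names here are our own), these three distances can be computed on PyTorch tensors as follows:

```python
import torch

def l0_distance(x: torch.Tensor, x_adv: torch.Tensor) -> int:
    # Number of values (pixels) that differ between the two images
    return int((x != x_adv).sum())

def l2_distance(x: torch.Tensor, x_adv: torch.Tensor) -> float:
    # Euclidean norm of the flattened difference vector
    return torch.linalg.vector_norm(x - x_adv).item()

def linf_distance(x: torch.Tensor, x_adv: torch.Tensor) -> float:
    # Largest absolute change to any single value
    return (x - x_adv).abs().max().item()
```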

FGSM

Fast gradient sign method (FGSM) (Goodfellow et al., 2015) is a simple approach for generating adversarial samples quickly. While efficient, it is a single-step attack, so it succeeds less often than the iterative methods covered below.

Creating an adversarial image with FGSM takes only a few steps. Using the same loss function used to train the model, compute the loss on the input image. Then, calculate the gradient of that loss with respect to the input pixels. Finally, adjust the original image by a small step in the direction of the sign of that gradient.

x' = x + \epsilon \cdot \mathrm{sign}(\nabla \mathrm{loss}_{F,t}(x))

Intuitively, you are moving the image in a direction which increases the loss, making the model less accurate.
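The steps above can be sketched in PyTorch as follows. This is a minimal, untargeted version; the function name is ours, and we assume pixel values live in $[0, 1]$:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """One-step FGSM: move x by eps in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # Step in the direction that increases the loss, then keep pixels valid
    x_adv = x + eps * grad.sign()
    return x_adv.clamp(0, 1).detach()
```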

BIM

The Basic Iterative Method (BIM) (Kurakin et al., 2017) applies the same sign-of-the-gradient step repeatedly, using a smaller step size $\alpha$ at each iteration and clipping the result so the total perturbation stays within $\epsilon$.

x'_i = \mathrm{clip}_\epsilon(x'_{i-1} + \alpha \cdot \mathrm{sign}(\nabla \mathrm{loss}_{F,t}(x'_{i-1})))

This should look similar to equations you have seen before for gradient descent, but instead of optimizing the weights of the model we are training, we are optimizing the input.
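A sketch of BIM in PyTorch (again with our own function name, assuming pixel values in $[0, 1]$):

```python
import torch

def bim_attack(model, loss_fn, x, y, eps, alpha, steps):
    """Iterative FGSM: small steps, clipped to an eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        # Project back into the L-infinity ball, then keep pixels valid
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```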

PGD

PGD is very similar to iterative FGSM, differing only in that it starts from a random perturbation within the $\epsilon$-ball rather than from the unperturbed image. PGD continues to be a standard approach in research today: it is relatively easy to implement and efficient, making it a useful benchmark adopted by many researchers to test model robustness.
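A PGD sketch in PyTorch makes the single difference from BIM explicit, namely the random initialization (function name and the $[0, 1]$ pixel range are our assumptions):

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps):
    """PGD: iterative FGSM started from a random point in the eps-ball."""
    # Random initialization is what distinguishes PGD from BIM
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        # Project back into the L-infinity ball, then keep pixels valid
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1).detach()
    return x_adv
```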

References

Andriushchenko, M., Croce, F., Flammarion, N., & Hein, M. (2020). Square Attack: a query-efficient black-box adversarial attack via random search. https://arxiv.org/abs/1912.00049
Carlini, N., & Wagner, D. (2017). Towards Evaluating the Robustness of Neural Networks. https://arxiv.org/abs/1608.04644
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. https://arxiv.org/abs/1412.6572
Kurakin, A., Goodfellow, I., & Bengio, S. (2017). Adversarial examples in the physical world. https://arxiv.org/abs/1607.02533