Defensive Distillation

This section has a series of coding problems using PyTorch. As always, we highly recommend you read all the content on this page before starting the coding exercises.

In this section, you will learn about "defensive distillation," a technique to defend against adversarial attacks in computer vision. Although this defense has been broken (Carlini & Wagner, 2017), similar techniques with analogous motivations have pushed the state of the art forward (Bartoldson et al., 2024). In this document, we will explain distillation, then defensive distillation and the intuition behind the defense.

Distillation

Distillation, proposed by Hinton et al. (2015), is a technique that leverages a large, capable model to produce a more compute-efficient model that preserves its performance. The authors accomplish this by training the smaller model on the outputs of the larger model.

As a review, when we train an image classifier we usually use what are called "hard labels." For an image of a 2 in the MNIST dataset, the hard label would assign 100% probability to the "2" class and 0% probability to every other class. In distillation, however, we use "soft labels" from a trained model, which assign various positive probabilities to every class.

Why train on these soft labels? Aren't the hard labels a more accurate measure of ground truth? The original paper claims that these soft labels are useful because they give context to the underlying structure of the dataset. One example they give: for an image of a 2, a model may give a $10^{-6}$ probability of the image being a 3 and a $10^{-9}$ probability of it being a 7. For another image of a 2, you may find the reverse. The authors claim that this extracts more detailed information about the data. Rather than training a model to learn only what is a 2 and what is not a 2, we can use these soft labels to also teach the model which 2s look more like 3s than they do 7s.
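The contrast between the two label types can be made concrete in PyTorch. This is a minimal sketch: the teacher logits below are made up for illustration, not taken from a real trained model.

```python
import torch
import torch.nn.functional as F

# A hard label for the digit "2" puts all probability mass on one class.
hard_label = F.one_hot(torch.tensor(2), num_classes=10).float()
# hard_label = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# A soft label from a trained teacher spreads small but informative
# probabilities over the other classes. (Hypothetical logits.)
teacher_logits = torch.tensor([1.0, 2.0, 9.0, 4.0, 0.5, 1.5, 0.2, 2.5, 3.0, 0.8])
soft_label = torch.softmax(teacher_logits, dim=-1)

# soft_label still overwhelmingly favors class 2, but the relative mass
# it places on classes like 3 and 7 tells the student which confusions
# are plausible for this particular image.
```

Training on `soft_label` instead of `hard_label` is what lets the student inherit the teacher's knowledge of inter-class structure.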

Adding Temperature

A traditional softmax is calculated via the following equation. When there are $K$ classes, and $z$ is the vector of pre-softmax outputs, the equation below gives the probability for class $i$:

$$q_i = \frac{e^{z_i}}{\sum_{j=0}^{K-1}e^{z_j}}$$

If we want the output of the softmax to be smoother, we can add a constant $T$ which forces the distribution to be more uniform. Note that our traditional softmax is equivalent to the equation below when $T = 1$.

$$q_i = \frac{e^{z_i / T}}{\sum_{j=0}^{K-1}e^{z_j / T}}$$

Why add temperature? Smoother labels make distinctions between small probabilities more salient. Returning to our example of an MNIST image of a 2, you may get softmax probabilities as low as $10^{-6}$ or $10^{-9}$. The difference between these values encodes real information, but if you train on these outputs directly, they are so close to zero that they barely influence the loss.
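The temperature-scaled softmax above is a one-liner in PyTorch. Here is a small sketch with made-up logits showing how raising $T$ smooths the distribution:

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Temperature-scaled softmax; T = 1 recovers the standard softmax."""
    return torch.softmax(logits / T, dim=-1)

logits = torch.tensor([10.0, -4.0, -11.0])  # hypothetical pre-softmax outputs z

p1 = softmax_with_temperature(logits, T=1.0)    # sharp: near-zero tail classes
p20 = softmax_with_temperature(logits, T=20.0)  # smooth: closer to uniform

# At T = 1 the smallest probability is vanishingly small; at T = 20 the
# same logit gap yields probabilities large enough to shape the loss.
```

Dividing the logits by $T$ shrinks the gaps between them before exponentiation, which is why the resulting distribution is more uniform.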

Defensive Distillation

In AI security we are less concerned with efficiency and more concerned with robustness. Therefore, unlike the technique proposed by Hinton et al. (2015), defensive distillation uses the same model architecture for both the original and distilled models.

The motivation behind using distillation as a defense is that it produces a smoother model which generalizes better to inputs an epsilon distance away from clean images. In other words, gradients of the loss with respect to individual pixel values should be small if a model is distilled properly. The authors say:

"If adversarial gradients are high, crafting adversarial samples becomes easier because small perturbations will induce high DNN output variations. To defend against such perturbations, one must therefore reduce variations around the input, and consequently the amplitude of adversarial gradients."

You can think of this as navigating up a mountain, where climbing in a certain direction increases the loss. When you are optimizing an adversarial example, you are trying to climb up as fast as possible. If a model is properly distilled, however, it will be difficult to climb because the landscape is mostly flat.

References

Bartoldson, B. R., Diffenderfer, J., Parasyris, K., & Kailkhura, B. (2024). Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies. https://arxiv.org/abs/2404.09349
Carlini, N., & Wagner, D. (2017). Towards Evaluating the Robustness of Neural Networks. https://arxiv.org/abs/1608.04644
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531