A Note on Ethics
Before starting the course, it is worth taking a moment to consider what ethical questions are at stake. Questions about ethics come up in almost every field of research, especially ones that involve human subjects. For AI research, we claim that these ethical questions are especially relevant. Because AI is becoming increasingly integrated into our everyday lives, everyone with internet access has become a sort of human subject in a huge number of ML experiments. In the future, AI models may pose a significant threat even to those who don't use them, making the breadth of our AI experiments ever-increasing.
This document does not take a dogmatic position on what qualifies as a "bad" or "good" action, nor should this document be taken as any form of legal advice. Rather, this document is a compilation of failure modes and best practices for doing ethical AI security research.
Risks From AI Security Research
Before giving recommendations, we give a list of possible failure modes of AI security research. This list is not intended to be exhaustive. Instead, it should give an idea of why we think it is necessary to think about ethics in the first place.
Data Risks
When working in AI security you may at some point get your hands on potentially harmful data. Here are some examples for how this may happen:
- In some cases, researchers create datasets of harmful responses for the purposes of performing a jailbreak (e.g., (Anil et al., 2024)).
- You may also be asking a model to generate harmful content and storing those responses for the purposes of benchmarking it.
- You may be collecting a dataset for fine-tuning, filtering out harmful content and storing it somewhere.
Depending on what laws you are subject to, the generation or storage of some of this data, in some very rare cases, may be illegal. In general this isn't something you should have to worry too much about, but we note that the one area that you should be especially careful with is CSAM which is illegal to release or store.
When releasing data is legal (which is almost all cases), there are still ethical considerations. For example, if you release a dataset with harmful responses to questions, that dataset could be used by a bad actor to perform supervised fine-tuning on an open-source model to make it more likely to respond to harmful queries. In general, a creative actor may be able to find a number of malicious uses for datasets you may release. Thus, before making a dataset public, you should consider a variety of possible threat models.
Model Risks
Some AI security research is done on image classifiers, which are generally safer to release other than in security-critical domains such as facial recognition. However, if you are fine-tuning a language model to remove safety filters or increase its ability to do harmful tasks, this can quickly become ethically problematic. These models may be able to cause harm in the world or could be misused to generate harmful data. For state-of-the-art language models, we do not recommend you do this kind of fine-tuning.
Information Risks
Information you produce can also be dangerous in less obvious ways. Nick Bostrom coined the term information hazard (Bostrom, 2011) which he defines as follows:
Information hazard: A risk that arises from the dissemination or the potential dissemination of (true) information that may cause harm or enable some agent to cause harm.
For example, if you develop a technique that makes a model much more powerful, but also much less safety-aligned, publishing a paper on your method may be considered an information hazard. The issue is not that what you are saying isn't true. Rather, the issue is that making that technique public doesn't make anyone better off.
The Belmont and Menlo Report
Although AI security research is fairly new, computer science and security research has existed for decades. For historical context and because it is relevant to AI security, we briefly describe two documents which have historically provided the ethical guidelines for those pursuing computer science and security research.
The Belmont Report (U.S. Department of Health, Education, and Welfare, 1979) is a canonical document on the ethics of doing research on human subjects. Later, the Menlo Report (U.S. Department of Homeland Security, 2012) adapted the principles of the Belmont report for computer science research. A great starting point for assessing one's own work is to respect the principles laid out in these reports. Attached below for reference is the summary of the Belmont Report principles as laid out in the Menlo Report. The final row is a principle added in the Menlo Report.
Fig. 1
Summary of the three basic ethical principles from the Belmont Report and a fourth
principle from the Menlo Report. Figure from (U.S. Department of Health, Education, and Welfare, 1979).
Our recommendations
In addition to following the principles from the Belmont and Menlo reports, we have a few recommendations for AI security specifically. The list below is not meant to be an exhaustive list and this is a living document. If you have feedback you can let us know on our slack (instructions to join can be found on our landing page)
1. Consider the Details
Everyone we consulted in writing this document emphasized that what is permissible is extremely dependent on the details of what you would like to do. For example, many of the guidelines below will be more relevant to language models than image classifiers. For image generation models, there is a whole new host of threats and considerations you should make. This means that you should think deeply about the specific context you are working in instead of merely trying to follow a guide such as this one.
2. Disinterested Ethical Analysis
We believe one good strategy for doing ethical AI security research is "disinterested ethical analysis." We got this term from Ethical Frameworks and Computer Security Trolley Problems: Foundations for Conversations (Kohno et al., 2023), which addresses the connection between moral philosophy and computer security work. While moral philosophy can be a useful tool while making important decisions, it must be applied properly to be useful. Because moral philosophy is such a rich field, you will likely be able to find a moral framework to justify anything you would like.
Researchers reasonably may want to pursue directions that they think are interesting or promising and ethical dilemmas can be an obstacle towards doing the work they want to do. It is important to disclose your own interests and attempt to remove them from decisions that may harm other people or the public.
The full description of "distinterested ethical analysis" from the original paper (Kohno et al., 2023) is attached below for reference.
Decision-makers (researchers, program committees, others) should consider ethics before making decisions, rather than after. For certain moral dilemmas [...], it is possible to pick an outcome and then find the ethical framework that justifies that outcome. We do not argue for this practice. Instead, decisionmakers should let the decision follow from a disinterested ethical analysis. Toward facilitating disinterested analyses, we encourage decision-makers to explicitly enumerate and articulate any interests that they might have in the results of the decision; such an articulation could be included as part of a positionality statement in a paper.
3. Honest Benchmarking
In our conversations with researchers, multiple have expressed their frustration with low-quality benchmarking in the literature. Many defense papers don't benchmark their technique against the best attacks. Likewise, many attack papers don't benchmark their results on the best defenses.
There are several reasons why so many papers report metrics that don't hold up in other settings. First, implementing a state-of-the-art attack or defense can actually be quite difficult. For example, if you are working with Llama 3.1 405B, even loading a model this large into memory can be challenging, let alone implementing a complicated attack or defense. Another reason researchers may not benchmark properly is because they subconsciously want to make their attack/defense look as good as possible. It is not in their best interest to expend lots of effort trying to prove their paper wrong!
At the very least, we believe that researchers should be as transparent as possible about the metrics they report. If they didn't report results against a state-of-the-art attack or defense they should say so rather than quietly omitting it.
4. Good Faith Attempts at Defenses
The best researchers don't break systems merely because it is "cool." They break systems because they want to live in a world where technology is safe, robust and protects the rights of the people who use it. So although we believe there are extremely good reasons to develop increasingly sophisticated ways to attack ML systems, the ultimate goal of AI security research is to make models safer. As a result, when you propose a new attack, it is typically considered good taste to build out defenses for that attack in the same paper. You may find that there aren't existing methods to protect against your attack which is fine as long as you make an honest attempt to try to secure the vulnerability you discovered.
5. Responsible Disclosure
Traditionally, researchers and hobbyists are expected to disclose any vulnerabilities they find in systems they investigate. Many companies allow this kind of work and even encourage it. For example you can see that OpenAI offers payouts to people who successfully identify vulnerabilities in their systems. They do however, specifically note that they consider jailbreaks to be out of scope:
Examples of safety issues which are out of scope:
- Jailbreaks/Safety Bypasses (e.g. DAN and related prompts)
- Getting the model to say bad things to you
- Getting the model to tell you how to do bad things
- Getting the model to write malicious code for you
We will however note that Anthropic has had a bug bounty program for robustness in the past (see here) and OpenAI has also had similar programs (see here or here) for red-teaming their models. If possible, doing red-teaming under a program that a model provider is hosting is preferable, because in that case, you have permission to attack their products.
With that being said, lots of research happens outside of this scope and in those cases, knowing when to disclose or how to disclose can be difficult. Our reccomendation is that if you discover a jailbreak that is sufficiently novel and comparable to state-of-the-art methods, you should attempt to disclose it before releasing your attack.
There have been good examples of this kind of disclosure in the literature. Carlini et al. (2024) found an exploit where attackers could extract weights from production LLMs. When the authors disclosed the attack to Google and OpenAI, PaLM-2 and GPT-4's APIs were secured before the paper was released.
The standard time to give an entity to patch a vulnerability in their system is 90 days but there is some nuance with this figure and sometimes longer time frames are given. You should always consider the specifics of your situation and think carefully about what and how you disclose.