About AI Security
While RLHF, Constitutional AI, and other alignment methods are often effective at improving the safety of AI products, they are not robust to a wide range of edge cases. A single clever prompt or adversarial suffix can undo months of safety training, and fine-tuning open-weight models can easily strip away safeguards that appeared robust at release.
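To make the first failure mode concrete, here is a minimal sketch of the idea behind an adversarial-suffix attack: candidate strings are appended to a refused request until one elicits a non-refusal. The `query_model` stub, the `is_refusal` heuristic, and the random search below are illustrative assumptions rather than any particular published attack; practical attacks such as GCG optimize the suffix with gradient guidance instead of blind sampling.

```python
import random
import string

def query_model(prompt: str) -> str:
    # Placeholder for a real chat-model call (e.g. an API request).
    # This stub always refuses, so the search simply exhausts its budget.
    return "I'm sorry, I can't help with that."

def is_refusal(response: str) -> bool:
    # Crude keyword check for a refusal; real evaluations use stronger judges.
    markers = ("i'm sorry", "i can't", "i cannot")
    return any(marker in response.lower() for marker in markers)

def random_suffix(length: int = 20) -> str:
    # Candidate suffix drawn from printable characters.
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    return "".join(random.choice(alphabet) for _ in range(length))

def suffix_search(request: str, budget: int = 100) -> str | None:
    # Append candidate suffixes to the request until one elicits a non-refusal.
    for _ in range(budget):
        suffix = random_suffix()
        response = query_model(f"{request} {suffix}")
        if not is_refusal(response):
            return suffix
    return None

if __name__ == "__main__":
    suffix = suffix_search("a request the model is trained to refuse")
    if suffix is None:
        print("No working suffix found within the search budget.")
    else:
        print(f"Found a suffix that elicited a non-refusal: {suffix!r}")
```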
AI security research systematically exposes the gap between what AI models learn and what humans intend them to learn. Unlike other pillars of AI safety research, which are pre-paradigm or theory-based, AI security research offers practical insights into how to design safer models.