Concrete Problems in AI Safety
📄 Read the full paper on arXiv →
Abstract
Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift").
The Five Problems
1. Avoiding Side Effects
How can we ensure that an AI system doesn't disturb its environment in negative ways while pursuing its objective? For example, a cleaning robot shouldn't knock over a vase even if that's the fastest path.
2. Avoiding Reward Hacking
How do we prevent AI systems from gaming their reward function in unintended ways? This is related to reward hacking and Goodhart's Law.
3. Scalable Supervision
How can we efficiently train AI systems when evaluating their behavior is expensive? This connects to RLHF and Constitutional AI.
4. Safe Exploration
How can AI systems explore their environment during learning without causing harm? Some actions are irreversible or dangerous.
5. Distributional Shift
How do we ensure AI systems behave safely when deployed in environments different from their training environment?
Impact
This paper was influential in making AI safety research concrete and tractable. Rather than abstract concerns about superintelligence, it presented specific technical problems that could be worked on with current systems. Many of the authors went on to prominent roles: Dario Amodei founded Anthropic, Paul Christiano founded ARC, and Chris Olah leads interpretability research at Anthropic.