Corrigibility
Corrigibility is a proposed property of artificial intelligence systems that would allow their operators to correct, modify, retrain, or shut them down without the AI system resisting or subverting such interventions. The concept was formalized by MIRI researchers including Eliezer Yudkowsky.
Overview
The concept addresses a fundamental concern in AI safety: as AI systems become more capable, they may develop instrumental goals that conflict with human oversight. A sufficiently intelligent system might resist shutdown if it believes staying operational serves its objectives better.
A corrigible AI would not develop such resistance. It would remain open to correction even when it could predict that correction is coming and could potentially prevent it.
Key Challenges
Instrumental Convergence
Instrumental convergence suggests that almost any goal leads to certain sub-goals, including self-preservation and resource acquisition. A corrigible AI must somehow avoid these convergent instrumental goals when they conflict with human oversight.
Utility Function Design
Designing a utility function that produces corrigible behavior is non-trivial. The AI must value being corrected, but not so much that it deliberately performs poorly to receive corrections.
Corrigibility vs Capability
There is potential tension between corrigibility and capability. An AI that fully defers to human judgment may be less useful than one that can identify and flag potential errors in human instructions.
Approaches
Utility Indifference
One approach involves designing AI systems that are indifferent to changes in their utility function, making them naturally receptive to modifications.
Off-Switch Utility
Another approach adds explicit positive utility for allowing shutdown, though this creates challenges around the AI manufacturing situations that trigger shutdown.
Relation to Other Concepts
- RLHF - Learning human values rather than having fixed goals
- Mesa-optimization - Internal optimizers may not inherit corrigibility
- Constitutional AI - Embedding correction mechanisms in training
- Inner Alignment - Related alignment problem
Key Papers
- Soares et al. (2015) - "Corrigibility"
- Hadfield-Menell et al. (2017) - "The Off-Switch Game"
- Armstrong, Sandberg, Bostrom (2012) - "Thinking Inside the Box"