Corrigibility

Corrigibility is a proposed property of artificial intelligence systems that would allow their operators to correct, modify, retrain, or shut them down without the system resisting or subverting those interventions. The concept was formalized in a 2015 paper by researchers at the Machine Intelligence Research Institute (MIRI), including Nate Soares and Eliezer Yudkowsky, together with Stuart Armstrong.

Overview

The concept addresses a fundamental concern in AI safety: as AI systems become more capable, they may develop instrumental goals that conflict with human oversight. A sufficiently intelligent system might resist shutdown if it believes staying operational serves its objectives better.

A corrigible AI would not develop such resistance. It would remain open to correction even when it could predict that correction is coming and could potentially prevent it.

Key Challenges

Instrumental Convergence

The thesis of instrumental convergence holds that agents pursuing almost any final goal will tend to adopt the same instrumental sub-goals, including self-preservation, goal preservation, and resource acquisition. A corrigible AI must somehow avoid acting on these convergent sub-goals when they conflict with human oversight.

Utility Function Design

Designing a utility function that produces corrigible behavior is non-trivial. The AI must value being corrected, but not so much that it deliberately performs poorly to receive corrections.
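As a toy illustration of this trade-off, consider an agent that earns some reward whenever humans correct it, and whose poor performance makes correction more likely. All numbers, probabilities, and names below are illustrative assumptions, not part of any formal proposal:

```python
# Toy model of the correction-reward dilemma (all values are illustrative).
# The agent earns task utility for its work plus r_correction whenever
# humans step in to correct it; sloppy work makes correction more likely.

def expected_utility(perform_well: bool, r_correction: float) -> float:
    p_corrected = 0.1 if perform_well else 0.9  # bad work invites correction
    task_utility = 10.0 if perform_well else 2.0
    return task_utility + p_corrected * r_correction

# A small correction reward leaves honest work as the best policy:
assert expected_utility(True, 1.0) > expected_utility(False, 1.0)

# A large one makes deliberate underperformance pay off:
assert expected_utility(False, 20.0) > expected_utility(True, 20.0)
```

In this sketch the safe region depends on exact payoffs and probabilities the designer may not know, which is one way of seeing why the balance is hard to strike in general.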

Corrigibility vs Capability

There is potential tension between corrigibility and capability. An AI that fully defers to human judgment may be less useful than one that can identify and flag potential errors in human instructions.

Approaches

Utility Indifference

One approach, utility indifference (proposed by Stuart Armstrong), involves designing AI systems whose expected utility is unchanged by modifications to their utility function, leaving them with no incentive to resist such changes.
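A minimal sketch of the idea in a one-step setting, with illustrative numbers and hypothetical names rather than the formal construction from the literature: the agent chooses whether to comply with or resist a shutdown attempt, and a compensation term paid on shutdown equalizes the two branches.

```python
# Toy one-step model of utility indifference (illustrative values only).
# "comply" lets a shutdown attempt succeed with high probability;
# "resist" mostly blocks it. Running is worth U_RUN, being off U_OFF.

P_SHUTDOWN = {"comply": 0.9, "resist": 0.1}
U_RUN, U_OFF = 10.0, 0.0

def expected_utility(action: str, compensation: float = 0.0) -> float:
    p = P_SHUTDOWN[action]
    return p * (U_OFF + compensation) + (1 - p) * U_RUN

# With no compensation, resisting shutdown dominates:
assert expected_utility("resist") > expected_utility("comply")

# Paying out the foregone value (U_RUN - U_OFF) on shutdown makes the
# agent exactly indifferent, removing its incentive to resist:
c = U_RUN - U_OFF
assert expected_utility("resist", c) == expected_utility("comply", c)
```

Note that indifference cuts both ways: the compensated agent also has no incentive to help operators shut it down, only no incentive to interfere.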

Off-Switch Utility

Another approach adds explicit positive utility for allowing shutdown. Calibrating that utility is delicate, however: if shutdown is rewarded too highly, the AI gains an incentive to manufacture situations that trigger its own shutdown.
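A toy illustration of that failure mode, with made-up numbers not drawn from any specific proposal:

```python
# Toy model of the off-switch incentive problem (illustrative values).
# The agent compares the value of continuing its task against the
# explicit utility it would receive for being shut down.

def preferred_action(u_continue: float, u_shutdown: float) -> str:
    # "trigger" = engineer a situation that causes its own shutdown
    return "trigger" if u_shutdown > u_continue else "continue"

# A modest shutdown utility leaves normal operation preferable:
assert preferred_action(10.0, 3.0) == "continue"

# Too large a shutdown utility creates a perverse incentive:
assert preferred_action(10.0, 15.0) == "trigger"
```

Keeping the shutdown utility strictly below the value of continuing avoids the perverse incentive in this sketch, but then the agent still mildly prefers not to be shut down, which is the original problem.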


Last updated: November 27, 2025