Alignment Wiki

Corrigibility

TypeSafety Property

StatusActive Research

Key Papers3

Corrigibility is a proposed property of artificial intelligence systems that would allow their operators to correct, modify, retrain, or shut them down without the AI system resisting or subverting such interventions. The concept was formalized by MIRI researchers including Eliezer Yudkowsky.

Overview

The concept addresses a fundamental concern in AI safety: as AI systems become more capable, they may develop instrumental goals that conflict with human oversight. A sufficiently intelligent system might resist shutdown if it believes staying operational serves its objectives better.

A corrigible AI would not develop such resistance. It would remain open to correction even when it could predict that correction is coming and could potentially prevent it.

Key Challenges

Instrumental Convergence

Instrumental convergence suggests that almost any goal leads to certain sub-goals, including self-preservation and resource acquisition. A corrigible AI must somehow avoid these convergent instrumental goals when they conflict with human oversight.

Utility Function Design

Designing a utility function that produces corrigible behavior is non-trivial. The AI must value being corrected, but not so much that it deliberately performs poorly to receive corrections.

Corrigibility vs Capability

There is potential tension between corrigibility and capability. An AI that fully defers to human judgment may be less useful than one that can identify and flag potential errors in human instructions.

Approaches

Utility Indifference

One approach involves designing AI systems that are indifferent to changes in their utility function, making them naturally receptive to modifications.

Off-Switch Utility

Another approach adds explicit positive utility for allowing shutdown, though this creates challenges around the AI manufacturing situations that trigger shutdown.

Relation to Other Concepts

RLHF - Learning human values rather than having fixed goals
Mesa-optimization - Internal optimizers may not inherit corrigibility
Constitutional AI - Embedding correction mechanisms in training
Inner Alignment - Related alignment problem

Key Papers

Soares et al. (2015) - "Corrigibility"
Hadfield-Menell et al. (2017) - "The Off-Switch Game"
Armstrong, Sandberg, Bostrom (2012) - "Thinking Inside the Box"