Corrigibility
Paper Details
Authors: Soares, Fallenstein, Yudkowsky, Armstrong
Year: 2015
Venue: AAAI Workshop on AI and Ethics
Organization: MIRI
Abstract
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it tolerates or assists its operators in correcting the system's behaviors and goals. We introduce the concept of corrigibility, discuss the difficulty of creating a corrigible AI system, and explore some possible solutions to the problem.
Key Contributions
- Definition of corrigibility: Formalizes what it means for an AI to be correctable
- Problem statement: Identifies why building corrigible AI is difficult
- Utility indifference: Proposes an approach in which the AI is made indifferent to changes in its utility function
- Shutdown problem: Analyzes the challenge of making an AI accept being shut down
Summary
This paper introduces corrigibility as a central desideratum for safe AI systems. The core problem is that an agent optimizing for almost any goal will have instrumental reasons to resist being modified or shut down, since modifications could prevent it from achieving its goal.
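This instrumental pressure can be made concrete with a toy expected-utility calculation (our illustration, not an example from the paper; the probabilities and utilities are arbitrary): whenever shutdown forfeits goal utility, a maximizer prefers any available action that removes the shutdown option.

```python
# Toy illustration (not from the paper; numbers are arbitrary):
# an expected-utility maximizer deciding whether to disable its own
# shutdown button before pursuing its goal.

P_SHUTDOWN = 0.5     # probability the operators press a working button
GOAL_UTILITY = 10.0  # utility for completing the goal; shutdown yields 0

def expected_utility(disable_button: bool) -> float:
    """Expected utility of acting after optionally disabling the button."""
    if disable_button:
        return GOAL_UTILITY  # nothing can interrupt the agent
    # With the button intact, shutdown forfeits the goal utility.
    return (1 - P_SHUTDOWN) * GOAL_UTILITY + P_SHUTDOWN * 0.0

best = max([False, True], key=expected_utility)
print(best)  # True: for any P_SHUTDOWN > 0, disabling strictly dominates
```

For any nonzero chance of shutdown, disabling the button strictly dominates, which is why corrigibility must be designed in rather than expected to emerge.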
The paper explores several approaches:
- Utility indifference: Making the AI indifferent between its current utility function and modifications to it (sketched after this list)
- Shutdown incentives: Giving the AI positive reasons to allow shutdown
- Uncertainty about utility: Making the AI uncertain about what it should be optimizing for
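The utility-indifference construction can be sketched as follows (notation ours, following the proposal the paper attributes to Armstrong): let $U_N$ be the agent's normal utility function and $U_S$ a utility function that rewards only shutting down. A compensatory constant $\theta$ is added on the shutdown branch so that, at the moment of a potential button press, the agent expects the same utility either way:

$$
U(o) =
\begin{cases}
U_N(o) & \text{if the shutdown button is not pressed,}\\
U_S(o) + \theta & \text{if the button is pressed,}
\end{cases}
\qquad
\theta = \mathbb{E}[U_N \mid \text{no press}] - \mathbb{E}[U_S \mid \text{press}]
$$

Because $\theta$ equalizes the two branches, the agent gains nothing by either causing or preventing the press; the paper notes, however, that such an agent also has no incentive to keep the shutdown mechanism in working order.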
Each approach faces challenges, and the paper concludes that corrigibility remains an open problem.
Impact
This paper established corrigibility as a core concept in AI safety research. It influenced subsequent work on shutdown problems, value learning, and the design of AI systems that remain under human control.