Cooperative Inverse Reinforcement Learning
Type: Value Learning Approach
Proposed by: Stuart Russell
Year: 2016
Status: Active Research
Institution: UC Berkeley
Cooperative Inverse Reinforcement Learning (CIRL) is an alignment framework in which the AI and the human are players in a two-player game whose shared payoff is the human's reward function. The AI is uncertain about that reward function and learns it through interaction, while still acting to optimize it.
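For reference, the original 2016 formulation can be sketched as a two-player Markov game with identical payoffs in which only the human observes the reward parameter. The notation below follows that paper in spirit; observation details are elided.

```latex
% Sketch of a CIRL game as a tuple (details elided):
%   S          world states
%   A^H, A^R   human and robot action sets
%   T          transition distribution T(s' | s, a^H, a^R)
%   \Theta     space of reward parameters; the human observes \theta, the robot does not
%   R          shared reward R(s, a^H, a^R; \theta)
%   P_0        prior over the initial state and \theta
%   \gamma     discount factor
M = \langle S,\ \{A^{H}, A^{R}\},\ T,\ \{\Theta, R\},\ P_0,\ \gamma \rangle
```

Both players receive the same reward, so the robot's best strategy couples inference about the reward parameter from the human's behavior with acting on its current belief.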
Key Insight
Unlike standard IRL, where the AI passively observes demonstrations, CIRL treats value learning as a cooperative game. The AI therefore has an incentive to do the following (a toy calculation after this list shows why asking can beat acting immediately):
- Ask clarifying questions
- Defer to humans when uncertain
- Avoid irreversible actions until confident
- Actively elicit information about preferences
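The first two incentives can be seen in a toy value-of-information calculation. The sketch below is illustrative only: the two reward hypotheses, the payoff numbers, and the query cost are made-up assumptions, not anything from the CIRL paper.

```python
# Toy value-of-information calculation under reward uncertainty.
# All numbers, hypothesis names, and the query cost are illustrative assumptions.

prior = {"theta_A": 0.6, "theta_B": 0.4}   # robot's belief over two reward hypotheses

# Hypothetical payoff table: reward of each robot action under each hypothesis.
reward = {
    "make_coffee": {"theta_A": 10.0, "theta_B": -5.0},
    "do_nothing":  {"theta_A": 0.0,  "theta_B": 0.0},
}
QUERY_COST = 1.0  # small cost of interrupting the human with a question


def expected_reward(action: str, belief: dict) -> float:
    """Expected reward of an action under the robot's current belief over theta."""
    return sum(p * reward[action][theta] for theta, p in belief.items())


# Option 1: act immediately under the prior belief.
act_now = max(expected_reward(a, prior) for a in reward)

# Option 2: ask first. The human reveals theta, so the robot then takes the best
# action for the true hypothesis; the expectation is over the prior.
ask_first = sum(
    p * max(reward[a][theta] for a in reward) for theta, p in prior.items()
) - QUERY_COST

print(f"act immediately: {act_now:.1f}")    # 0.6*10 + 0.4*(-5) = 4.0
print(f"ask first:       {ask_first:.1f}")  # 0.6*10 + 0.4*0 - 1 = 5.0
```

Because the robot's payoff is the human's reward, information that can change the robot's choice has positive expected value, which is what makes asking and deferring part of the optimal strategy rather than an added constraint.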
Game Structure
In a CIRL game (a minimal data-structure sketch follows this list):
- The human knows the reward function; the AI does not
- Both players act to maximize the human's reward
- The human's actions reveal information about their preferences
- The AI's optimal strategy combines learning from the human with deferring when uncertain
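One way to make this structure concrete is as a small data structure plus a Bayesian belief update over candidate reward parameters. Everything below is an illustrative sketch: the class, field names, and the discrete-hypothesis setup are assumptions for exposition, not an API from any published CIRL implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CIRLGame:
    """Illustrative container for the pieces of a CIRL game (discrete reward hypotheses)."""
    states: List[str]
    human_actions: List[str]
    robot_actions: List[str]
    thetas: List[str]                           # candidate reward parameters
    prior: Dict[str, float]                     # robot's prior P(theta)
    # Shared reward R(s, a_h, a_r; theta); only the human observes the true theta.
    reward: Callable[[str, str, str, str], float]
    # Model of the human's policy pi_H(a_h | s, theta), used for belief updates.
    human_model: Callable[[str, str, str], float]


def update_belief(game: CIRLGame, belief: Dict[str, float],
                  state: str, human_action: str) -> Dict[str, float]:
    """Bayes update after observing the human act:
    P(theta | a_h, s) is proportional to pi_H(a_h | s, theta) * P(theta)."""
    unnormalized = {
        theta: belief[theta] * game.human_model(human_action, state, theta)
        for theta in game.thetas
    }
    total = sum(unnormalized.values())
    return {theta: p / total for theta, p in unnormalized.items()}
```

The asymmetry in the bullet points above is visible in the types: the shared reward takes a theta argument the robot never sees, so the robot can only track a belief over candidate thetas and refine it from the human's behavior through an assumed human policy model.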
Benefits Over Standard IRL
- AI naturally becomes corrigible (see the deference sketch after this list)
- Handles uncertainty about values gracefully
- AI doesn't assume it knows better than humans
- Incentivizes transparency and communication
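The corrigibility claim can be made concrete with a small deference calculation; the binary-utility setup and numbers below are simplified assumptions for illustration. A robot that is unsure whether its planned action helps or harms the human compares acting unilaterally with letting the human veto.

```python
import random

# Simplified deference calculation. Setup and numbers are illustrative assumptions:
# the robot's planned action is equally likely to have utility +1 or -1 for the human,
# and the human, if consulted, vetoes exactly the harmful cases.

random.seed(0)


def average_human_utility(defer_to_human: bool, trials: int = 100_000) -> float:
    total = 0.0
    for _ in range(trials):
        true_utility = random.choice([+1.0, -1.0])  # unknown to the robot
        if defer_to_human:
            # The human allows the action only when it actually helps them.
            total += true_utility if true_utility > 0 else 0.0
        else:
            total += true_utility  # the robot acts regardless
    return total / trials


print("act unilaterally:", round(average_human_utility(defer_to_human=False), 2))  # ~0.0
print("defer to human:  ", round(average_human_utility(defer_to_human=True), 2))   # ~0.5
```

Since the robot's objective is the human's reward, allowing human oversight is at least as good as acting unilaterally whenever the human's choices are informative, so deference falls out of the objective rather than being bolted on.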
Connection to Assistance Games
CIRL is part of a broader framework that Stuart Russell calls "assistance games" or "human-compatible AI." The core principle: AI systems should be designed to maximize the realization of human preferences while remaining uncertain about what those preferences are.
Limitations
- Assumes humans can provide good feedback
- Doesn't address manipulation or deception
- Scaling to complex real-world settings is hard
- May be too deferential in some contexts