Cooperative Inverse Reinforcement Learning

Cooperative IRL
Type: Value Learning Approach
Proposed By: Stuart Russell
Year: 2016
Status: Active Research
Institution: UC Berkeley

Cooperative Inverse Reinforcement Learning (CIRL) is an alignment framework in which a human and an AI agent play a cooperative two-player game. Only the human knows the reward function; the AI is uncertain about human values and learns them through interaction, while acting to maximize the human's true (unknown) reward.

Key Insight

Unlike standard IRL where the AI passively observes, CIRL treats value learning as a cooperative game. The AI has an incentive to:

  • Ask clarifying questions
  • Defer to humans when uncertain
  • Avoid irreversible actions until confident
  • Actively elicit information about preferences

Game Structure

In a CIRL game:

  • The human knows the reward function; the AI does not
  • Both players want to maximize human's reward
  • Human's actions reveal information about preferences
  • Optimal AI strategy involves learning and deferring
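The third point, that human actions reveal information about preferences, is typically modeled as Bayesian inference: the AI assumes the human is noisily rational and updates its belief over reward hypotheses after each observed action. The sketch below assumes a softmax (Boltzmann) human model with an illustrative rationality parameter `beta`; the specific hypotheses and rewards are made up for the example.

```python
import math

# Illustrative sketch: Bayesian update over reward hypotheses from one
# observed human action, assuming a Boltzmann-rational human model.

HYPOTHESES = {"prefers_red": 0.5, "prefers_blue": 0.5}  # uniform prior
REWARD = {
    "prefers_red":  {"pick_red": 1.0, "pick_blue": 0.0},
    "prefers_blue": {"pick_red": 0.0, "pick_blue": 1.0},
}
ACTIONS = ["pick_red", "pick_blue"]

def likelihood(action, hypothesis, beta=2.0):
    """P(action | hypothesis): the human picks actions with probability
    proportional to exp(beta * reward)."""
    weights = {a: math.exp(beta * REWARD[hypothesis][a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def update(prior, observed_action):
    """Bayes' rule: posterior(h) proportional to prior(h) * P(action | h)."""
    unnorm = {h: p * likelihood(observed_action, h) for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

posterior = update(HYPOTHESES, "pick_red")
print(posterior["prefers_red"])  # > 0.5: choosing red is evidence for red
```

In full CIRL the human can go further and choose pedagogical actions that are maximally informative rather than merely reward-maximizing, which is part of what makes the game cooperative.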

Benefits Over Standard IRL

  • AI naturally becomes corrigible
  • Handles uncertainty about values gracefully
  • AI doesn't assume it knows better than humans
  • Incentivizes transparency and communication
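The corrigibility claim can be made concrete with a toy expected-utility comparison in the spirit of the related "off-switch game" analysis: if the AI is uncertain whether its proposed action is good, letting an informed human veto it never costs expected value. The payoff numbers below are assumptions for illustration.

```python
# Illustrative sketch: deferring to a human veto weakly dominates acting
# unilaterally when the AI is uncertain about the action's value.

def value_act(p_good):
    """Expected human utility if the AI acts unilaterally:
    the action is worth +1 with probability p_good, else -1."""
    return p_good * 1.0 + (1.0 - p_good) * (-1.0)

def value_defer(p_good):
    """Expected utility if the AI proposes the action and a fully
    informed human approves it only when it is actually good."""
    return p_good * 1.0 + (1.0 - p_good) * 0.0

for p in (0.3, 0.6, 0.9):
    # Deferring never loses expected value, and strictly wins when p < 1.
    assert value_defer(p) >= value_act(p)
```

The key assumption is that the human's veto is informed; with a fallible human, the comparison becomes less one-sided, which connects to the limitations discussed below.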

Connection to Assistance Games

CIRL is part of a broader framework Stuart Russell calls "assistance games" or "human-compatible AI." The core principle: AI should be designed to maximize human preferences while being uncertain about what those preferences are.

Limitations

  • Assumes humans can provide good feedback
  • Doesn't address manipulation or deception
  • Scaling to complex real-world settings is hard
  • May be too deferential in some contexts

Last updated: November 28, 2025