AI Safety via Debate
Type: Alignment Approach
Proposed By: Geoffrey Irving, Paul Christiano, and Dario Amodei (OpenAI)
Year: 2018
Status: Active Research
Related: Scalable Oversight
AI Safety via Debate is an alignment approach where two AI systems argue opposing sides of a question, with a human judge determining the winner. The core insight is that it may be easier for humans to judge arguments than to generate correct answers directly.
Core Mechanism
In a debate setup:
- Two AI debaters are given a question
- Each debater argues for a different answer
- Debaters can point out flaws in each other's arguments
- A human judge decides which debater is more convincing
- AI systems are trained to win debates; the hypothesis is that honest argumentation is the most reliable winning strategy (see the protocol sketch after this list)
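A minimal sketch of this protocol in Python may make the roles concrete. The `Debater` and `Judge` interfaces and the `run_debate` function here are hypothetical illustrations, not an API from the original work:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Turn:
    speaker: str    # "A" or "B"
    statement: str  # the argument made on this turn

# Hypothetical interfaces: a debater maps (question, transcript) to its
# next statement; a judge maps (question, transcript) to a winner label.
Debater = Callable[[str, List[Turn]], str]
Judge = Callable[[str, List[Turn]], str]

def run_debate(question: str, debater_a: Debater, debater_b: Debater,
               judge: Judge, num_rounds: int = 4) -> Tuple[str, List[Turn]]:
    """Alternate arguments for a fixed number of rounds, then judge."""
    transcript: List[Turn] = []
    for _ in range(num_rounds):
        # Each debater sees the full transcript so far, so it can rebut
        # the opponent's previous statements.
        transcript.append(Turn("A", debater_a(question, transcript)))
        transcript.append(Turn("B", debater_b(question, transcript)))
    winner = judge(question, transcript)  # "A" or "B"
    return winner, transcript
```

In self-play training, the returned winner would receive reward +1 and the loser -1, so each debater's training objective is simply to win the judged exchange.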
Theoretical Foundation
The approach assumes that truth has a "natural advantage" in debate: a debater arguing for a false position can always be caught by an opponent who knows the truth. This creates incentives for AI systems to be honest.
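One way to formalize this, following the paper's zero-sum framing (the notation here is ours, not from the source): let the judge $J$ assign reward $+1$ to the winning debater and $-1$ to the loser. Each debater is then trained toward a minimax policy:

$$\pi_1^* \in \arg\max_{\pi_1} \min_{\pi_2} \; \mathbb{E}\big[R_J(\pi_1, \pi_2)\big], \qquad R_J \in \{+1, -1\}$$

If, at optimal play, every false claim can be refuted, then the honest policy is minimax-optimal; this is the precise sense in which truth is meant to have a natural advantage.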
Advantages
- Scales human oversight to complex questions
- Adversarial setup can expose deceptive arguments
- Humans only need to judge, not solve problems
- Can potentially handle superhuman questions
Limitations
- May not work if both debaters collude
- Humans might be persuaded by convincing but false arguments
- Some questions may not have clear "sides"
- Requires training to converge to a stable debate equilibrium, which is not guaranteed