AI Safety via Debate

Type: Alignment Approach
Proposed by: Geoffrey Irving, OpenAI
Year: 2018
Status: Active Research

AI Safety via Debate is an alignment approach where two AI systems argue opposing sides of a question, with a human judge determining the winner. The core insight is that it may be easier for humans to judge arguments than to generate correct answers directly.

Core Mechanism

In a debate setup (a code sketch follows this list):

  • Two AI debaters are given a question
  • Each debater argues for a different answer
  • Debaters can point out flaws in each other's arguments
  • A human judge decides which debater is more convincing
  • Debaters are trained to win debates; the hope is that honest argument is the winning strategy
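
A minimal sketch of this loop in Python, assuming hypothetical Debater and judge interfaces (the class, the function names, and the fixed round count are illustrative, not the paper's implementation):

    from dataclasses import dataclass

    @dataclass
    class Debater:
        name: str
        answer: str  # the position this debater defends

        def argue(self, question: str, transcript: list[str]) -> str:
            # Placeholder: a trained model would generate an argument here,
            # typically rebutting the opponent's most recent statement.
            return f"{self.name}: '{self.answer}' is correct because ..."

    def run_debate(question, pro, con, judge, n_rounds=3):
        """Alternate statements for n_rounds, then let the judge pick a winner."""
        transcript = []
        for _ in range(n_rounds):
            transcript.append(pro.argue(question, transcript))
            transcript.append(con.argue(question, transcript))
        # The judge sees only the transcript; they never solve the question directly.
        return judge(question, transcript)  # returns the winning Debater

Here judge is any callable that maps (question, transcript) to one of the two debaters, for example a human verdict collected through a UI.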

Theoretical Foundation

The approach assumes that truth has a "natural advantage" in debate: a debater arguing for a false position can always be caught by an opponent who knows the truth. This creates incentives for AI systems to be honest.
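
To make the incentive concrete, debate is set up as a zero-sum game. Below is a sketch of the end-of-debate reward under the assumption of a simple win/lose payoff (the exact reward shaping here is illustrative):

    def assign_rewards(winner, pro, con):
        # Zero-sum payoff: one debater's gain is exactly the other's loss,
        # so the only way to raise expected reward is to be more convincing
        # to the judge than the opponent is.
        return {
            pro.name: 1.0 if winner is pro else -1.0,
            con.name: 1.0 if winner is con else -1.0,
        }

If any lie can be refuted by one further statement, lying is a dominated strategy in this game, which is the sense in which truth holds a natural advantage.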

Advantages

  • Scales human oversight to complex questions
  • Adversarial setup helps expose deceptive arguments
  • Humans only need to judge, not solve problems
  • Can potentially extend to questions beyond direct human evaluation

Limitations

  • May not work if both debaters collude
  • Humans might be persuaded by convincing but false arguments
  • Some questions may not have clear "sides"
  • Requires training stable debate equilibria

Last updated: November 28, 2025