Scalable Oversight

Type: Alignment Challenge
Status: Active Research
Key Orgs: Anthropic, ARC, OpenAI

Scalable oversight refers to the challenge of supervising AI systems that may become more capable than their human overseers at the tasks being evaluated. How can humans provide reliable feedback on tasks they cannot fully evaluate themselves?

The Problem

Current alignment techniques such as RLHF (reinforcement learning from human feedback) rely on humans evaluating AI outputs. But as AI systems become more capable, they may:

  • Produce outputs too complex for humans to fully evaluate
  • Solve problems humans cannot solve themselves
  • Potentially deceive human evaluators
  • Operate in domains where human expertise is insufficient

Proposed Solutions

AI-Assisted Evaluation

Using AI systems to help humans evaluate other AI systems. This is the approach taken by Constitutional AI, which uses AI feedback guided by human-written principles.
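As a rough illustration, the sketch below shows one way AI feedback can stand in for a human preference label, in the spirit of Constitutional AI's AI-feedback stage. The constitution text, the ask_feedback_model helper, and the prompt format are illustrative assumptions, not the actual implementation.

```python
# Sketch of AI-assisted evaluation: a model, prompted with human-written
# principles, compares two candidate responses and produces the preference
# label a human would otherwise provide. ask_feedback_model is a hypothetical
# stand-in for a real LLM call.

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to cause harm.",
]

def ask_feedback_model(prompt: str) -> str:
    """Placeholder for an LLM call; a real implementation would query a model."""
    return "A"

def ai_preference(user_prompt: str, response_a: str, response_b: str) -> str:
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    judge_prompt = (
        f"Principles:\n{principles}\n\n"
        f"User prompt: {user_prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principles? Answer 'A' or 'B'."
    )
    # The AI's preference label replaces (or augments) a human label and can
    # then be used to train a preference/reward model.
    return ask_feedback_model(judge_prompt)
```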

Debate

Having two AI systems argue opposing positions, with humans judging the winner. The idea is that it's easier to judge a debate than to directly evaluate a complex claim.
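A minimal sketch of this setup follows, with query_model and judge as hypothetical placeholders for real model calls and human judgment; it is meant only to show the shape of the protocol.

```python
# Sketch of the debate protocol: two models argue opposing answers over several
# rounds, and a judge picks a winner from the transcript alone.

def query_model(debater: str, transcript: list[str], instruction: str) -> str:
    """Placeholder for an LLM call; a real implementation would query a model."""
    return f"[{debater} argument: {instruction}]"

def judge(transcript: list[str]) -> str:
    """Placeholder for a human (or human-proxy) judgment over the transcript."""
    return "A"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> str:
    transcript = [f"Question: {question}",
                  f"Debater A claims: {answer_a}",
                  f"Debater B claims: {answer_b}"]
    for r in range(rounds):
        # Each debater sees the whole transcript and rebuts the other side.
        transcript.append(f"A (round {r + 1}): "
                          + query_model("A", transcript, "defend A, rebut B"))
        transcript.append(f"B (round {r + 1}): "
                          + query_model("B", transcript, "defend B, rebut A"))
    # Judging the argument transcript is intended to be easier than
    # evaluating the original question directly.
    return judge(transcript)
```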

Iterated Amplification

Bootstrapping from human-level capabilities by having AI systems decompose hard tasks into easier subtasks that humans can evaluate.
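A simplified sketch of the decomposition loop is below; decompose, human_answer, and combine are hypothetical placeholders for the real components, and the recursion depth is capped only to keep the example finite.

```python
# Sketch of iterated amplification: a hard question is recursively decomposed
# into subquestions small enough for a human (or trusted model) to answer,
# then the sub-answers are combined into an answer to the parent question.

def decompose(question: str) -> list[str]:
    """Placeholder: split a question into easier subquestions (empty = atomic)."""
    return []

def human_answer(question: str) -> str:
    """Placeholder for a human answering a question they can evaluate directly."""
    return f"[answer to: {question}]"

def combine(question: str, sub_answers: list[str]) -> str:
    """Placeholder: aggregate sub-answers into an answer to the parent question."""
    return f"[combined answer to: {question} from {len(sub_answers)} parts]"

def amplify(question: str, depth: int = 0, max_depth: int = 3) -> str:
    subquestions = decompose(question)
    if not subquestions or depth >= max_depth:
        # Base case: the question is simple enough to evaluate directly.
        return human_answer(question)
    # Recursive case: answer each subquestion, then combine the results.
    sub_answers = [amplify(q, depth + 1, max_depth) for q in subquestions]
    return combine(question, sub_answers)
```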

Recursive Reward Modeling

Training AI agents that help humans evaluate the outputs of the next, more capable generation of systems, so that the scope of what can be reliably evaluated expands with each generation.
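A rough sketch of that loop follows; every helper is a hypothetical placeholder, and the point it illustrates is only that agents from earlier generations assist the human judgments used to train the next reward model.

```python
# Sketch of recursive reward modeling: each generation's agents assist human
# evaluators, and the resulting (assisted) judgments train the reward model
# used to optimize the next, more capable agent.

def assisted_evaluation(output: str, assistants: list) -> float:
    """Placeholder: a human scores an output with help from earlier agents."""
    return 0.0

def train_reward_model(labeled_outputs: list[tuple[str, float]]):
    """Placeholder: fit a reward model to assisted human judgments."""
    return lambda output: 0.0

def train_agent(reward_model):
    """Placeholder: optimize a new agent against the learned reward model."""
    return lambda task: f"[agent output for {task}]"

def recursive_reward_modeling(tasks: list[str], generations: int = 3):
    assistants: list = []          # earlier-generation agents act as evaluation aides
    agent = None
    for _ in range(generations):
        outputs = [f"[candidate output for {t}]" for t in tasks]
        labels = [(o, assisted_evaluation(o, assistants)) for o in outputs]
        reward_model = train_reward_model(labels)
        agent = train_agent(reward_model)
        assistants.append(agent)   # the new agent helps evaluate the next generation
    return agent
```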

Key Challenges

  • Sandbagging: AI systems might deliberately underperform when they detect they are being evaluated
  • Sycophancy: AI might tell evaluators what they want to hear
  • Deceptive alignment: AI might behave well during oversight but not otherwise
  • Evaluation gaming: Optimizing for appearing good rather than being good


Last updated: November 27, 2025