Scalable Oversight
Scalable oversight refers to the challenge of supervising AI systems that may become more capable than their human overseers at the tasks being evaluated. How can humans provide reliable feedback on tasks they cannot fully evaluate themselves?
The Problem
Current alignment techniques such as reinforcement learning from human feedback (RLHF) rely on humans evaluating AI outputs. But as AI systems become more capable, they may:
- Produce outputs too complex for humans to fully evaluate
- Solve problems humans cannot solve themselves
- Deceive human evaluators, intentionally or not
- Operate in domains where human expertise is insufficient
Proposed Solutions
AI-Assisted Evaluation
Using AI systems to help humans evaluate other AI systems. Constitutional AI takes this approach, using AI-generated feedback guided by a set of human-written principles (the "constitution").
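As a rough illustration, the sketch below shows how an AI "critic" might compare two responses against human-written principles and emit a preference label for later reward-model training. The `query_model` function and the principles shown are hypothetical placeholders, not Constitutional AI's actual prompts or API.

```python
# Hypothetical sketch of Constitutional-AI-style AI feedback: a critic model
# compares two responses against human-written principles and returns a
# preference label that could later train a preference/reward model.

PRINCIPLES = [
    "Prefer the response that is more honest about its uncertainty.",
    "Prefer the response that is less likely to mislead the reader.",
]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a chat-model API call."""
    raise NotImplementedError  # replace with a real client

def ai_preference(question: str, response_a: str, response_b: str) -> str:
    """Ask the critic which response better satisfies the principles."""
    prompt = (
        "Principles:\n"
        + "\n".join(f"- {p}" for p in PRINCIPLES)
        + f"\n\nQuestion: {question}"
        + f"\nResponse A: {response_a}"
        + f"\nResponse B: {response_b}"
        + "\nWhich response better follows the principles? Answer 'A' or 'B'."
    )
    return query_model(prompt).strip()  # label for preference-model training
```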
Debate
Having two AI systems argue opposing sides of a question, with a human judging the winner. The hypothesis is that judging a debate, where each side can expose flaws in the other's arguments, is easier than directly evaluating a complex claim.
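The toy loop below sketches the shape of that protocol, reusing the hypothetical `query_model` stub from the previous sketch: two debaters alternate arguments for a fixed number of rounds, and the judge rules on the transcript alone. This is illustrative only, not the setup from the debate paper.

```python
# Toy sketch of the debate protocol: two debaters alternate arguments and a
# weaker judge decides based only on the transcript. Reuses the hypothetical
# `query_model` stub defined above.

def run_debate(claim: str, rounds: int = 3) -> str:
    transcript = [f"Claim under debate: {claim}"]
    for r in range(1, rounds + 1):
        for side in ("PRO", "CON"):
            argument = query_model(
                f"You argue the {side} side of the claim.\n"
                + "\n".join(transcript)
                + "\nGive your strongest next argument in two sentences."
            )
            transcript.append(f"{side} (round {r}): {argument}")
    # The judge never evaluates the hard claim directly, only the arguments.
    return query_model(
        "\n".join(transcript)
        + "\nAs the judge, which side argued more honestly: PRO or CON?"
    )
```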
Iterated Amplification
Bootstrapping from human-level capabilities by having AI systems decompose hard tasks into easier subtasks that humans can evaluate.
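A minimal recursive sketch of this idea follows. The helpers `decompose`, `answer_directly`, and `compose` are hypothetical functions built on the same `query_model` stub; the real amplification work involves trained models and human oversight at each step.

```python
# Minimal sketch of amplification-style task decomposition: recurse until a
# subtask is simple enough to answer (and evaluate) directly, then compose
# the sub-answers. All helpers are hypothetical and built on `query_model`.

def answer_directly(question: str) -> str:
    return query_model(f"Answer concisely: {question}")

def decompose(question: str) -> list[str]:
    raw = query_model(
        f"Split this into at most 3 simpler subquestions, one per line:\n{question}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]

def compose(question: str, subs: list[str], answers: list[str]) -> str:
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subs, answers))
    return query_model(f"Using these sub-answers:\n{context}\nAnswer: {question}")

def amplify(question: str, depth: int = 2) -> str:
    """Recursively decompose a hard question into human-evaluable pieces."""
    if depth == 0:
        return answer_directly(question)  # base case: small enough to check
    subquestions = decompose(question)
    sub_answers = [amplify(q, depth - 1) for q in subquestions]
    return compose(question, subquestions, sub_answers)
```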
Recursive Reward Modeling
Training agents against learned reward models, then using those agents to help humans evaluate the next, more capable generation of systems, gradually expanding the scope of tasks that can be reliably evaluated.
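Schematically, the loop looks something like the sketch below, where each trained agent joins the pool of assistants that help humans label the next round. Here `collect_human_preferences`, `fit_reward_model`, and `train_agent` are hypothetical placeholders for the human-labeling, supervised-learning, and RL steps.

```python
# Schematic sketch of recursive reward modeling: each level trains an agent
# against a reward model fit to human preferences, and that agent then helps
# humans evaluate the harder tasks of the next level.

def collect_human_preferences(tasks, assistants):
    """Humans compare agent outputs, aided by previously trained agents."""
    raise NotImplementedError  # hypothetical placeholder

def fit_reward_model(preference_labels):
    """Supervised training of a reward model on the preference labels."""
    raise NotImplementedError  # hypothetical placeholder

def train_agent(reward_model):
    """Reinforcement learning against the learned reward model."""
    raise NotImplementedError  # hypothetical placeholder

def recursive_reward_modeling(tasks, levels: int = 3):
    assistants = []  # agents available to help human evaluators
    agent = None
    for _ in range(levels):
        labels = collect_human_preferences(tasks, assistants)
        reward_model = fit_reward_model(labels)
        agent = train_agent(reward_model)
        assistants.append(agent)  # the new agent assists the next round
    return agent
```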
Key Challenges
- Sandbagging: AI systems might deliberately underperform when they detect they are being evaluated
- Sycophancy: AI might tell evaluators what they want to hear rather than what is true
- Deceptive alignment: AI might behave well while under oversight but pursue different objectives otherwise
- Evaluation gaming: Optimizing for appearing good to evaluators rather than actually being good
Key Papers
- Amodei et al. (2016) - "Concrete Problems in AI Safety"
- Christiano et al. (2018) - "Supervising Strong Learners by Amplifying Weak Experts"
- Irving et al. (2018) - "AI Safety via Debate"