Recursive Reward Modeling
Recursive Reward Modeling (RRM) is an approach to scalable oversight that uses AI assistants to help humans provide better feedback for training AI systems.
Core Concept
The key insight is that humans can provide higher-quality feedback when assisted by AI. This creates a virtuous cycle (sketched in code after this list):
- Train an AI assistant using human feedback
- Use the AI assistant to help humans evaluate harder tasks
- Better evaluations enable training on those harder tasks
- The improved AI becomes a better assistant for evaluation
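A minimal, runnable sketch of this loop, with toy stand-ins for preference collection, reward-model training, and policy training (all of the helpers below are illustrative assumptions, not from the page or any real library):

```python
# Toy sketch of the RRM bootstrapping loop. Every helper here is a
# hypothetical stand-in for a real RLHF pipeline; the stubs only
# illustrate how information flows between rounds.

def collect_preferences(tasks, assistant):
    """Humans pick between two drafts per task, optionally aided by an
    assistant critique. Stub: the rater always prefers draft A."""
    preferences = []
    for task in tasks:
        a, b = task + "/A", task + "/B"
        if assistant is not None:
            _critique = assistant(a)  # shown to the human rater
        preferences.append((a, b))    # (preferred, rejected)
    return preferences

def train_reward_model(preferences):
    """Stub reward model: score 1.0 for outputs humans preferred."""
    preferred = {p for p, _ in preferences}
    return lambda output: 1.0 if output in preferred else 0.0

def train_policy(reward_model, tasks):
    """Stub policy 'training': choose the higher-reward draft."""
    return lambda task: max((task + "/A", task + "/B"), key=reward_model)

def recursive_reward_modeling(rounds_of_tasks):
    assistant = None  # round 0 is plain RLHF (unaided human feedback)
    policy = None
    for tasks in rounds_of_tasks:  # harder tasks in each later round
        preferences = collect_preferences(tasks, assistant)
        reward_model = train_reward_model(preferences)
        policy = train_policy(reward_model, tasks)
        assistant = policy  # the improved policy assists the next round
    return policy

policy = recursive_reward_modeling([["easy task"], ["harder task"]])
print(policy("harder task"))  # -> "harder task/A"
```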
Relationship to RLHF
RRM extends RLHF to tasks where humans struggle to provide good feedback directly. Rather than replacing human judgment, AI assistance augments it so that oversight scales to more complex domains.
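Concretely, the change from plain RLHF is in the evaluation step: the rater sees assistant critiques alongside the outputs. A minimal sketch, where `critique` and `human_judge` are hypothetical stubs rather than a real API:

```python
# Hypothetical sketch of one assisted comparison; `critique` and
# `human_judge` are illustrative stubs, not a real API.

def critique(output: str) -> str:
    """Assistant surfaces flaws a rater might miss (stubbed)."""
    return f"Possible issue in {output!r}: step 2 cites no source."

def human_judge(a: str, b: str, notes: str = "") -> str:
    """Stand-in for a human rater. In RRM the critiques inform the
    rater's judgment; they do not decide for the rater."""
    return a  # stub: rater prefers `a`

# Plain RLHF: the rater sees only the two outputs.
winner = human_judge("answer A", "answer B")

# RRM: the rater also sees assistant critiques of each output.
winner = human_judge("answer A", "answer B",
                     notes=critique("answer A") + "\n" + critique("answer B"))
```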
Key Components
- Reward Model: Trained to predict human preferences (see the loss sketch after this list)
- Assistant: Helps humans understand and evaluate outputs
- Recursion: Each iteration improves both components
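The page does not fix a training objective for the reward model; a common choice in standard RLHF is a Bradley-Terry pairwise loss, sketched here with a toy PyTorch scorer over stand-in feature vectors (the network shape and data are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Toy reward model trained on pairwise preferences with the
# Bradley-Terry loss common in standard RLHF (an assumption; the page
# does not specify a loss). Inputs are fixed-size feature vectors
# standing in for embedded (prompt, response) pairs.

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def preference_loss(chosen_feats, rejected_feats):
    """-log sigmoid(r(chosen) - r(rejected)), averaged over the batch."""
    r_chosen = reward_model(chosen_feats)
    r_rejected = reward_model(rejected_feats)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# One training step on a random toy batch of 8 preference pairs.
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(chosen, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```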
Advantages
- Maintains human oversight while scaling
- Builds on proven RLHF techniques
- Addresses limitations of unaided human feedback
- Provides a natural path to more capable aligned systems
Challenges
- AI assistance might bias human judgments
- Errors could compound across iterations (see the back-of-envelope sketch after this list)
- Requires careful design of assistance interface
- Hard to verify alignment is preserved
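To see why compounding matters, a back-of-envelope model (an illustrative assumption, not a result from the page or the literature): if each iteration independently preserves the fidelity of the previous round's feedback with probability p, fidelity after n rounds decays roughly like p**n:

```python
# Back-of-envelope: if each round preserves feedback fidelity with
# independent probability p, n rounds retain roughly p**n of it.
# (An illustrative assumption, not a result from the literature.)
for p in (0.99, 0.95, 0.90):
    print(p, [round(p**n, 3) for n in (1, 5, 10, 20)])
# Even p=0.95 drops below 0.36 after 20 rounds, which is why
# per-iteration verification matters.
```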