Deep Reinforcement Learning from Human Preferences
Tags: Paper, RLHF, Foundational
Paper Details
Authors: Christiano, Leike, Brown, Martic, Legg, Amodei
Year: 2017
Venue: NeurIPS
Citations: 2000+
📄 Read the full paper on arXiv →
Abstract
For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of the agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time.
Key Contributions
- RLHF framework: Introduces reinforcement learning from human feedback as a practical technique
- Preference learning: Shows that asking humans to compare pairs of trajectory segments is easier and more reliable than asking them to score individual outcomes
- Reward modeling: Demonstrates that a reward model can be learned from these pairwise preferences and then optimized by the agent (see the sketch after this list)
- Sample efficiency: Achieves results with minimal human feedback (less than 1% of interactions)
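The reward model at the center of this approach is fit directly to the pairwise comparisons. The following is a minimal PyTorch sketch of the preference loss described in the paper: per-step predicted rewards are summed over each segment, the probability that one segment is preferred is a softmax (Bradley-Terry style) over the two summed returns, and the model is trained with cross-entropy against the human's choice. The class and function names here (`RewardModel`, `preference_loss`, `obs_dim`, `act_dim`) are illustrative, not taken from the authors' code.

```python
# Minimal sketch (assumed names, not the authors' code) of the paper's
# preference-based reward-model objective.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps an (observation, action) pair to a scalar reward estimate."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step rewards of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(reward_model: RewardModel, seg1, seg2,
                    label: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss for one human comparison.

    seg1 and seg2 are (obs, act) tensor pairs for two trajectory segments;
    label is 0 if the human preferred seg1 and 1 if they preferred seg2.
    """
    return1 = reward_model(*seg1).sum()  # predicted return of segment 1
    return2 = reward_model(*seg2).sum()  # predicted return of segment 2
    # P(seg1 preferred) = exp(return1) / (exp(return1) + exp(return2))
    logits = torch.stack([return1, return2]).unsqueeze(0)
    return nn.functional.cross_entropy(logits, label.unsqueeze(0))
```

The paper additionally fits an ensemble of reward predictors, handles "equally preferable" and "can't tell" responses with soft or discarded labels, and normalizes the predicted reward before passing it to the RL algorithm; those details are omitted here for brevity.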
Summary
This paper introduces the core RLHF methodology that would later become standard for training language models. The key insight is that humans can compare two behaviors more reliably than they can assign absolute scores to them, and that these comparisons provide enough signal to train a reward model.
The approach has three main components, which together form the training loop sketched after this list:
- An agent that learns from a reward signal
- A reward predictor trained on human preferences
- A process for collecting human comparisons
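In the paper these components run asynchronously: the agent continually generates trajectory segments, a small fraction of segment pairs is sent to a human for comparison, and the reward predictor is periodically refit while the agent keeps training against its predicted reward (A2C for Atari, TRPO for the MuJoCo locomotion tasks). The sketch below shows that interaction as a simplified serial loop; every name in it (`env`, `policy`, `reward_model`, `ask_human`, `collect_segments`) is a hypothetical placeholder rather than an API from the paper.

```python
# Illustrative sketch only: all objects passed in are hypothetical placeholders
# standing in for the three components described above.
import random


def rlhf_loop(env, policy, reward_model, ask_human, collect_segments,
              num_iterations=1000, label_every=10):
    comparisons = []  # stored human judgments: (segment_a, segment_b, label)

    for it in range(num_iterations):
        # 1. Agent: roll out the current policy and cut trajectories into
        #    short segments (the paper uses clips of roughly 1-2 seconds).
        segments = collect_segments(env, policy, num_segments=64)

        # 2. Human comparisons: only a small fraction of segment pairs is ever
        #    shown to a person, who picks whichever looks better.
        if it % label_every == 0:
            seg_a, seg_b = random.sample(segments, 2)
            label = ask_human(seg_a, seg_b)  # 0, 1, or None for "can't tell"
            if label is not None:
                comparisons.append((seg_a, seg_b, label))

        # 3. Reward predictor + RL: refit the reward model on all comparisons,
        #    then update the policy with ordinary RL on the *predicted* reward.
        reward_model.fit(comparisons)
        policy.update(segments, reward_fn=reward_model.predict)
```

One design point from the paper worth noting: rather than sampling pairs uniformly at random as above, the authors select pairs on which an ensemble of reward predictors disagrees most, which makes better use of the limited human labeling budget.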
Impact
This paper is foundational to modern AI alignment practice. The RLHF technique it introduces was later used to train ChatGPT, Claude, and other major language models. Paul Christiano, the lead author, remains a central figure in alignment research.
Last updated: November 27, 2025