Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains AI systems to align with human preferences by using human feedback as a reward signal. It has become a standard approach for fine-tuning large language models such as ChatGPT and Claude.
Overview
RLHF addresses a fundamental challenge in AI training: defining what "good" behavior looks like. Rather than hand-coding a reward function, RLHF learns what humans prefer by having them compare and rate AI outputs. The approach was pioneered by Paul Christiano and colleagues in their 2017 paper, "Deep Reinforcement Learning from Human Preferences".
How It Works
Step 1: Supervised Fine-Tuning
A base language model is fine-tuned on examples of desired behavior, typically human-written responses to prompts.
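As a rough illustration, the sketch below computes the supervised fine-tuning objective: next-token cross-entropy on the human-written response, with the prompt tokens masked out. It is a minimal PyTorch sketch under stated assumptions, not a full training loop: the `model` call signature (returning per-token logits) and the single shared `prompt_len` are simplifications introduced here.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # Assumption: model(input_ids) returns next-token logits of shape
    # (batch, seq_len, vocab_size). The call signature is illustrative.
    logits = model(input_ids)
    # Shift so that the logits at position t predict the token at t + 1.
    pred = logits[:, :-1, :]
    target = input_ids[:, 1:]
    token_loss = F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        target.reshape(-1),
        reduction="none",
    ).view(target.shape)
    # Supervise only the human-written response, not the prompt.
    # (A single shared prompt_len per batch is a simplification.)
    response_mask = (
        torch.arange(target.size(1), device=target.device) >= prompt_len - 1
    ).float()
    return (token_loss * response_mask).sum() / (
        response_mask.sum() * target.size(0)
    )
```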
Step 2: Reward Model Training
Human evaluators compare pairs of AI responses and indicate which is better. These preferences train a reward model that predicts human preferences.
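A minimal sketch of the pairwise objective commonly used for this step, a Bradley-Terry-style loss that pushes the reward model to score the preferred response above the rejected one. The `reward_model` callable (mapping token-id sequences to one scalar score per sequence) is an assumed interface for illustration.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Assumption: reward_model maps a batch of token-id sequences to one
    # scalar score per sequence; name and signature are illustrative.
    score_chosen = reward_model(chosen_ids)      # shape (batch,)
    score_rejected = reward_model(rejected_ids)  # shape (batch,)
    # Bradley-Terry-style objective: maximize the log-probability that the
    # human-preferred response outscores the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```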
Step 3: Reinforcement Learning
The language model is further trained with reinforcement learning, typically Proximal Policy Optimization (PPO), to maximize the reward model's score while a KL-divergence penalty keeps it close to the original (reference) model.
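The sketch below shows the reward signal that PPO-based RLHF implementations commonly optimize: the reward model's score for the completed response, combined with a per-token KL penalty toward the reference model. Tensor names, shapes, and the `beta` coefficient are illustrative, and the rest of the PPO update (advantages, clipping, value function) is omitted.

```python
def shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    # policy_logprobs / ref_logprobs: per-token log-probabilities of the
    # sampled response under the current policy and the frozen reference
    # model, shape (batch, response_len).
    # rm_scores: reward model score per response, shape (batch,).
    # beta is an illustrative KL-penalty coefficient.
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token   # KL penalty applied at every token
    rewards[:, -1] += rm_scores      # reward model score added at the end
    return rewards                   # used as per-token rewards by PPO
```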
This three-step process was formalized and deployed at scale in the InstructGPT paper (Ouyang et al., 2022), which demonstrated that RLHF can make language models significantly more helpful and safer.
Limitations
- Reward hacking: Models may learn to exploit flaws in the reward model, scoring highly without actually being helpful
- Human evaluator limitations: Evaluators may miss subtle errors or be swayed by confident-sounding but incorrect or manipulative responses
- Preference inconsistency: Different evaluators often disagree, and individual judgments can vary, making the reward signal noisy
- Scalability: Collecting large volumes of high-quality human feedback is expensive and time-consuming
Alternatives and Extensions
- Constitutional AI - Developed by Anthropic; uses AI feedback guided by a written set of principles (a "constitution")
- RLAIF - Reinforcement Learning from AI Feedback, which replaces human preference labels with labels produced by an AI model
- DPO - Direct Preference Optimization, a simpler alternative that fits preferences directly without a separate reward model or RL loop (see the sketch after this list)
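As a rough sketch of why DPO is considered simpler: its published loss fits human preferences directly from log-probability margins against a frozen reference model, with no explicit reward model training and no PPO loop. The argument names and the `beta` value below are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Inputs are summed log-probabilities of the chosen / rejected responses
    # under the policy being trained and a frozen reference model; names and
    # the beta value are illustrative.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # The reward model is implicit: preferences are fit with a logistic loss
    # on the KL-regularized log-probability margins.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```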
Key Papers
- Christiano et al. (2017) - "Deep reinforcement learning from human preferences"
- Stiennon et al. (2020) - "Learning to summarize from human feedback"
- Ouyang et al. (2022) - "Training language models to follow instructions with human feedback"