Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains AI systems to align with human preferences by using human feedback as a reward signal. It has become a standard approach for fine-tuning large language models such as ChatGPT and Claude.
Overview
RLHF addresses a fundamental challenge in AI training: defining what "good" behavior looks like. Rather than hand-coding a reward function, RLHF learns what humans prefer by having them compare and rate AI outputs. The approach was pioneered by Paul Christiano and colleagues in their 2017 paper, "Deep Reinforcement Learning from Human Preferences".
How It Works
Step 1: Supervised Fine-Tuning
A base language model is fine-tuned on examples of desired behavior, typically human-written responses to prompts.
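As a rough illustration, the sketch below computes the supervised fine-tuning objective: next-token cross-entropy on the human-written response, with the prompt tokens masked out. It is a minimal PyTorch sketch under stated assumptions, not a full training loop: the `model` call signature (returning per-token logits) and the single shared `prompt_len` are simplifications introduced here.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    # Assumption: model(input_ids) returns next-token logits of shape
    # (batch, seq_len, vocab_size). The call signature is illustrative.
    logits = model(input_ids)
    # Shift so that the logits at position t predict the token at t + 1.
    pred = logits[:, :-1, :]
    target = input_ids[:, 1:]
    token_loss = F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),
        target.reshape(-1),
        reduction="none",
    ).view(target.shape)
    # Supervise only the human-written response, not the prompt.
    # (A single shared prompt_len per batch is a simplification.)
    response_mask = (
        torch.arange(target.size(1), device=target.device) >= prompt_len - 1
    ).float()
    return (token_loss * response_mask).sum() / (
        response_mask.sum() * target.size(0)
    )
```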
Step 2: Reward Model Training
Human evaluators compare pairs of AI responses and indicate which is better. These preferences train a reward model that predicts human preferences.
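A minimal sketch of the pairwise objective commonly used for this step, a Bradley-Terry-style loss that pushes the reward model to score the preferred response above the rejected one. The `reward_model` callable (mapping token-id sequences to one scalar score per sequence) is an assumed interface for illustration.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Assumption: reward_model maps a batch of token-id sequences to one
    # scalar score per sequence; name and signature are illustrative.
    score_chosen = reward_model(chosen_ids)      # shape (batch,)
    score_rejected = reward_model(rejected_ids)  # shape (batch,)
    # Bradley-Terry-style objective: maximize the log-probability that the
    # human-preferred response outscores the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```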
Step 3: Reinforcement Learning
The language model is further trained with reinforcement learning, typically Proximal Policy Optimization (PPO), to maximize the reward model's score while a KL-divergence penalty keeps it close to the original (reference) model.
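The sketch below shows the reward signal that PPO-based RLHF implementations commonly optimize: the reward model's score for the completed response, combined with a per-token KL penalty toward the reference model. Tensor names, shapes, and the `beta` coefficient are illustrative, and the rest of the PPO update (advantages, clipping, value function) is omitted.

```python
def shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    # policy_logprobs / ref_logprobs: per-token log-probabilities of the
    # sampled response under the current policy and the frozen reference
    # model, shape (batch, response_len).
    # rm_scores: reward model score per response, shape (batch,).
    # beta is an illustrative KL-penalty coefficient.
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token   # KL penalty applied at every token
    rewards[:, -1] += rm_scores      # reward model score added at the end
    return rewards                   # used as per-token rewards by PPO
```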
This three-step process was formalized and deployed at scale in the InstructGPT paper (Ouyang et al., 2022), which demonstrated that RLHF can make language models significantly more helpful and safer.
Limitations
- Reward hacking: Models may learn to exploit flaws in the reward model, scoring highly without actually being helpful
- Human evaluator limitations: Evaluators may miss subtle errors or be swayed by confident-sounding but incorrect or manipulative responses
- Preference inconsistency: Different evaluators often disagree, and individual judgments can vary, making the reward signal noisy
- Scalability: Collecting large volumes of high-quality human feedback is expensive and time-consuming
Alternatives and Extensions
- Constitutional AI - Developed by Anthropic; uses AI feedback guided by a written set of principles (a "constitution")
- RLAIF - Reinforcement Learning from AI Feedback, which replaces human preference labels with labels produced by an AI model
- DPO - Direct Preference Optimization, a simpler alternative that fits preferences directly without a separate reward model or RL loop (see the sketch after this list)
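As a rough sketch of why DPO is considered simpler: its published loss fits human preferences directly from log-probability margins against a frozen reference model, with no explicit reward model training and no PPO loop. The argument names and the `beta` value below are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Inputs are summed log-probabilities of the chosen / rejected responses
    # under the policy being trained and a frozen reference model; names and
    # the beta value are illustrative.
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # The reward model is implicit: preferences are fit with a logistic loss
    # on the KL-regularized log-probability margins.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```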
Key Papers
- Christiano et al. (2017) - "Deep reinforcement learning from human preferences"
- Stiennon et al. (2020) - "Learning to summarize from human feedback"
- Ouyang et al. (2022) - "Training language models to follow instructions with human feedback"