Reinforcement Learning from Human Feedback (RLHF)

RLHF
Type: Training Method
Status: Deployed
First Used: 2017
Key Orgs: Anthropic, OpenAI, DeepMind

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that trains AI systems to align with human preferences by using human feedback as a reward signal. It has become a standard approach for training large language models like ChatGPT and Claude.

Overview

RLHF addresses a fundamental challenge in AI training: defining what "good" behavior looks like. Rather than hand-coding a reward function, RLHF learns what humans prefer by having them compare and rate AI outputs. The approach was pioneered by Paul Christiano and colleagues in the 2017 paper "Deep Reinforcement Learning from Human Preferences".

How It Works

Step 1: Supervised Fine-Tuning

A base language model is fine-tuned on examples of desired behavior, typically human-written responses to prompts.
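
A minimal sketch of this step in PyTorch, assuming a Hugging Face-style causal language model whose forward pass returns .logits; the prompt_len masking convention is an illustrative assumption, not a fixed API:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_len):
    """Next-token cross-entropy computed only on the human-written response."""
    logits = model(input_ids).logits          # (batch, seq_len, vocab) - assumed output shape
    shift_logits = logits[:, :-1, :]          # predict token t+1 from tokens <= t
    shift_labels = input_ids[:, 1:].clone()
    # Mask the prompt tokens so only the demonstration response is supervised
    shift_labels[:, : prompt_len - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```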

Step 2: Reward Model Training

Human evaluators compare pairs of AI responses and indicate which is better. These preferences train a reward model that predicts human preferences.
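
The reward model is typically fit to these comparisons with a pairwise (Bradley-Terry) loss. A minimal sketch, assuming reward_model maps a batch of token sequences to one scalar score per sequence:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: the preferred response should score higher."""
    r_chosen = reward_model(chosen_ids)       # (batch,) scores for preferred responses
    r_rejected = reward_model(rejected_ids)   # (batch,) scores for rejected responses
    # Maximize log P(chosen preferred) = log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```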

Step 3: Reinforcement Learning

The language model is further trained using reinforcement learning (typically PPO) to maximize the reward model's score while staying close to the original model.
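
The "staying close to the original model" constraint is commonly implemented as a KL penalty against a frozen copy of the supervised model. A sketch of the shaped reward that the PPO update then maximizes, with an illustrative kl_coef value:

```python
def rlhf_reward(reward_model_score, logprobs_policy, logprobs_ref, kl_coef=0.1):
    """Reward optimized in the RL step: reward model score minus a KL penalty.

    reward_model_score : (batch,) scalar scores from the learned reward model
    logprobs_policy    : (batch, seq_len) log-probs of the sampled tokens under the policy
    logprobs_ref       : (batch, seq_len) log-probs of the same tokens under the frozen model
    """
    # Monte Carlo estimate of per-sequence KL(policy || reference) on the sampled tokens
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return reward_model_score - kl_coef * kl
```

The resulting scalar serves as the episode return in an otherwise standard PPO update of the policy.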

This three-step process was formalized and deployed at scale in OpenAI's 2022 InstructGPT paper (Ouyang et al.), which demonstrated that RLHF could make language models significantly more helpful and safer.

Limitations

  • Reward hacking: Models may find ways to score high on the reward model without actually being helpful
  • Human evaluator limitations: Humans may not recognize subtle errors or manipulation
  • Preference inconsistency: Evaluators often disagree with one another, and an individual evaluator's judgments can vary over time
  • Scalability: Collecting human feedback is expensive and time-consuming

Alternatives and Extensions

  • Constitutional AI - Uses AI feedback guided by principles, developed by Anthropic
  • RLAIF - Reinforcement Learning from AI Feedback, which substitutes AI-generated preference labels for human ones
  • DPO - Direct Preference Optimization, a simpler alternative that learns directly from preference pairs without a separate reward model or RL loop (see the sketch below)
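
DPO collapses the reward model and the PPO loop into a single classification-style loss on preference pairs. A minimal sketch, where each argument is the summed log-probability of a full response under the policy or the frozen reference model, and beta is an illustrative temperature:

```python
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Direct Preference Optimization: push the policy/reference log-ratio
    of preferred responses above that of rejected ones."""
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```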

Key Papers

  • Christiano et al. (2017), "Deep Reinforcement Learning from Human Preferences"
  • Ouyang et al. (2022), "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT)

See Also

Last updated: November 27, 2025