Training language models to follow instructions with human feedback
Tags: Paper · RLHF · OpenAI
Paper Details
Authors: Ouyang, Wu, Jiang, et al.
Year: 2022
Venue: NeurIPS
Also Known As: InstructGPT Paper
📄 Read the full paper on arXiv →
Abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback (RLHF).
Key Contributions
- InstructGPT: First major deployment of RLHF at scale for language models
- Human preference data: Large-scale collection of human rankings of model outputs
- Alignment improvements: Demonstrated that smaller aligned models can be preferred over larger unaligned ones
- Three-step process: Established the SFT → RM → PPO pipeline for RLHF
The InstructGPT Process
Step 1: Supervised Fine-Tuning (SFT)
Human labelers write high-quality responses to prompts. GPT-3 is fine-tuned on these demonstrations.
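In code terms, this step is ordinary next-token cross-entropy training on the demonstrations. A minimal sketch, assuming a Hugging Face-style causal LM (GPT-2 as a public stand-in, since GPT-3 is not openly available) and a hypothetical `demonstrations` list of prompt/response pairs; the actual InstructGPT training code is not public:

```python
# Sketch of Step 1 (SFT): fine-tune a causal LM on labeler demonstrations.
# Model name and data format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # public stand-in; InstructGPT starts from GPT-3
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_loss(prompt: str, response: str) -> torch.Tensor:
    """Next-token cross-entropy on a single demonstration, masked so the
    loss only covers the response tokens (a common convention)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the loss
    return model(input_ids=full_ids, labels=labels).loss

# Hypothetical demonstration data written by labelers.
demonstrations = [
    {"prompt": "Explain the moon landing to a 6 year old.\n",
     "response": "People flew to the moon in a big rocket and walked on it."},
]
for example in demonstrations:
    loss = sft_loss(example["prompt"], example["response"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```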
Step 2: Reward Model Training
Labelers rank multiple model outputs for the same prompt. These rankings train a reward model to predict human preferences.
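Concretely, each ranking is broken into pairwise comparisons and the reward model is trained to give the preferred output a higher scalar score, using the loss -log σ(r(x, y_w) − r(x, y_l)) from the paper; a ranking of K outputs yields K(K−1)/2 such pairs. The sketch below is an illustration under stated assumptions: the paper initializes the reward model from the SFT model, while here a small public backbone with a scalar value head stands in.

```python
# Sketch of Step 2: a reward model trained with a pairwise ranking loss.
# Backbone choice and pooling strategy are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1          # index of final token
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(pooled).squeeze(-1)    # scalar score per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()

def score(prompt: str, response: str) -> torch.Tensor:
    enc = tokenizer(prompt + response, return_tensors="pt")
    return rm(enc["input_ids"], enc["attention_mask"])[0]

def pairwise_loss(prompt: str, chosen: str, rejected: str) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)): push the preferred output's
    score above the dispreferred one's."""
    return -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected))
```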
Step 3: RL Fine-Tuning
The SFT model is further trained with PPO to maximize the reward model's score, with a KL penalty against the SFT model that keeps the policy from drifting too far from it.
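The objective from the paper (omitting the optional pretraining-gradient term) is to maximize r_θ(x, y) − β·log(π_RL(y|x)/π_SFT(y|x)) over sampled responses. The sketch below shows only the KL-shaped reward that PPO then maximizes; distributing the KL term per token follows common practice rather than a published implementation detail, and the β value and tensors are illustrative.

```python
# Sketch of Step 3's reward shaping: reward-model score minus a KL penalty
# against the frozen SFT (reference) policy. The PPO update itself is omitted.
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     sft_logprobs: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Per-token rewards for one sampled response.

    rm_score        -- scalar r_theta(x, y) from the trained reward model
    policy_logprobs -- log pi_RL(y_t | x, y_<t) per generated token, shape [T]
    sft_logprobs    -- log pi_SFT(y_t | x, y_<t) for the same tokens, shape [T]
    beta            -- KL penalty coefficient (illustrative value)
    """
    rewards = -beta * (policy_logprobs - sft_logprobs)  # per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score                # RM score at final token
    return rewards

# Toy usage with made-up log-probabilities for a 4-token response.
print(kl_shaped_reward(torch.tensor(1.7),
                       torch.tensor([-1.2, -0.8, -2.1, -0.5]),
                       torch.tensor([-1.5, -0.9, -1.8, -0.7])))
```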
Key Findings
- Outputs from the 1.3B-parameter InstructGPT were preferred to outputs from the 175B-parameter GPT-3, despite the model having over 100x fewer parameters
- InstructGPT showed improvements in truthfulness and reductions in toxic output
- The approach generalized to held-out labelers and prompts
- Some "alignment tax" - slight regression on some NLP benchmarks
Impact
This paper established the template for training modern AI assistants. The InstructGPT methodology was the foundation for ChatGPT and influenced Claude's training. It demonstrated that RLHF could produce dramatically more useful and aligned language models.
Last updated: November 27, 2025