Training language models to follow instructions with human feedback
Tags: Paper · RLHF · OpenAI
Paper Details
Authors: Ouyang, Wu, Jiang, et al.
Year: 2022
Venue: NeurIPS
Also Known As: InstructGPT Paper
📄 Read the full paper on arXiv →
Abstract
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback (RLHF).
Key Contributions
- InstructGPT: First major deployment of RLHF at scale for language models
- Human preference data: Large-scale collection of human rankings of model outputs
- Alignment improvements: Demonstrated that smaller aligned models can be preferred over larger unaligned ones
- Three-step process: Established the SFT → RM → PPO pipeline for RLHF
The InstructGPT Process
Step 1: Supervised Fine-Tuning (SFT)
Human labelers write high-quality responses to prompts. GPT-3 is fine-tuned on these demonstrations.
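In code terms, this step is ordinary next-token cross-entropy training on the demonstrations. A minimal sketch, assuming a Hugging Face-style causal LM (GPT-2 as a public stand-in, since GPT-3 is not openly available) and a hypothetical `demonstrations` list of prompt/response pairs; the actual InstructGPT training code is not public:

```python
# Sketch of Step 1 (SFT): fine-tune a causal LM on labeler demonstrations.
# Model name and data format are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # public stand-in; InstructGPT starts from GPT-3
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_loss(prompt: str, response: str) -> torch.Tensor:
    """Next-token cross-entropy on a single demonstration, masked so the
    loss only covers the response tokens (a common convention)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # -100 = ignored by the loss
    return model(input_ids=full_ids, labels=labels).loss

# Hypothetical demonstration data written by labelers.
demonstrations = [
    {"prompt": "Explain the moon landing to a 6 year old.\n",
     "response": "People flew to the moon in a big rocket and walked on it."},
]
for example in demonstrations:
    loss = sft_loss(example["prompt"], example["response"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```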
Step 2: Reward Model Training
Labelers rank multiple model outputs for the same prompt. These rankings train a reward model to predict human preferences.
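Concretely, each ranking is broken into pairwise comparisons and the reward model is trained to give the preferred output a higher scalar score, using the loss -log σ(r(x, y_w) − r(x, y_l)) from the paper; a ranking of K outputs yields K(K−1)/2 such pairs. The sketch below is an illustration under stated assumptions: the paper initializes the reward model from the SFT model, while here a small public backbone with a scalar value head stands in.

```python
# Sketch of Step 2: a reward model trained with a pairwise ranking loss.
# Backbone choice and pooling strategy are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last = attention_mask.sum(dim=1) - 1          # index of final token
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(pooled).squeeze(-1)    # scalar score per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()

def score(prompt: str, response: str) -> torch.Tensor:
    enc = tokenizer(prompt + response, return_tensors="pt")
    return rm(enc["input_ids"], enc["attention_mask"])[0]

def pairwise_loss(prompt: str, chosen: str, rejected: str) -> torch.Tensor:
    """-log sigmoid(r(x, y_w) - r(x, y_l)): push the preferred output's
    score above the dispreferred one's."""
    return -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected))
```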
Step 3: RL Fine-Tuning
The SFT model is further trained with PPO to maximize the reward model's score, with a KL penalty against the SFT model that keeps the policy from drifting too far from it.
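The objective from the paper (omitting the optional pretraining-gradient term) is to maximize r_θ(x, y) − β·log(π_RL(y|x)/π_SFT(y|x)) over sampled responses. The sketch below shows only the KL-shaped reward that PPO then maximizes; distributing the KL term per token follows common practice rather than a published implementation detail, and the β value and tensors are illustrative.

```python
# Sketch of Step 3's reward shaping: reward-model score minus a KL penalty
# against the frozen SFT (reference) policy. The PPO update itself is omitted.
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     sft_logprobs: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Per-token rewards for one sampled response.

    rm_score        -- scalar r_theta(x, y) from the trained reward model
    policy_logprobs -- log pi_RL(y_t | x, y_<t) per generated token, shape [T]
    sft_logprobs    -- log pi_SFT(y_t | x, y_<t) for the same tokens, shape [T]
    beta            -- KL penalty coefficient (illustrative value)
    """
    rewards = -beta * (policy_logprobs - sft_logprobs)  # per-token KL penalty
    rewards[-1] = rewards[-1] + rm_score                # RM score at final token
    return rewards

# Toy usage with made-up log-probabilities for a 4-token response.
print(kl_shaped_reward(torch.tensor(1.7),
                       torch.tensor([-1.2, -0.8, -2.1, -0.5]),
                       torch.tensor([-1.5, -0.9, -1.8, -0.7])))
```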
Key Findings
- Outputs from the 1.3B-parameter InstructGPT were preferred to outputs from the 175B-parameter GPT-3, despite the model having over 100x fewer parameters
- InstructGPT showed improvements in truthfulness and reductions in toxic output
- The approach generalized to held-out labelers and prompts
- Some "alignment tax" - slight regression on some NLP benchmarks
Impact
This paper established the template for training modern AI assistants. The InstructGPT methodology was the foundation for ChatGPT and influenced Claude's training. It demonstrated that RLHF could produce dramatically more useful and aligned language models.
Last updated: November 27, 2025