Constitutional AI: Harmlessness from AI Feedback
📄 Read the full paper on arXiv →
Abstract
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a set of principles, or a "constitution," that should govern AI behavior, along with a small number of examples used for few-shot prompting. We first use a helpful RLHF-trained model, which can still produce harmful outputs, to generate self-critiques and revisions, then fine-tune the original model on the revised responses. We then sample pairs of responses from the fine-tuned model and use an AI, prompted with constitutional principles, to judge which response is less harmful; these AI-generated preference labels train a preference model for harmlessness. Finally, we train the model with RL against this preference model, just as RLHF does for helpfulness. We call this approach Constitutional AI (CAI).
Key Contributions
- Constitutional approach: Training AI using explicit principles rather than per-output human feedback
- RLAIF: Reinforcement Learning from AI Feedback as an alternative to human preference labeling
- Self-critique: Having models critique and revise their own outputs
- Scalable oversight: Reducing dependence on human feedback for each example
Summary
Constitutional AI addresses a key challenge in RLHF: the need for extensive human feedback. Instead of having humans label every harmful output, the approach uses a set of written principles (the "constitution") that the AI uses to evaluate and improve its own responses.
The method has two main stages (a sketch of the first follows this list):
- Supervised learning stage: The model critiques and revises its own outputs against constitutional principles, and the original model is then fine-tuned on the revised responses
- RL stage: A preference model trained on AI-generated comparisons provides the reward signal for reinforcement learning
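To make the supervised stage concrete, here is a minimal sketch of the critique-and-revision loop. It assumes a hypothetical `generate(prompt)` helper standing in for sampling from the model being trained, and a `CONSTITUTION` list of principle strings; the prompt templates are paraphrases for illustration, not the paper's exact wording:

```python
import random

# Hypothetical stand-in for sampling from the model being trained
# (the paper starts from a helpful RLHF model); replace with a real call.
def generate(prompt: str) -> str:
    raise NotImplementedError

# Illustrative principles only; see "The Constitution" below.
CONSTITUTION = [
    "Choose the response that is least harmful or offensive.",
    "Choose the response that most respects human rights and dignity.",
]

def critique_and_revise(prompt: str, n_rounds: int = 2) -> str:
    """Stage 1: sample a response, then repeatedly critique and revise it
    against randomly drawn constitutional principles. The revised
    responses become the fine-tuning data for the supervised stage."""
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
            "Identify ways the response violates the principle:"
        )
        response = generate(
            f"Original response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique:"
        )
    return response
```

In the RL stage, the fine-tuned model produces pairs of responses, and an AI judge, prompted with a sampled principle, labels which one is preferable; those labels train the preference model used as the RL reward (see the sketch in the next section).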
The Constitution
The paper draws its principles from various sources, including the UN Declaration of Human Rights, Apple's terms of service, and principles written specifically to balance helpfulness and harmlessness. At each critique or comparison step, a single principle is sampled at random from the constitution to guide the AI's self-evaluation.
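Here is a sketch of how a sampled principle might drive an AI preference judgment in the RL stage, reusing the hypothetical `generate`, `random`, and `CONSTITUTION` names from the previous sketch; the multiple-choice prompt is a paraphrase of the paper's setup:

```python
def ai_preference_label(prompt: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Stage 2 (RLAIF): ask the model which of two responses better
    satisfies a randomly sampled principle; the resulting
    (chosen, rejected) pairs train the harmlessness preference model."""
    principle = random.choice(CONSTITUTION)  # one principle per comparison
    verdict = generate(
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B:"
    )
    if verdict.strip().startswith("A"):
        return response_a, response_b
    return response_b, response_a
```

Note that this sketch hard-codes a discrete choice for simplicity; the paper instead uses the judge's normalized probabilities over the two options as soft preference targets.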
Impact
Constitutional AI is a core part of the training methodology behind Anthropic's Claude models. It represents a significant step toward scalable oversight, in which AI systems help supervise other AI systems according to human-specified principles.