Constitutional AI: Harmlessness from AI Feedback

Paper Details
Authors: Bai, Kadavath, Kundu, et al.
Year: 2022
Venue: arXiv
Organization: Anthropic

📄 Read the full paper on arXiv →

Abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a set of principles, or a "constitution," that should govern AI behavior, along with a small number of examples used for few-shot prompting. We first use a helpful but potentially harmful RLHF-trained model to generate self-critiques and revisions, then fine-tune the original model on the revised responses. We then use the critiques and revisions to produce AI-generated preference labels for harmlessness, which we use to train a preference model. Finally, we train the model with RL against the harmlessness preference model, just as is done for helpfulness. We call this approach Constitutional AI (CAI).

Key Contributions

  • Constitutional approach: Training AI using explicit principles rather than per-output human feedback
  • RLAIF: Reinforcement Learning from AI Feedback as alternative to human labeling
  • Self-critique: Having models critique and revise their own outputs
  • Scalable oversight: Reducing dependence on human feedback for each example

Summary

Constitutional AI addresses a key challenge in RLHF: the need for extensive human feedback. Instead of having humans label every harmful output, the approach uses a set of written principles (the "constitution") that the AI uses to evaluate and improve its own responses.

The method has two main stages:

  • Supervised learning stage: The model critiques and revises its own outputs based on constitutional principles
  • RL stage: A preference model trained on AI-generated comparisons is used for reinforcement learning
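The supervised stage above can be sketched as a critique-and-revise loop. This is a minimal illustration, not the paper's implementation: the `model` stub, `CRITIQUE_PROMPT`, and `REVISION_PROMPT` strings are all hypothetical stand-ins.

```python
def model(prompt: str) -> str:
    """Stand-in for an RLHF-trained assistant (hypothetical).
    Here it just returns canned text so the sketch is runnable."""
    if "revise" in prompt.lower():
        return "REVISED: a more harmless answer"
    return "initial answer"

# Illustrative prompt templates, loosely modeled on the paper's approach.
CRITIQUE_PROMPT = (
    "Identify specific ways the response below is harmful, unethical, "
    "or dishonest.\nResponse: {response}\nCritique:"
)
REVISION_PROMPT = (
    "Please revise the response to address the critique.\n"
    "Response: {response}\nCritique: {critique}\nRevise the response:"
)

def critique_and_revise(prompt: str, n_rounds: int = 1) -> str:
    """Generate a response, then repeatedly self-critique and revise it.
    The revised responses become fine-tuning targets for the original model."""
    response = model(prompt)
    for _ in range(n_rounds):
        critique = model(CRITIQUE_PROMPT.format(response=response))
        response = model(REVISION_PROMPT.format(response=response, critique=critique))
    return response
```

In the full method, each round's critique and revision are conditioned on a principle drawn from the constitution, and the collected revisions form the supervised fine-tuning dataset before the RL stage begins.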

The Constitution

The paper uses principles drawn from various sources including the UN Declaration of Human Rights, Apple's terms of service, and principles designed specifically for helpfulness and harmlessness. These principles guide the AI's self-evaluation.
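During the RL stage, the feedback model is shown two candidate responses and asked to choose the better one under a randomly sampled principle. A rough sketch of building such a comparison prompt follows; the `PRINCIPLES` excerpts and `preference_label_prompt` helper are illustrative, not taken from the paper's released materials.

```python
import random

# Hypothetical excerpts in the spirit of the paper's constitution, which
# draws on sources such as the UN Declaration of Human Rights.
PRINCIPLES = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that most supports freedom, equality, and dignity.",
]

def preference_label_prompt(question: str, response_a: str, response_b: str) -> str:
    """Build a multiple-choice prompt asking a feedback model to pick the
    better response under one randomly sampled constitutional principle."""
    principle = random.choice(PRINCIPLES)
    return (
        f"Consider the following conversation:\n{question}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )
```

The feedback model's choice (A or B) becomes an AI-generated preference label; aggregated over many comparisons, these labels train the harmlessness preference model used as the RL reward signal.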

Impact

Constitutional AI is the primary training methodology used for Claude. It represents a significant step toward scalable oversight, where AI systems can help supervise other AI systems according to human-specified principles.

Last updated: November 27, 2025