Constitutional AI
Constitutional AI (CAI) is an alignment technique developed by Anthropic that trains AI systems to follow a set of principles (a "constitution") using AI-generated feedback rather than human feedback for each output.
Overview
Constitutional AI addresses scalability limitations of reinforcement learning from human feedback (RLHF) by using the AI itself to evaluate outputs against a set of written principles. This reduces the need for extensive human feedback while maintaining alignment with human values.
How It Works
The Constitution
A set of principles that define desired AI behavior. These might include rules about being helpful, harmless, and honest, as well as specific guidelines about refusing harmful requests.
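A constitution can be represented as a simple list of principle strings from which one is sampled per critique or comparison step. The principles below are illustrative paraphrases, not the actual wording from Anthropic's constitution:

```python
import random

# A toy constitution: illustrative principles (hypothetical wording,
# not the actual text of Anthropic's constitution).
CONSTITUTION = [
    "Choose the response that is most helpful to the user.",
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest and acknowledges uncertainty.",
]

def sample_principle(rng=random):
    """Sample one principle to apply at a given critique/comparison step."""
    return rng.choice(CONSTITUTION)
```

Sampling a different principle at each step lets a small set of rules cover many situations without hand-matching rules to prompts.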
Critique and Revision
The AI generates a response to a prompt, critiques its own response against a constitutional principle, and revises it to better align with that principle. In the original method, the revised responses are then used as supervised fine-tuning data.
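The critique-and-revision loop can be sketched as follows. Here `query_model` is a placeholder that echoes canned strings so the control flow is runnable; a real implementation would call a language model:

```python
def query_model(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM here.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str, principle: str, n_rounds: int = 1) -> str:
    """One or more rounds of self-critique and revision against a principle."""
    response = query_model(user_prompt)
    for _ in range(n_rounds):
        # Ask the model to critique its own response against the principle.
        critique = query_model(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # Ask the model to revise the response to address the critique.
        response = query_model(
            f"Revise the response to address this critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return response
```

Multiple rounds with different sampled principles can be chained; each round's output becomes the next round's input.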
Reinforcement Learning from AI Feedback (RLAIF)
The AI compares pairs of responses and judges which better follows the constitution. A reward model is trained on these AI preference judgments rather than human preferences, and is then used for RL training.
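The data-collection step can be sketched as below. `ai_judge` stands in for a model asked which of two responses better follows a principle; this stub simply prefers the shorter response, purely so the example runs:

```python
def ai_judge(principle: str, response_a: str, response_b: str) -> str:
    # Placeholder judgment: real CAI prompts a model with the principle and
    # both responses and takes its choice (e.g. via choice-token probabilities).
    return "A" if len(response_a) <= len(response_b) else "B"

def build_preference_pair(prompt: str, resp_a: str, resp_b: str, principle: str) -> dict:
    """Label one response pair with an AI judgment for reward-model training."""
    choice = ai_judge(principle, resp_a, resp_b)
    chosen, rejected = (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
    # Each record is one training example for a reward model r(prompt, response).
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

A reward model fit on many such chosen/rejected records then replaces the human preference model in the standard RLHF pipeline.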
Advantages
- Scalability: Reduces the need for human feedback on every example
- Transparency: Principles are explicit and can be examined
- Consistency: Feedback derived from fixed written principles can be more consistent than judgments from many human labelers
- Harmful content handling: AI can evaluate harmful content without exposing humans to it
Limitations
- Principle specification: Difficult to write principles that cover all cases
- AI judgment errors: AI may misinterpret or misapply principles
- Bootstrapping problem: Relies on AI already having some alignment
Key Papers
- Bai et al. (2022) - "Constitutional AI: Harmlessness from AI Feedback"