Theories & Approaches

Technical approaches to ensuring AI systems remain beneficial and aligned with human values.

Corrigibility

The property of an AI system that allows operators to correct, modify, or shut it down without resistance.

Active Research
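The shutdown side of corrigibility can be illustrated with a deliberately toy control loop: the agent acts to gather reward, but an operator's shutdown signal overrides everything rather than being traded off against expected reward. All names here (`corrigible_run`, `shutdown_requested`) are hypothetical; real corrigibility research is about the agent's *incentives*, which this stub side-steps by hard-wiring the override.

```python
# Toy sketch of a corrigible control loop. Real corrigibility work asks why
# a capable agent would *want* to allow this; the stub just hard-wires it.

def corrigible_run(policy, max_steps, shutdown_requested):
    """Run `policy` for up to `max_steps`, halting immediately on shutdown."""
    actions = []
    for t in range(max_steps):
        if shutdown_requested(t):   # operator override is checked first...
            break                   # ...and obeyed without trade-off
        actions.append(policy(t))
    return actions

# The agent would act for 10 steps, but the operator requests shutdown at t = 4.
acts = corrigible_run(policy=lambda t: f"act_{t}",
                      max_steps=10,
                      shutdown_requested=lambda t: t >= 4)
```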
Interpretability

Techniques for understanding how AI systems make decisions and what they have learned.

Active Research
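For the simplest possible model class this is easy, which makes a useful baseline: a linear scorer's decision decomposes exactly into per-feature contributions. The weights and inputs below are invented for illustration; deep networks admit no such exact decomposition, which is what makes interpretability an open problem.

```python
# Simplest-case attribution sketch: for a linear scorer, the output
# decomposes exactly into per-feature contributions (weight * input).
import numpy as np

w = np.array([2.0, -1.0, 0.5])               # hypothetical learned weights
x = np.array([1.0, 3.0, 4.0])                # one input example

contributions = w * x                        # per-feature contribution
score = contributions.sum()                  # identical to the model output w @ x
ranked = np.argsort(-np.abs(contributions))  # most influential feature first
```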
RLHF

Reinforcement Learning from Human Feedback: fine-tuning AI systems against a reward model learned from human preference comparisons between outputs.

Deployed
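The reward-modeling step of RLHF can be sketched in isolation (the later policy-optimization step is omitted): fit a reward function so that preferred items score higher, using the standard Bradley-Terry pairwise objective. Everything below is synthetic: a linear reward, random features, and noise-free preference labels generated from a hidden "human" weight vector.

```python
# Reward-modeling sketch: fit r(x) = w @ x to pairwise preferences with the
# Bradley-Terry model, P(a preferred over b) = sigmoid(r(a) - r(b)).
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])                    # the "human's" hidden values
X = rng.normal(size=(200, 2))
pairs = [(X[i], X[i + 1]) for i in range(0, 200, 2)]
# Noise-free labels: `a` is the preferred item in each pair.
prefs = [(a, b) if true_w @ a > true_w @ b else (b, a) for a, b in pairs]

w = np.zeros(2)
lr = 0.5
for _ in range(200):                              # gradient ascent on log-likelihood
    grad = np.zeros(2)
    for a, b in prefs:
        p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))
        grad += (1.0 - p) * (a - b)
    w += lr * grad / len(prefs)

# Fraction of pairs the learned reward ranks the same way the "human" did.
agree = sum((w @ a) > (w @ b) for a, b in prefs) / len(prefs)
```

In deployed RLHF the linear scorer is replaced by a neural reward model over text, and the learned reward then drives reinforcement learning of the policy.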
Constitutional AI

Training AI systems to critique and revise their own outputs against a written set of principles, reducing the need for direct human feedback on each output.

Deployed
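A stripped-down critique-and-revise pass, the core loop of the approach, might look like the following. The principle, critique, and revision here are string stubs standing in for language-model calls; real systems use a list of principles and AI-generated feedback.

```python
# Toy critique-and-revise pass in the spirit of Constitutional AI.
# All components are stand-ins for LLM calls, not a real implementation.

PRINCIPLE = "Do not reveal secrets."

def critique(response):
    """Check the response against the principle; return an issue or None."""
    return "response leaks a secret" if "SECRET" in response else None

def revise(response):
    """Rewrite the response so the critique no longer applies."""
    return response.replace("SECRET", "[redacted]")

def constitutional_step(response):
    """One self-critique pass: flag violations of the principle, then revise."""
    return revise(response) if critique(response) else response

out = constitutional_step("The launch code is SECRET-1234.")
```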
AI Safety via Debate

Using adversarial debates between AI systems to help humans judge complex questions.

Active Research
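A toy version of the setup: two "debaters" argue opposite answers by citing evidence, and a limited "judge" verifies only the cited items rather than evaluating the whole question itself. Every component below is an invented stand-in for a learned model.

```python
# Toy debate: the judge is weaker than the debaters and can only check
# citations, which is the situation debate is meant to exploit.

facts = {"f1": True, "f2": True, "f3": False}     # hypothetical ground truth

def debater(claim, budget=2):
    """Cite up to `budget` facts that support answering `claim`."""
    return [name for name, truth in facts.items() if truth == claim][:budget]

def judge(cites_true, cites_false):
    """Check only the citations; side with the better-supported answer."""
    support_true = sum(facts[c] for c in cites_true)
    support_false = sum(not facts[c] for c in cites_false)
    return support_true >= support_false

verdict = judge(debater(True), debater(False))
```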
Iterated Amplification

Recursively training AI systems by decomposing hard problems into easier subproblems.

Theoretical
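The decomposition idea can be sketched directly: a question too hard for the base model is split into subquestions it *can* answer, and the answers are recombined. Summing a list is a deliberately trivial stand-in for "a hard question"; the scheme's interest lies in decompositions for questions humans cannot evaluate directly.

```python
# Decomposition sketch behind iterated amplification.

def weak_solver(xs):
    """The base model: only trusted on inputs of at most two items."""
    assert len(xs) <= 2
    return sum(xs)

def amplified(xs):
    """Decompose, solve the subproblems (recursively), recombine."""
    if len(xs) <= 2:
        return weak_solver(xs)
    mid = len(xs) // 2
    return amplified(xs[:mid]) + amplified(xs[mid:])

total = amplified(list(range(10)))   # the weak solver never sees > 2 items
```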
Cooperative IRL

Cooperative Inverse Reinforcement Learning: the AI learns human preferences through collaborative interaction with the human rather than passive observation.

Active Research
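The inference at the heart of the approach can be shown in miniature: the robot keeps a belief over what the human values and updates it from observed choices, assuming the human tends to pick higher-reward options (a softmax choice model). The options, candidate reward functions, and features below are all invented.

```python
# Toy Bayesian preference inference from one observed human choice.
import math

options = {"apple": [1.0, 0.0], "cake": [0.0, 1.0]}          # feature vectors
candidates = {"health": [2.0, -1.0], "taste": [-1.0, 2.0]}   # possible human values
belief = {"health": 0.5, "taste": 0.5}                       # robot's prior

def update(belief, chosen):
    """Bayes rule: P(theta | choice) is proportional to P(choice | theta) P(theta)."""
    posterior = {}
    for name, theta in candidates.items():
        scores = {o: sum(t * f for t, f in zip(theta, feats))
                  for o, feats in options.items()}
        z = sum(math.exp(s) for s in scores.values())        # softmax normalizer
        posterior[name] = belief[name] * math.exp(scores[chosen]) / z
    total = sum(posterior.values())
    return {k: v / total for k, v in posterior.items()}

belief = update(belief, "apple")   # one healthy choice shifts belief sharply
```

Full cooperative IRL additionally has the human account for being observed and the robot plan under its uncertainty; this stub shows only the belief update.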
Recursive Reward Modeling

Using AI assistants to help humans provide better feedback for training AI systems.

Active Research
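The core move, in miniature: the human cannot judge a long output directly, so AI assistants evaluate its pieces and the reward signal is assembled from their verdicts. The arithmetic "checker" below is a toy stand-in for a trained assistant model.

```python
# Assembling human-level feedback from per-claim assistant verdicts.

def assistant_check(claim):
    """One assistant's verdict on one claim of the form 'expr=value'."""
    expr, value = claim.split("=")
    return eval(expr) == int(value)     # toy verifier, not a real model

def assisted_reward(answer_claims):
    """Reward for a long answer: fraction of its claims that hold up."""
    verdicts = [assistant_check(c) for c in answer_claims]
    return sum(verdicts) / len(verdicts)

reward = assisted_reward(["2+2=4", "3*3=9", "5-1=3"])   # one claim is false
```

The "recursive" part is that the assistants are themselves trained with this kind of assisted feedback, one level down.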
Process-Oriented Learning

Training AI to follow good reasoning processes rather than just produce correct outputs.

Active Research
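The contrast with outcome-only training can be made concrete: outcome scoring checks only the final answer, while process scoring also credits each intermediate step, so a lucky but broken derivation scores lower. The "valid step" rule and the equal weighting below are toy assumptions.

```python
# Outcome-only vs. process-oriented scoring of a reasoning trace.

def step_ok(prev, step):
    """Whether `step` legitimately follows from `prev` (toy rule: +1)."""
    return step == prev + 1

def outcome_score(trace, target):
    return 1.0 if trace[-1] == target else 0.0

def process_score(trace, target):
    valid = [step_ok(a, b) for a, b in zip(trace, trace[1:])]
    return 0.5 * (sum(valid) / len(valid)) + 0.5 * outcome_score(trace, target)

sound = process_score([1, 2, 3, 4], target=4)   # valid steps, right answer
lucky = process_score([1, 2, 9, 4], target=4)   # right answer, broken steps
```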