Theories & Approaches

Technical approaches to ensuring AI systems remain beneficial and aligned with human values.

Corrigibility

The property of an AI system that allows operators to correct, modify, or shut it down without resistance.

Active Research
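The shutdown side of corrigibility can be illustrated with a deliberately toy control loop: the agent acts to gather reward, but an operator's shutdown signal overrides everything rather than being traded off against expected reward. All names here (`corrigible_run`, `shutdown_requested`) are hypothetical; real corrigibility research is about the agent's *incentives*, which this stub side-steps by hard-wiring the override.

```python
# Toy sketch of a corrigible control loop. Real corrigibility work asks why
# a capable agent would *want* to allow this; the stub just hard-wires it.

def corrigible_run(policy, max_steps, shutdown_requested):
    """Run `policy` for up to `max_steps`, halting immediately on shutdown."""
    actions = []
    for t in range(max_steps):
        if shutdown_requested(t):   # operator override is checked first...
            break                   # ...and obeyed without trade-off
        actions.append(policy(t))
    return actions

# The agent would act for 10 steps, but the operator requests shutdown at t = 4.
acts = corrigible_run(policy=lambda t: f"act_{t}",
                      max_steps=10,
                      shutdown_requested=lambda t: t >= 4)
```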
Interpretability

Techniques for understanding how AI systems make decisions and what they have learned.

Active Research
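For the simplest possible model class this is easy, which makes a useful baseline: a linear scorer's decision decomposes exactly into per-feature contributions. The weights and inputs below are invented for illustration; deep networks admit no such exact decomposition, which is what makes interpretability an open problem.

```python
# Simplest-case attribution sketch: for a linear scorer, the output
# decomposes exactly into per-feature contributions (weight * input).
import numpy as np

w = np.array([2.0, -1.0, 0.5])               # hypothetical learned weights
x = np.array([1.0, 3.0, 4.0])                # one input example

contributions = w * x                        # per-feature contribution
score = contributions.sum()                  # identical to the model output w @ x
ranked = np.argsort(-np.abs(contributions))  # most influential feature first
```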
RLHF

Reinforcement Learning from Human Feedback: fine-tuning AI systems against a reward model learned from human preference comparisons between outputs.

Deployed
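The reward-modeling step of RLHF can be sketched in isolation (the later policy-optimization step is omitted): fit a reward function so that preferred items score higher, using the standard Bradley-Terry pairwise objective. Everything below is synthetic: a linear reward, random features, and noise-free preference labels generated from a hidden "human" weight vector.

```python
# Reward-modeling sketch: fit r(x) = w @ x to pairwise preferences with the
# Bradley-Terry model, P(a preferred over b) = sigmoid(r(a) - r(b)).
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])                    # the "human's" hidden values
X = rng.normal(size=(200, 2))
pairs = [(X[i], X[i + 1]) for i in range(0, 200, 2)]
# Noise-free labels: `a` is the preferred item in each pair.
prefs = [(a, b) if true_w @ a > true_w @ b else (b, a) for a, b in pairs]

w = np.zeros(2)
lr = 0.5
for _ in range(200):                              # gradient ascent on log-likelihood
    grad = np.zeros(2)
    for a, b in prefs:
        p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))
        grad += (1.0 - p) * (a - b)
    w += lr * grad / len(prefs)

# Fraction of pairs the learned reward ranks the same way the "human" did.
agree = sum((w @ a) > (w @ b) for a, b in prefs) / len(prefs)
```

In deployed RLHF the linear scorer is replaced by a neural reward model over text, and the learned reward then drives reinforcement learning of the policy.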
Constitutional AI

Training AI systems to critique and revise their own outputs against a written set of principles, reducing the need for direct human feedback on each output.

Deployed
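A stripped-down critique-and-revise pass, the core loop of the approach, might look like the following. The principle, critique, and revision here are string stubs standing in for language-model calls; real systems use a list of principles and AI-generated feedback.

```python
# Toy critique-and-revise pass in the spirit of Constitutional AI.
# All components are stand-ins for LLM calls, not a real implementation.

PRINCIPLE = "Do not reveal secrets."

def critique(response):
    """Check the response against the principle; return an issue or None."""
    return "response leaks a secret" if "SECRET" in response else None

def revise(response):
    """Rewrite the response so the critique no longer applies."""
    return response.replace("SECRET", "[redacted]")

def constitutional_step(response):
    """One self-critique pass: flag violations of the principle, then revise."""
    return revise(response) if critique(response) else response

out = constitutional_step("The launch code is SECRET-1234.")
```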
AI Safety via Debate

Using adversarial debates between AI systems to help humans judge complex questions.

Active Research
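A toy version of the setup: two "debaters" argue opposite answers by citing evidence, and a limited "judge" verifies only the cited items rather than evaluating the whole question itself. Every component below is an invented stand-in for a learned model.

```python
# Toy debate: the judge is weaker than the debaters and can only check
# citations, which is the situation debate is meant to exploit.

facts = {"f1": True, "f2": True, "f3": False}     # hypothetical ground truth

def debater(claim, budget=2):
    """Cite up to `budget` facts that support answering `claim`."""
    return [name for name, truth in facts.items() if truth == claim][:budget]

def judge(cites_true, cites_false):
    """Check only the citations; side with the better-supported answer."""
    support_true = sum(facts[c] for c in cites_true)
    support_false = sum(not facts[c] for c in cites_false)
    return support_true >= support_false

verdict = judge(debater(True), debater(False))
```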
Iterated Amplification

Recursively training AI systems by decomposing hard problems into easier subproblems.

Theoretical
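The decomposition idea can be sketched directly: a question too hard for the base model is split into subquestions it *can* answer, and the answers are recombined. Summing a list is a deliberately trivial stand-in for "a hard question"; the scheme's interest lies in decompositions for questions humans cannot evaluate directly.

```python
# Decomposition sketch behind iterated amplification.

def weak_solver(xs):
    """The base model: only trusted on inputs of at most two items."""
    assert len(xs) <= 2
    return sum(xs)

def amplified(xs):
    """Decompose, solve the subproblems (recursively), recombine."""
    if len(xs) <= 2:
        return weak_solver(xs)
    mid = len(xs) // 2
    return amplified(xs[:mid]) + amplified(xs[mid:])

total = amplified(list(range(10)))   # the weak solver never sees > 2 items
```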
Cooperative IRL

Cooperative Inverse Reinforcement Learning: the AI learns human preferences through collaborative interaction with the human rather than passive observation.

Active Research
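The inference at the heart of the approach can be shown in miniature: the robot keeps a belief over what the human values and updates it from observed choices, assuming the human tends to pick higher-reward options (a softmax choice model). The options, candidate reward functions, and features below are all invented.

```python
# Toy Bayesian preference inference from one observed human choice.
import math

options = {"apple": [1.0, 0.0], "cake": [0.0, 1.0]}          # feature vectors
candidates = {"health": [2.0, -1.0], "taste": [-1.0, 2.0]}   # possible human values
belief = {"health": 0.5, "taste": 0.5}                       # robot's prior

def update(belief, chosen):
    """Bayes rule: P(theta | choice) is proportional to P(choice | theta) P(theta)."""
    posterior = {}
    for name, theta in candidates.items():
        scores = {o: sum(t * f for t, f in zip(theta, feats))
                  for o, feats in options.items()}
        z = sum(math.exp(s) for s in scores.values())        # softmax normalizer
        posterior[name] = belief[name] * math.exp(scores[chosen]) / z
    total = sum(posterior.values())
    return {k: v / total for k, v in posterior.items()}

belief = update(belief, "apple")   # one healthy choice shifts belief sharply
```

Full cooperative IRL additionally has the human account for being observed and the robot plan under its uncertainty; this stub shows only the belief update.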
Recursive Reward Modeling

Using AI assistants to help humans provide better feedback for training AI systems.

Active Research
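The core move, in miniature: the human cannot judge a long output directly, so AI assistants evaluate its pieces and the reward signal is assembled from their verdicts. The arithmetic "checker" below is a toy stand-in for a trained assistant model.

```python
# Assembling human-level feedback from per-claim assistant verdicts.

def assistant_check(claim):
    """One assistant's verdict on one claim of the form 'expr=value'."""
    expr, value = claim.split("=")
    return eval(expr) == int(value)     # toy verifier, not a real model

def assisted_reward(answer_claims):
    """Reward for a long answer: fraction of its claims that hold up."""
    verdicts = [assistant_check(c) for c in answer_claims]
    return sum(verdicts) / len(verdicts)

reward = assisted_reward(["2+2=4", "3*3=9", "5-1=3"])   # one claim is false
```

The "recursive" part is that the assistants are themselves trained with this kind of assisted feedback, one level down.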
Process-Oriented Learning

Training AI to follow good reasoning processes rather than just produce correct outputs.

Active Research
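The contrast with outcome-only training can be made concrete: outcome scoring checks only the final answer, while process scoring also credits each intermediate step, so a lucky but broken derivation scores lower. The "valid step" rule and the equal weighting below are toy assumptions.

```python
# Outcome-only vs. process-oriented scoring of a reasoning trace.

def step_ok(prev, step):
    """Whether `step` legitimately follows from `prev` (toy rule: +1)."""
    return step == prev + 1

def outcome_score(trace, target):
    return 1.0 if trace[-1] == target else 0.0

def process_score(trace, target):
    valid = [step_ok(a, b) for a, b in zip(trace, trace[1:])]
    return 0.5 * (sum(valid) / len(valid)) + 0.5 * outcome_score(trace, target)

sound = process_score([1, 2, 3, 4], target=4)   # valid steps, right answer
lucky = process_score([1, 2, 9, 4], target=4)   # right answer, broken steps
```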