Theories & Approaches
Technical approaches to ensuring AI systems remain beneficial and aligned with human values.
Corrigibility: The property of an AI system that allows operators to correct, modify, or shut it down without resistance.
Interpretability: Techniques for understanding how AI systems make decisions and what they have learned.
Reinforcement Learning from Human Feedback (RLHF): Training AI systems on human preference comparisons, typically by fitting a reward model to those preferences and optimizing the policy against it.
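The reward-model step of RLHF is commonly framed as a Bradley-Terry preference loss: the model should score the human-preferred response above the rejected one. A minimal sketch (illustrative only; function names and the scalar rewards are hypothetical stand-ins for reward-model outputs):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model ranks the human-preferred response higher,
    large when it ranks the rejected response higher.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for one preference pair:
loss_agree = preference_loss(2.0, -1.0)   # model agrees with the human label
loss_disagree = preference_loss(-1.0, 2.0)  # model disagrees
print(loss_agree < loss_disagree)  # True: agreement yields a smaller loss
```

Training the reward model means minimizing this loss over many labeled pairs; the policy is then optimized (e.g., via a policy-gradient method) to produce responses the reward model scores highly.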
Constitutional AI: Training AI systems to follow a set of written principles, using AI feedback guided by those principles instead of direct human feedback on each output.
Debate: Using adversarial debates between AI systems to help human judges evaluate questions too complex to assess directly.
Iterated Amplification: Recursively training AI systems by decomposing hard problems into easier subproblems whose answers can be combined.
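The decompose-solve-recombine structure of amplification can be sketched on a toy task. Everything here is a hypothetical stand-in: the "question" is just a list of numbers, the base solver is `sum`, and real systems would use a learned model at each role:

```python
def amplified_answer(question, decompose, base_answer, combine, depth):
    """Sketch of amplification: split a hard question into subquestions,
    answer each recursively (bottoming out in a weak base solver), then
    combine the sub-answers into an answer to the original question."""
    if depth == 0:
        return base_answer(question)  # base solver handles the easy cases
    sub_answers = [
        amplified_answer(sub, decompose, base_answer, combine, depth - 1)
        for sub in decompose(question)
    ]
    return combine(sub_answers)

# Toy stand-in task: total a list of numbers by splitting it in half.
halves = lambda xs: [xs[:len(xs) // 2], xs[len(xs) // 2:]]
print(amplified_answer([1, 2, 3, 4], halves, sum, sum, 1))  # 10
```

In the full scheme, the amplified (decomposed) system generates training signal for a faster distilled model, and the process iterates.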
Cooperative Inverse Reinforcement Learning (CIRL): Learning human preferences through collaborative interaction rather than passive observation.
Recursive Reward Modeling: Using AI assistants to help humans provide better feedback for training more capable AI systems.
Process Supervision: Training AI to follow good reasoning processes rather than merely produce correct final outputs.
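The contrast with outcome supervision can be made concrete with toy reward functions (hypothetical names; real process supervision uses human or model graders on each reasoning step):

```python
def outcome_reward(final_answer, correct_answer) -> float:
    """Outcome supervision: only the final answer is graded."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(step_labels) -> float:
    """Process supervision: each reasoning step is graded individually,
    so a correct answer reached through flawed steps is still penalized."""
    return sum(step_labels) / len(step_labels)

# A solution with the right answer but one bad intermediate step:
print(outcome_reward(42, 42))        # 1.0  (outcome grading sees no problem)
print(process_reward([1, 0, 1]))     # ~0.67 (process grading penalizes the bad step)
```

The key design choice: process-level rewards discourage lucky guesses and reward transparent, checkable reasoning, at the cost of requiring step-level labels.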