Key Papers

Foundational and influential research papers in AI alignment.

Risks from Learned Optimization in Advanced Machine Learning Systems
Hubinger et al. (2019) · arXiv
Mesa-Optimization · Inner Alignment

Deep Reinforcement Learning from Human Preferences
Christiano et al. (2017) · NeurIPS
RLHF

Constitutional AI: Harmlessness from AI Feedback
Bai et al. (2022) · arXiv
Constitutional AI · RLAIF

Training language models to follow instructions with human feedback
Ouyang et al. (2022) · NeurIPS
RLHF · InstructGPT

Corrigibility
Soares et al. (2015) · AAAI Workshop
Corrigibility · Value Alignment

Concrete Problems in AI Safety
Amodei et al. (2016) · arXiv
AI Safety · Research Agenda

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. (2024) · Anthropic
Interpretability · Sparse Autoencoders