Key Papers

Foundational and influential research papers in AI alignment.

Superintelligence: Paths, Dangers, Strategies
Nick Bostrom (2014) · Oxford University Press
Existential Risk · Foundational

Risks from Learned Optimization in Advanced Machine Learning Systems
Hubinger et al. (2019) · arXiv
Mesa-Optimization · Inner Alignment

Deep Reinforcement Learning from Human Preferences
Christiano et al. (2017) · NeurIPS
RLHF

Constitutional AI: Harmlessness from AI Feedback
Bai et al. (2022) · arXiv
Constitutional AI · RLAIF

Training language models to follow instructions with human feedback
Ouyang et al. (2022) · NeurIPS
RLHF · InstructGPT

Corrigibility
Soares et al. (2015) · AAAI Workshop
Corrigibility · Value Alignment

Concrete Problems in AI Safety
Amodei et al. (2016) · arXiv
AI Safety · Research Agenda

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. (2024) · Anthropic
Interpretability · Sparse Autoencoders