Key Papers
Foundational and influential research papers in AI alignment.
Risks from Learned Optimization in Advanced Machine Learning Systems
Hubinger et al. (2019) · arXiv
Mesa-Optimization · Inner Alignment
Training language models to follow instructions with human feedback
Ouyang et al. (2022) · NeurIPS
RLHF · InstructGPT
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al. (2023) · Anthropic
Interpretability
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. (2024) · Anthropic
Interpretability · Sparse Autoencoders