Key Papers
Foundational and influential research papers in AI alignment.
Superintelligence: Paths, Dangers, Strategies
Nick Bostrom (2014) · Oxford University Press
Existential Risk · Foundational
Risks from Learned Optimization in Advanced Machine Learning Systems
Hubinger et al. (2019) · arXiv
Mesa-Optimization · Inner Alignment
Training language models to follow instructions with human feedback
Ouyang et al. (2022) · NeurIPS
RLHF · InstructGPT
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al. (2023) · Anthropic
Interpretability
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. (2024) · Anthropic
Interpretability · Sparse Autoencoders
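The two Anthropic papers above rest on the same core idea: train a sparse autoencoder on a model's internal activations so that each learned dictionary feature fires for one interpretable concept. A minimal illustrative sketch of that setup follows; all dimensions, names, and hyperparameters here are hypothetical placeholders, not values from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: activation dimension and (larger) dictionary size.
d_model, d_dict = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0, 0.1, (d_dict, d_model))
b_dec = np.zeros(d_model)

def sae(x):
    """Encode activations into non-negative features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps features >= 0
    x_hat = f @ W_dec + b_dec                # linear decoder (the "dictionary")
    return f, x_hat

x = rng.normal(size=(8, d_model))            # a batch of model activations
f, x_hat = sae(x)

# Training would minimize reconstruction error plus an L1 penalty on f,
# which pushes most features to zero on any given input (sparsity).
l1_coeff = 1e-3
loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.mean(np.abs(f).sum(axis=1))
```

The dictionary is deliberately wider than the activation space (here 64 vs. 16), so the L1 term forces each input to be explained by a few active features rather than a dense mixture.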