Key Papers
Foundational and influential research papers in AI alignment.
Risks from Learned Optimization in Advanced Machine Learning Systems
Hubinger et al. (2019) · arXiv
Mesa-Optimization · Inner Alignment
Training language models to follow instructions with human feedback
Ouyang et al. (2022) · NeurIPS
RLHF · InstructGPT
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Bricken et al. (2023) · Anthropic
Interpretability
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Templeton et al. (2024) · Anthropic
Interpretability · Sparse Autoencoders