Interpretability
Interpretability is a field of AI safety research focused on understanding how neural networks work internally: what they learn, how they represent information, and why they produce specific outputs. In safety contexts the term is often used interchangeably with mechanistic interpretability, its most prominent subfield.
Overview
Modern AI systems, particularly large language models, are often described as "black boxes" - we know what goes in and what comes out, but not what happens in between. Interpretability research aims to open these black boxes.
Understanding AI internals is crucial for alignment because it allows us to verify that models are reasoning in the ways we expect, detect deceptive or manipulative behavior, and identify potential failure modes before deployment. This connects to problems like inner alignment and mesa-optimization.
Key Approaches
Mechanistic Interpretability
Reverse-engineering neural networks to understand the algorithms they implement. This involves identifying circuits, features, and computational patterns within the network. Anthropic's interpretability team has been a leader in this area.
Probing
Training small classifiers on model activations to determine what information is represented at different layers.
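As a rough illustration, the sketch below trains a linear probe on cached activations. The `activations` and `labels` tensors are placeholders standing in for data extracted from a real model; nothing here is tied to a particular architecture.

```python
# Minimal linear-probe sketch (PyTorch). `activations` stands in for a
# (num_examples, d_model) tensor of cached activations from some layer,
# and `labels` for 0/1 annotations of a concept of interest.
import torch
import torch.nn as nn

d_model = 512
activations = torch.randn(1000, d_model)       # placeholder activations
labels = torch.randint(0, 2, (1000,)).float()  # placeholder concept labels

probe = nn.Linear(d_model, 1)                  # the probe: a single linear layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    opt.zero_grad()
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = probe(activations).squeeze(-1) > 0
    acc = (preds == labels.bool()).float().mean().item()
    print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy (measured on held-out data in practice) shows that the concept is linearly decodable at that layer, not necessarily that the model uses it downstream.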
Activation Patching
Intervening on specific activations to determine their causal role in producing outputs.
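A toy version of the idea, using PyTorch forward hooks on a stand-in two-layer network; the model, inputs, and choice of layer are illustrative rather than taken from any real experiment:

```python
# Activation-patching sketch: cache a hidden activation from a "clean" run,
# splice it into a "corrupted" run, and compare outputs to estimate the
# causal contribution of that activation.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean_input, corrupted_input = torch.randn(1, 8), torch.randn(1, 8)

cache = {}

def save_hook(module, inputs, output):
    # Store the activation produced on the clean run.
    cache["hidden"] = output.detach()

def patch_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return cache["hidden"]

# 1. Clean run: cache the hidden activation.
handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Corrupted run with the clean activation patched in.
handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# 3. Corrupted run without patching, for comparison.
corrupted_out = model(corrupted_input)

# If patching moves the output back toward the clean run, the patched
# activation is causally relevant to the behavior being studied.
print("clean:", clean_out)
print("corrupted:", corrupted_out)
print("patched:", patched_out)
```

In practice the same pattern is applied to attention heads or residual-stream positions of a real transformer, comparing clean, corrupted, and patched runs on a metric such as the difference between two output logits.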
Sparse Autoencoders
Decomposing model activations into interpretable features that correspond to human-understandable concepts. This technique is central to the Towards Monosemanticity and Scaling Monosemanticity papers.
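The sketch below shows the basic recipe in miniature: an overcomplete ReLU autoencoder trained to reconstruct activations under an L1 sparsity penalty. The dimensions, data, and penalty coefficient are placeholder values, not the settings used in those papers.

```python
# Sparse-autoencoder sketch (PyTorch): learn an overcomplete dictionary of
# features whose sparse, non-negative combinations reconstruct activations.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096           # more features than activation dimensions
activations = torch.randn(1024, d_model)  # placeholder for cached model activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction from the feature dictionary
        return x_hat, f

sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # trades reconstruction quality for sparsity

for step in range(100):
    opt.zero_grad()
    x_hat, f = sae(activations)
    loss = ((x_hat - activations) ** 2).mean() + l1_coeff * f.abs().mean()
    loss.backward()
    opt.step()
```

Each learned feature is then interpreted by inspecting the inputs that activate it most strongly.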
Key Discoveries
- Induction heads: Attention-head circuits that copy repeated token patterns ([A][B] ... [A] → [B]) and appear to drive much of in-context learning
- Superposition: Models represent more features than they have dimensions by packing them into overlapping, nearly orthogonal directions (illustrated in the sketch after this list)
- Feature splitting: As the learned feature dictionary grows, coarse features split into more specific ones
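A quick numerical illustration of why superposition is possible (the dimension and feature counts are arbitrary): random unit vectors in a modest number of dimensions are already nearly orthogonal, so many feature directions can share a space with little interference.

```python
import torch
import torch.nn.functional as F

# 1024 random "feature" directions in a 64-dimensional space (arbitrary sizes).
d, n_features = 64, 1024
features = F.normalize(torch.randn(n_features, d), dim=-1)

# Pairwise cosine similarities between distinct feature directions.
overlaps = features @ features.T
overlaps.fill_diagonal_(0)
print(f"max |cosine| between distinct features: {overlaps.abs().max().item():.3f}")
# Typically well below 1.0: the directions interfere only weakly, which is
# roughly how a layer can represent more features than it has dimensions.
```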
Challenges
- Scale: Modern models have billions of parameters, far too many components to inspect by hand
- Superposition: Features are not cleanly separated into individual dimensions and must first be disentangled
- Polysemanticity: Individual neurons respond to multiple unrelated concepts, so neuron-level analysis can mislead
Key Papers
- Olah et al. (2020) - "Zoom In: An Introduction to Circuits"
- Elhage et al. (2021) - "A Mathematical Framework for Transformer Circuits"
- Bricken et al. (2023) - "Towards Monosemanticity"
- Templeton et al. (2024) - "Scaling Monosemanticity"