Interpretability

Type: Research Area
Status: Active Research
Key Orgs: Anthropic, DeepMind, OpenAI

Interpretability is a field of AI safety research focused on understanding how neural networks work internally - what they learn, how they represent information, and why they produce specific outputs. The term is often used interchangeably with mechanistic interpretability, its most prominent subfield.

Overview

Modern AI systems, particularly large language models, are often described as "black boxes" - we know what goes in and what comes out, but not what happens in between. Interpretability research aims to open these black boxes.

Understanding AI internals is crucial for alignment because it allows us to verify that models are reasoning in the ways we expect, detect deceptive or manipulative behavior, and identify potential failure modes before deployment. This connects to problems like inner alignment and mesa-optimization.

Key Approaches

Mechanistic Interpretability

Reverse-engineering neural networks to understand the algorithms they implement. This involves identifying circuits, features, and computational patterns within the network. Anthropic's interpretability team has been a leader in this area.
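
As a concrete starting point, researchers often begin by simply inspecting what individual attention heads attend to. The sketch below assumes the Hugging Face transformers library and the public GPT-2 checkpoint; the layer and head indices are arbitrary illustrations, not a documented circuit.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The cat sat on the mat. The cat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer
layer, head = 5, 1  # arbitrary choices for illustration
attn = outputs.attentions[layer][0, head]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Show which earlier tokens the final token attends to most strongly
top = attn[-1].topk(3)
for score, idx in zip(top.values, top.indices):
    print(f"{tokens[-1]!r} -> {tokens[idx.item()]!r}: {score.item():.2f}")
```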

Probing

Training small classifiers on model activations to determine what information is represented at different layers.
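
A minimal probing sketch, assuming scikit-learn; the activation matrix and labels below are random placeholders standing in for hidden states extracted from a real model and the property being tested for.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: activations[i] is a hidden-state vector from some layer,
# labels[i] is the property we want to test for (e.g. a binary syntactic feature).
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))  # stand-in for cached layer activations
labels = rng.integers(0, 2, size=1000)      # stand-in for the target property

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the property is linearly decodable at this
# layer; with random data as here, accuracy should stay near 0.5.
print("probe accuracy:", probe.score(X_test, y_test))
```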

Activation Patching

Intervening on specific activations - for example, replacing them with activations cached from a run on a different input - to determine their causal role in producing outputs.
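
A minimal activation-patching sketch in PyTorch on a toy MLP; real experiments target specific components (attention heads, MLP neurons, residual-stream positions) of a trained transformer, but the caching-and-overwriting mechanics are the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

clean_input = torch.randn(1, 4)
corrupt_input = torch.randn(1, 4)

# 1. Run on the "clean" input and cache the hidden activation.
cache = {}
def save_hook(module, inp, out):
    cache["hidden"] = out.detach()

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

corrupt_out = model(corrupt_input)

# 2. Re-run on the "corrupt" input, overwriting part of the hidden
#    activation with the cached clean values.
def patch_hook(module, inp, out):
    patched = out.clone()
    patched[:, :4] = cache["hidden"][:, :4]  # patch only the first four units
    return patched  # returning a value from a forward hook replaces the output

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

# The closer patched_out is to clean_out, the more those patched units
# account for the difference between the two runs.
print(clean_out, corrupt_out, patched_out, sep="\n")
```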

Sparse Autoencoders

Decomposing model activations into interpretable features that correspond to human-understandable concepts. This technique is central to the Towards Monosemanticity and Scaling Monosemanticity papers.
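
A minimal sparse-autoencoder sketch in PyTorch, trained on random placeholder activations; the dimensions, L1 coefficient, and training loop are illustrative only, not the setup used in those papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_features, l1_coeff = 64, 512, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # overcomplete, encouraged to be sparse
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(4096, d_model)  # stand-in for cached model activations

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    recon, features = sae(batch)
    # Reconstruction term keeps features faithful; L1 term keeps them sparse.
    loss = (recon - batch).pow(2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```

The L1 penalty trades reconstruction fidelity for sparsity; the published work scales the number of features far beyond this toy setting so that individual features line up with human-understandable concepts.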

Key Discoveries

  • Induction heads: Attention-head circuits that implement a simple form of in-context learning by copying patterns seen earlier in the context
  • Superposition: Models represent more features than they have dimensions (see the toy sketch after this list)
  • Feature splitting: As sparse autoencoders are scaled up, broad features split into more specific ones
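
The toy NumPy sketch below illustrates the basic idea behind superposition, not the behaviour of any real model: when feature activations are sparse, many more features than dimensions can often be stored and approximately read back out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 100, 40

# Random unit-norm feature directions; in higher dimensions they are
# nearly orthogonal to one another on average.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature vector: only a few features active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

hidden = x @ W           # compress 100 features into 40 dimensions
readout = hidden @ W.T   # naive linear readout of every feature

# Because the activations are sparse, the active features tend to have the
# largest readout values despite the compression; print both to compare.
print("active features:", sorted(active))
print("top readout:    ", sorted(np.argsort(readout)[-3:]))
```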

Challenges

  • Scale: Modern models have billions of parameters, far more than researchers can analyze by hand
  • Superposition: Features are not cleanly separated into individual dimensions
  • Polysemanticity: Individual neurons respond to multiple unrelated concepts

Key Papers

  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Anthropic, 2023)
  • Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Anthropic, 2024)

See Also

  • Inner alignment
  • Mesa-optimization

Last updated: November 27, 2025