Interpretability
Interpretability is a field of AI safety research focused on understanding how neural networks work internally: what they learn, how they represent information, and why they produce specific outputs. In safety contexts the term is often used interchangeably with mechanistic interpretability, its most prominent subfield.
Overview
Modern AI systems, particularly large language models, are often described as "black boxes" - we know what goes in and what comes out, but not what happens in between. Interpretability research aims to open these black boxes.
Understanding AI internals is crucial for alignment because it allows us to verify that models are reasoning in the ways we expect, detect deceptive or manipulative behavior, and identify potential failure modes before deployment. This connects to problems like inner alignment and mesa-optimization.
Key Approaches
Mechanistic Interpretability
Reverse-engineering neural networks to understand the algorithms they implement. This involves identifying circuits, features, and computational patterns within the network. Anthropic's interpretability team has been a leader in this area.
Probing
Training small classifiers on model activations to determine what information is represented at different layers.
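As a rough illustration, the sketch below trains a linear probe on cached activations. The `activations` and `labels` tensors are placeholders standing in for data extracted from a real model; nothing here is tied to a particular architecture.

```python
# Minimal linear-probe sketch (PyTorch). `activations` stands in for a
# (num_examples, d_model) tensor of cached activations from some layer,
# and `labels` for 0/1 annotations of a concept of interest.
import torch
import torch.nn as nn

d_model = 512
activations = torch.randn(1000, d_model)       # placeholder activations
labels = torch.randint(0, 2, (1000,)).float()  # placeholder concept labels

probe = nn.Linear(d_model, 1)                  # the probe: a single linear layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    opt.zero_grad()
    logits = probe(activations).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = probe(activations).squeeze(-1) > 0
    acc = (preds == labels.bool()).float().mean().item()
    print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy (measured on held-out data in practice) shows that the concept is linearly decodable at that layer, not necessarily that the model uses it downstream.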
Activation Patching
Intervening on specific activations to determine their causal role in producing outputs.
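A toy version of the idea, using PyTorch forward hooks on a stand-in two-layer network; the model, inputs, and choice of layer are illustrative rather than taken from any real experiment:

```python
# Activation-patching sketch: cache a hidden activation from a "clean" run,
# splice it into a "corrupted" run, and compare outputs to estimate the
# causal contribution of that activation.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean_input, corrupted_input = torch.randn(1, 8), torch.randn(1, 8)

cache = {}

def save_hook(module, inputs, output):
    # Store the activation produced on the clean run.
    cache["hidden"] = output.detach()

def patch_hook(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return cache["hidden"]

# 1. Clean run: cache the hidden activation.
handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Corrupted run with the clean activation patched in.
handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# 3. Corrupted run without patching, for comparison.
corrupted_out = model(corrupted_input)

# If patching moves the output back toward the clean run, the patched
# activation is causally relevant to the behavior being studied.
print("clean:", clean_out)
print("corrupted:", corrupted_out)
print("patched:", patched_out)
```

In practice the same pattern is applied to attention heads or residual-stream positions of a real transformer, comparing clean, corrupted, and patched runs on a metric such as the difference between two output logits.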
Sparse Autoencoders
Decomposing model activations into interpretable features that correspond to human-understandable concepts. This technique is central to the Towards Monosemanticity and Scaling Monosemanticity papers.
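The sketch below shows the basic recipe in miniature: an overcomplete ReLU autoencoder trained to reconstruct activations under an L1 sparsity penalty. The dimensions, data, and penalty coefficient are placeholder values, not the settings used in those papers.

```python
# Sparse-autoencoder sketch (PyTorch): learn an overcomplete dictionary of
# features whose sparse, non-negative combinations reconstruct activations.
import torch
import torch.nn as nn

d_model, d_features = 512, 4096           # more features than activation dimensions
activations = torch.randn(1024, d_model)  # placeholder for cached model activations

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction from the feature dictionary
        return x_hat, f

sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # trades reconstruction quality for sparsity

for step in range(100):
    opt.zero_grad()
    x_hat, f = sae(activations)
    loss = ((x_hat - activations) ** 2).mean() + l1_coeff * f.abs().mean()
    loss.backward()
    opt.step()
```

Each learned feature is then interpreted by inspecting the inputs that activate it most strongly.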
Key Discoveries
- Induction heads: Attention-head circuits that copy repeated token patterns ([A][B] ... [A] → [B]) and appear to drive much of in-context learning
- Superposition: Models represent more features than they have dimensions by packing them into overlapping, nearly orthogonal directions (illustrated in the sketch after this list)
- Feature splitting: As the learned feature dictionary grows, coarse features split into more specific ones
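A quick numerical illustration of why superposition is possible (the dimension and feature counts are arbitrary): random unit vectors in a modest number of dimensions are already nearly orthogonal, so many feature directions can share a space with little interference.

```python
import torch
import torch.nn.functional as F

# 1024 random "feature" directions in a 64-dimensional space (arbitrary sizes).
d, n_features = 64, 1024
features = F.normalize(torch.randn(n_features, d), dim=-1)

# Pairwise cosine similarities between distinct feature directions.
overlaps = features @ features.T
overlaps.fill_diagonal_(0)
print(f"max |cosine| between distinct features: {overlaps.abs().max().item():.3f}")
# Typically well below 1.0: the directions interfere only weakly, which is
# roughly how a layer can represent more features than it has dimensions.
```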
Challenges
- Scale: Modern models have billions of parameters, far too many components to inspect by hand
- Superposition: Features are not cleanly separated into individual dimensions and must first be disentangled
- Polysemanticity: Individual neurons respond to multiple unrelated concepts, so neuron-level analysis can mislead
Key Papers
- Olah et al. (2020) - "Zoom In: An Introduction to Circuits"
- Elhage et al. (2021) - "A Mathematical Framework for Transformer Circuits"
- Bricken et al. (2023) - "Towards Monosemanticity"
- Templeton et al. (2024) - "Scaling Monosemanticity"