Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Tags: Paper, Interpretability, Anthropic
Paper Details

Authors: Bricken, Templeton, et al.
Year: 2023
Venue: Anthropic Technical Report
Organization: Anthropic

📄 Read the full paper on Transformer Circuits →

Abstract

A major obstacle to understanding neural networks is polysemanticity, where individual neurons respond to multiple unrelated features. This makes it difficult to understand what a neural network is doing by examining individual neurons. We use a weak dictionary learning algorithm called a sparse autoencoder to generate learned features from a trained model that offer a more monosemantic unit of analysis than the model's neurons themselves. We demonstrate that these features are more interpretable, can be used to intervene on model behavior, and provide a useful decomposition of model activations.
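
The "useful decomposition of model activations" can be sketched in the general sparse dictionary-learning form, in which each activation vector is rebuilt from a small number of learned directions; the notation below is illustrative rather than the paper's exact formulation.

```latex
% Sketch of the decomposition (illustrative notation): an activation x is
% approximated by a bias term plus a sparse, non-negative combination of
% learned dictionary directions d_i.
\[
  x \;\approx\; b_d + \sum_i f_i(x)\, d_i,
  \qquad
  f_i(x) = \mathrm{ReLU}\!\big(W_e(x - b_d) + b_e\big)_i
\]
```

The sparsity of the coefficients f_i(x), enforced during training, is what pushes each direction d_i toward representing a single concept.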

Key Contributions

  • Sparse autoencoders for interpretability: Using dictionary learning to find interpretable features
  • Monosemantic features: Extracted features that correspond to single concepts
  • Superposition hypothesis: Evidence that models represent more features than they have dimensions
  • Intervention capabilities: Showing that identified features can be used to modify model behavior

Summary

Neural network neurons are "polysemantic" - they respond to multiple unrelated concepts. This makes interpreting neural networks difficult because you can't simply ask "what does this neuron do?"

This paper trains sparse autoencoders on model activations (the MLP layer of a one-layer transformer) to decompose them into more interpretable "features." These features are more monosemantic than the underlying neurons: each corresponds to a more specific concept, such as "DNA sequences" or "expressions of surprise."
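
A minimal sketch of this setup, assuming PyTorch; the class and loss names, the expansion factor, and the L1 coefficient are illustrative choices rather than the paper's exact implementation:

```python
# Minimal sparse-autoencoder sketch (illustrative hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_features = d_model * expansion  # overcomplete dictionary
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU keeps feature activations sparse and non-negative.
        feats = F.relu(self.encoder(acts))
        recon = self.decoder(feats)
        return recon, feats


def sae_loss(recon, acts, feats, l1_coeff: float = 1e-3):
    # Reconstruction term keeps the dictionary faithful to the model's
    # activations; the L1 term pushes each input onto few features.
    return F.mse_loss(recon, acts) + l1_coeff * feats.abs().sum(-1).mean()
```

The autoencoder is trained on activations collected from the model while it runs on data, so the learned dictionary describes the model's own representation rather than its inputs.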

Key findings include:

  • Sparse autoencoders can recover interpretable features from polysemantic neurons
  • The number of features exceeds the number of neurons (supporting superposition)
  • Identified features can be used for targeted interventions on model behavior (see the sketch after this list)
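
A hedged sketch of what such an intervention could look like, reusing the `SparseAutoencoder` sketch above; the function name and scaling scheme are hypothetical, and wiring the edited activations back into the model (e.g. via a forward hook) is omitted:

```python
import torch


def steer_with_feature(sae, acts, feature_idx: int, scale: float):
    # Re-encode the activations, rescale one learned feature, and decode.
    # scale = 0.0 ablates the feature; scale > 1.0 amplifies it. The
    # returned tensor would replace the original activations in the model.
    with torch.no_grad():
        feats = torch.relu(sae.encoder(acts))
        feats[..., feature_idx] *= scale
        return sae.decoder(feats)
```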

Impact

This paper represents a major advance in mechanistic interpretability. The sparse autoencoder technique has become a standard tool for understanding neural network internals and was later extended to a production-scale model (Claude 3 Sonnet) in Scaling Monosemanticity.

Last updated: November 27, 2025