Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Paper · Interpretability · Anthropic
Paper Details
📄 Read the full paper on Transformer Circuits →
Abstract
We scale sparse autoencoders to extract interpretable features from Claude 3 Sonnet, a state-of-the-art large language model. We find millions of features corresponding to a wide variety of concepts, including cities, people, emotions, programming concepts, and many others. We demonstrate that these features can be used to steer model behavior, and show examples of safety-relevant features including those related to deception, dangerous content, and bias.
Key Contributions
- Scaling to frontier models: First application of sparse autoencoders to a state-of-the-art production model
- Millions of features: Extracted interpretable features at unprecedented scale
- Safety-relevant features: Identified features related to deception, harm, and bias
- Behavioral steering: Demonstrated that features can be manipulated to change model behavior
Summary
This paper extends the work from Towards Monosemanticity to a much larger scale. The researchers trained sparse autoencoders on Claude 3 Sonnet and extracted millions of interpretable features.
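As a rough illustration of the underlying technique, here is a minimal sparse autoencoder sketch in PyTorch: a wide, non-negative feature layer trained to reconstruct model activations under an L1 sparsity penalty. The layer sizes, loss coefficient, and all names (`SparseAutoencoder`, `sae_loss`) are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sketch of a sparse autoencoder (SAE): activations are encoded
    into a much wider, mostly-zero feature vector, then decoded back into
    the original activation space."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder: activation space -> feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder: feature space -> activation space; each decoder column is
        # the direction one feature writes back into the activations.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))   # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 penalty that pushes most features to
    zero on any given input (the 'sparse' in sparse autoencoder)."""
    mse = (reconstruction - x).pow(2).mean()
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1


# Toy dimensions for the sketch; the paper scales the feature dictionary to
# millions of features on Claude 3 Sonnet's activations.
sae = SparseAutoencoder(d_model=512, d_features=16_384)
x = torch.randn(8, 512)            # stand-in activations
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
```

In this framing, interpretable features correspond to individual coordinates of `feats`, each tied to a fixed decoder direction that can be inspected or manipulated.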
Key findings include:
- Abstract features: Features representing abstract concepts like "inner conflict" or "deception"
- Multimodal features: Some features respond to both text and images
- Safety features: Features that activate on potentially harmful content
- Steering capabilities: Amplifying or suppressing features changes model behavior predictably (see the steering sketch below)
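Steering amounts to adding (or subtracting) a multiple of a feature's decoder direction to the model's activations during a forward pass. The sketch below uses a standard PyTorch forward hook; `layer` and `feature_direction` are placeholders and the scale is arbitrary, so this illustrates the mechanism rather than the paper's exact procedure.

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, scale: float):
    """Return a forward hook that adds `scale` times a feature's decoder
    direction to a layer's output, amplifying (scale > 0) or suppressing
    (scale < 0) the corresponding feature."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * feature_direction.to(hidden)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage: `layer` stands in for a chosen residual-stream module
# of a loaded model, and `feature_direction` for one row of a trained SAE's
# decoder weight matrix.
# handle = layer.register_forward_hook(
#     make_steering_hook(feature_direction, scale=10.0))
# ... run generation with the hook active ...
# handle.remove()
```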
Implications for Safety
The paper demonstrates several safety-relevant applications:
- Detecting when models might be engaging in deception (sketched below)
- Understanding what triggers potentially harmful outputs
- Potentially using features for more targeted safety interventions
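As one concrete way features could support such monitoring, the sketch below treats a handful of safety-relevant features as activation-level detectors. It assumes the `SparseAutoencoder` sketch above, and the feature indices and threshold are hypothetical; real indices would come from analyzing a trained SAE's dictionary.

```python
import torch

# Hypothetical feature indices; real ones would come from inspecting a
# trained SAE's dictionary, not from this sketch.
SAFETY_FEATURES = {1234: "deception", 5678: "bias", 9012: "dangerous content"}

@torch.no_grad()
def flag_safety_features(sae, activations, threshold: float = 1.0):
    """Encode activations with the SAE and report which designated
    safety-relevant features exceed the threshold at any token position."""
    _, features = sae(activations)        # shape: (tokens, d_features)
    flagged = {}
    for idx, label in SAFETY_FEATURES.items():
        max_act = features[:, idx].max().item()
        if max_act > threshold:
            flagged[label] = max_act
    return flagged

# Assuming `sae` from the SAE sketch above and stand-in activations whose
# width matches its d_model:
# acts = torch.randn(128, 512)
# print(flag_safety_features(sae, acts))
```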
Impact
This represents a major milestone for mechanistic interpretability, showing that the approach can scale to frontier production models. It suggests a path toward understanding and controlling AI systems through their internal representations.
Last updated: November 27, 2025