Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Paper Details
  • Authors: Templeton, Conerly, et al.
  • Year: 2024
  • Venue: Anthropic Technical Report
  • Organization: Anthropic

📄 Read the full paper on Transformer Circuits →

Abstract

We scale sparse autoencoders to extract interpretable features from Claude 3 Sonnet, a state-of-the-art large language model. We find millions of features corresponding to a wide variety of concepts, including cities, people, emotions, programming concepts, and many others. We demonstrate that these features can be used to steer model behavior, and show examples of safety-relevant features including those related to deception, dangerous content, and bias.

Key Contributions

  • Scaling to frontier models: First application of sparse autoencoders to a state-of-the-art production model
  • Millions of features: Extracted interpretable features at unprecedented scale
  • Safety-relevant features: Identified features related to deception, harm, and bias
  • Behavioral steering: Demonstrated that features can be manipulated to change model behavior

Summary

This paper extends the work from Towards Monosemanticity to a much larger scale. The researchers trained sparse autoencoders of three sizes (roughly 1 million, 4 million, and 34 million features) on residual stream activations from a middle layer of Claude 3 Sonnet, extracting millions of interpretable features.
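
To make the setup concrete, below is a minimal PyTorch sketch of a sparse autoencoder of the kind trained in the paper. The dimensions, L1 coefficient, and names are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Minimal sparse autoencoder (SAE) sketch. All sizes and the L1 coefficient
# are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder maps residual-stream activations into a much wider,
        # sparsely activating feature space.
        self.encoder = nn.Linear(d_model, n_features)
        # Decoder reconstructs the original activation from the active features.
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(features), features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps features faithful to the model's activations;
    # the L1 penalty pushes most feature activations to zero (sparsity).
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Toy usage on random stand-ins for residual-stream activations.
sae = SparseAutoencoder(d_model=4096, n_features=65536)
x = torch.randn(8, 4096)
recon, feats = sae(x)
sae_loss(x, recon, feats).backward()
```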

Key findings include:

  • Abstract features: Features representing abstract concepts like "inner conflict" or "deception"
  • Multimodal features: Some features respond to both text and images
  • Safety features: Features that activate on potentially harmful content
  • Steering capabilities: Amplifying or suppressing features changes model behavior predictably
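
The steering described in the last point works by pushing the model's activations along a feature's decoder direction. A hedged sketch follows, reusing the SAE layout from the earlier example; the feature index, scale, and dimensions are hypothetical:

```python
# Steering sketch: amplify (scale > 0) or suppress (scale < 0) one SAE feature
# by shifting an activation along that feature's decoder direction.
# Feature index, scale, and dimensions below are hypothetical.
import torch


def steer_activation(activation: torch.Tensor,
                     decoder_weight: torch.Tensor,
                     feature_index: int,
                     scale: float) -> torch.Tensor:
    direction = decoder_weight[:, feature_index]   # (d_model,) feature direction
    direction = direction / direction.norm()       # unit-normalize
    return activation + scale * direction          # shift the residual stream


# Example: nudge a batch of activations along feature 1234's direction.
d_model, n_features = 4096, 65536
decoder_weight = torch.randn(d_model, n_features)  # columns = feature directions
acts = torch.randn(8, d_model)
steered = steer_activation(acts, decoder_weight, feature_index=1234, scale=5.0)
```

In the paper, for example, clamping the "Golden Gate Bridge" feature to a high value made the model bring up the bridge in nearly every response.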

Implications for Safety

The paper demonstrates several safety-relevant applications:

  • Detecting when models might be engaging in deception (see the monitoring sketch after this list)
  • Understanding what triggers potentially harmful outputs
  • Potentially using features for more targeted safety interventions
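
As an illustration of the detection idea, a monitor could watch for safety-relevant features firing strongly on a prompt or completion. The feature indices, labels, and threshold below are hypothetical placeholders, not features from the paper:

```python
# Hedged sketch of feature-based safety monitoring: flag inputs whose
# activations strongly excite known safety-relevant SAE features.
# Feature indices, labels, and the threshold are hypothetical.
import torch

SAFETY_FEATURES = {1234: "deception", 5678: "dangerous content"}
THRESHOLD = 4.0  # activation level above which a feature is flagged (assumed)


def flag_safety_features(activations: torch.Tensor,
                         sae_encoder: torch.nn.Linear) -> list[str]:
    # activations: (seq_len, d_model) residual-stream vectors, one per token.
    features = torch.relu(sae_encoder(activations))   # (seq_len, n_features)
    max_per_feature = features.max(dim=0).values
    return [label for idx, label in SAFETY_FEATURES.items()
            if max_per_feature[idx] > THRESHOLD]


# Example with random activations and an untrained encoder as a stand-in.
encoder = torch.nn.Linear(4096, 65536)
acts = torch.randn(128, 4096)
print(flag_safety_features(acts, encoder))
```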

Impact

This represents a major milestone for mechanistic interpretability, showing that the approach can scale to frontier production models. It suggests a path toward understanding and controlling AI systems through their internal representations.

Last updated: November 27, 2025