Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Paper Details

  • Authors: Templeton, Conerly, et al.
  • Year: 2024
  • Venue: Anthropic Technical Report
  • Organization: Anthropic

📄 Read the full paper on Transformer Circuits →

Abstract

We scale sparse autoencoders to extract interpretable features from Claude 3 Sonnet, a state-of-the-art large language model. We find millions of features corresponding to a wide variety of concepts, including cities, people, emotions, programming concepts, and many others. We demonstrate that these features can be used to steer model behavior, and show examples of safety-relevant features including those related to deception, dangerous content, and bias.

Key Contributions

  • Scaling to frontier models: First application of sparse autoencoders to a state-of-the-art production model
  • Millions of features: Extracted interpretable features at unprecedented scale
  • Safety-relevant features: Identified features related to deception, harm, and bias
  • Behavioral steering: Demonstrated that features can be manipulated to change model behavior

Summary

This paper extends the work from Towards Monosemanticity, which studied a small one-layer model, to a much larger scale. The researchers trained sparse autoencoders on residual-stream activations from Claude 3 Sonnet and extracted millions of interpretable features.
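
To make the setup concrete, here is a minimal PyTorch sketch of a sparse autoencoder following the general recipe from Towards Monosemanticity (ReLU encoder, linear decoder, L1 sparsity penalty). The dimensions and the L1 coefficient are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct model activations through an
    overcomplete, sparsely activating feature layer. Dimensions are
    illustrative, not the paper's configuration."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> features
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative feature activations
        x_hat = self.decoder(f)          # reconstructed activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features.
    The l1_coeff value here is an assumed placeholder."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The sparsity penalty is what pushes each feature to fire on a narrow, interpretable slice of inputs rather than on everything at once.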

Key findings include:

  • Abstract features: Features representing abstract concepts like "inner conflict" or "deception"
  • Multimodal features: Some features respond to both text and images
  • Safety features: Features that activate on potentially harmful content
  • Steering capabilities: Amplifying or suppressing features changes model behavior predictably (see the sketch after this list)
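
Feature steering can be sketched in the same terms: add a multiple of a feature's decoder direction back into the model's activations at inference time. This is a simplified stand-in for the paper's method, which clamps feature values during the forward pass; the `scale` parameter below is an assumed illustration.

```python
def steer_with_feature(activations: torch.Tensor,
                       sae: SparseAutoencoder,
                       feature_idx: int,
                       scale: float = 5.0) -> torch.Tensor:
    """Add (or, with a negative scale, subtract) one feature's decoder
    direction. A hedged sketch; the paper clamps feature activations
    rather than adding a fixed offset."""
    direction = sae.decoder.weight[:, feature_idx].detach()  # (d_model,)
    return activations + scale * direction
```

Setting a positive scale amplifies the concept the feature encodes; a negative scale suppresses it.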

Implications for Safety

The paper demonstrates several safety-relevant applications:

  • Detecting when models might be engaging in deception (a monitoring sketch follows this list)
  • Understanding what triggers potentially harmful outputs
  • Potentially using features for more targeted safety interventions
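
As a hedged illustration of what feature-based monitoring could look like, the sketch below flags when human-labeled safety features fire above a threshold. The `watched` indices, labels, and threshold are hypothetical; the paper identifies such features but does not describe a production monitoring system.

```python
def flag_safety_features(activations: torch.Tensor,
                         sae: SparseAutoencoder,
                         watched: dict[int, str],
                         threshold: float = 1.0) -> list[str]:
    """Report which watched features (e.g. ones labeled 'deception'
    or 'harmful content' by a human reviewer) fire above a threshold
    on a batch of activations. All specifics here are assumptions."""
    _, f = sae(activations)              # (batch, n_features) activations
    fired = (f > threshold).any(dim=0)   # did each feature fire anywhere?
    return [label for idx, label in watched.items() if fired[idx]]
```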

Impact

This represents a major milestone for mechanistic interpretability, showing that sparse-autoencoder methods can scale to frontier production models. It suggests a path toward understanding and controlling AI systems through their internal representations.

Last updated: November 27, 2025