Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Paper Details

  • Authors: Templeton, Conerly, et al.
  • Year: 2024
  • Venue: Anthropic Technical Report
  • Organization: Anthropic

📄 Read the full paper on Transformer Circuits →

Abstract

We scale sparse autoencoders to extract interpretable features from Claude 3 Sonnet, a state-of-the-art large language model. We find millions of features corresponding to a wide variety of concepts, including cities, people, emotions, programming concepts, and many others. We demonstrate that these features can be used to steer model behavior, and show examples of safety-relevant features including those related to deception, dangerous content, and bias.

Key Contributions

  • Scaling to frontier models: First application of sparse autoencoders to a state-of-the-art production model
  • Millions of features: Extracted interpretable features at unprecedented scale
  • Safety-relevant features: Identified features related to deception, harm, and bias
  • Behavioral steering: Demonstrated that features can be manipulated to change model behavior

Summary

This paper extends the work from Towards Monosemanticity, which studied a small one-layer model, to a much larger scale. The researchers trained sparse autoencoders on residual-stream activations from Claude 3 Sonnet and extracted millions of interpretable features.
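
To make the setup concrete, here is a minimal PyTorch sketch of a sparse autoencoder following the general recipe from Towards Monosemanticity (ReLU encoder, linear decoder, L1 sparsity penalty). The dimensions and the L1 coefficient are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct model activations through an
    overcomplete, sparsely activating feature layer. Dimensions are
    illustrative, not the paper's configuration."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> features
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # non-negative feature activations
        x_hat = self.decoder(f)          # reconstructed activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features.
    The l1_coeff value here is an assumed placeholder."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The sparsity penalty is what pushes each feature to fire on a narrow, interpretable slice of inputs rather than on everything at once.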

Key findings include:

  • Abstract features: Features representing abstract concepts like "inner conflict" or "deception"
  • Multimodal features: Some features respond to both text and images
  • Safety features: Features that activate on potentially harmful content
  • Steering capabilities: Amplifying or suppressing features changes model behavior predictably (see the sketch after this list)
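
Feature steering can be sketched in the same terms: add a multiple of a feature's decoder direction back into the model's activations at inference time. This is a simplified stand-in for the paper's method, which clamps feature values during the forward pass; the `scale` parameter below is an assumed illustration.

```python
def steer_with_feature(activations: torch.Tensor,
                       sae: SparseAutoencoder,
                       feature_idx: int,
                       scale: float = 5.0) -> torch.Tensor:
    """Add (or, with a negative scale, subtract) one feature's decoder
    direction. A hedged sketch; the paper clamps feature activations
    rather than adding a fixed offset."""
    direction = sae.decoder.weight[:, feature_idx].detach()  # (d_model,)
    return activations + scale * direction
```

Setting a positive scale amplifies the concept the feature encodes; a negative scale suppresses it.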

Implications for Safety

The paper demonstrates several safety-relevant applications:

  • Detecting when models might be engaging in deception (a monitoring sketch follows this list)
  • Understanding what triggers potentially harmful outputs
  • Potentially using features for more targeted safety interventions
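
As a hedged illustration of what feature-based monitoring could look like, the sketch below flags when human-labeled safety features fire above a threshold. The `watched` indices, labels, and threshold are hypothetical; the paper identifies such features but does not describe a production monitoring system.

```python
def flag_safety_features(activations: torch.Tensor,
                         sae: SparseAutoencoder,
                         watched: dict[int, str],
                         threshold: float = 1.0) -> list[str]:
    """Report which watched features (e.g. ones labeled 'deception'
    or 'harmful content' by a human reviewer) fire above a threshold
    on a batch of activations. All specifics here are assumptions."""
    _, f = sae(activations)              # (batch, n_features) activations
    fired = (f > threshold).any(dim=0)   # did each feature fire anywhere?
    return [label for idx, label in watched.items() if fired[idx]]
```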

Impact

This represents a major milestone for mechanistic interpretability, showing that sparse-autoencoder methods can scale to frontier production models. It suggests a path toward understanding and controlling AI systems through their internal representations.

Last updated: November 27, 2025