AlignmentWiki — Zero Sum & AI Alignment Research

Chris Olah

Role: Anthropic Co-founder

Known For: Neural Network Interpretability

Previous: Google Brain, OpenAI

Blog: colah.github.io

Chris Olah is a researcher and co-founder of Anthropic, widely recognized as a pioneer in neural network interpretability. His visual explanations of machine learning concepts and research into understanding neural networks have been foundational to the field.

Career

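Google Brain (2015-2018)

At Google Brain, Olah led research into understanding what neural networks learn. His team developed techniques for visualizing the features learned by image classifiers.

OpenAI (2018-2020)

At OpenAI, Olah led interpretability research that produced the influential work on neural network circuits, which studies how individual features connect to implement a network's behavior.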

Anthropic (2021-present)

Olah co-founded Anthropic with former colleagues from OpenAI. He leads interpretability research there, including work on monosemanticity: using dictionary learning to extract interpretable features from large language models.

Key Contributions

Feature Visualization: Techniques to see what neurons detect
Neural Network Circuits: Understanding how components combine
Distill.pub: Co-founded journal for clear ML explanations
Monosemanticity Research: Decomposing model activations into interpretable features via sparse autoencoders
Scaling Monosemanticity: Extracting millions of features from Claude 3 Sonnet
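
For readers unfamiliar with the sparse autoencoder idea mentioned in the list above, the following is a minimal sketch of the forward pass and loss, not the implementation used in the published research. It assumes a single ReLU encoder layer, a linear decoder, and an L1 sparsity penalty; the function names (sae_forward, sae_loss), the dimensions, and the l1_coeff value are all hypothetical choices made for illustration, and no training loop is shown.

```python
import random

def relu(v):
    """Element-wise ReLU over a list of floats."""
    return [max(0.0, a) for a in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode one activation vector into feature activations, then decode it.

    Returns (features, reconstruction). The feature vector is usually wider
    than x, and ReLU keeps every feature activation non-negative.
    """
    pre = matvec(w_enc, x)
    features = relu([h + b for h, b in zip(pre, b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=0.01):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Toy usage with random, untrained weights (assumed sizes: 4 -> 16 -> 4).
random.seed(0)
d_model, d_features = 4, 16
w_enc = [[random.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_features)]
w_dec = [[random.gauss(0, 0.5) for _ in range(d_features)] for _ in range(d_model)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

x = [0.3, -1.2, 0.8, 0.1]  # stand-in for one activation vector from a model
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon)
```

Training would minimize this loss over many activation vectors, so that each input is reconstructed from only a few active features. The hope is that each learned feature then corresponds to a single human-interpretable concept.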

Research Philosophy

Olah is known for emphasizing clarity and visual explanation in research. He believes that truly understanding neural networks—not just using them—is crucial for AI safety. His work aims to make AI systems inspectable and their behavior predictable.

Selected Publications

"Feature Visualization" (2017) - Distill
"The Building Blocks of Interpretability" (2018) - Distill
"Zoom In: An Introduction to Circuits" (2020) - Distill
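"Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (2023) - Transformer Circuits Thread
"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" (2024) - Transformer Circuits Thread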