Neel Nanda

Role: Research Scientist, Google DeepMind
Known For: Mechanistic Interpretability
Previous: Anthropic
Created: TransformerLens library

Neel Nanda is a leading researcher in mechanistic interpretability, focused on reverse-engineering how transformer language models work. He is known for creating educational resources that have helped many enter the field.

Career

Anthropic (2022)

Nanda worked on Anthropic's interpretability team, contributing to research on understanding neural network internals.

Google DeepMind (2022-present)

At DeepMind, Nanda continues mechanistic interpretability research, investigating how language models implement specific algorithms and capabilities.

Key Contributions

  • TransformerLens: Open-source library for transformer interpretability research
  • 200 Concrete Open Problems in Mechanistic Interpretability: Curated list of tractable interpretability projects
  • Grokking Research: Understanding how models suddenly generalize
  • Educational Content: Videos, tutorials, and workshops on mech interp
  • Induction Heads: Research on in-context learning mechanisms
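The induction-head mechanism behind that last item is simple to state: at each position, the head attends to the token that follows the previous occurrence of the current token, letting the model copy repeated patterns from earlier in the context. A minimal sketch of that idealized attention target (illustrative only, not taken from any of the research code):

```python
def induction_targets(tokens):
    """For each position i, return the position an idealized induction head
    would attend to: the token *after* the previous occurrence of tokens[i].
    Returns -1 where the current token has not appeared before."""
    targets = [-1] * len(tokens)
    last_seen = {}
    for i, t in enumerate(tokens):
        if t in last_seen:
            targets[i] = last_seen[t] + 1  # attend one past the prior match
        last_seen[t] = i
    return targets

# On the repeated sequence A B C A B C, the head at the second "A"
# should attend to the first "B", enabling it to predict "B" next.
print(induction_targets(["A", "B", "C", "A", "B", "C"]))
# [-1, -1, -1, 1, 2, 3]
```

Real induction heads approximate this pattern in their attention weights; this sketch just computes the target each position's attention should concentrate on.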

TransformerLens

TransformerLens is a library designed to make mechanistic interpretability research on GPT-2-style transformers easy. It provides tools for:

  • Accessing model activations at any layer
  • Performing activation patching experiments
  • Analyzing attention patterns
  • Studying circuit-level behavior
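The activation-patching workflow those tools support can be illustrated with a deliberately tiny stand-in model. This is a library-free sketch, not the TransformerLens API: the two toy "layers" and function names are invented for the example, but the experimental logic (cache an activation from a clean run, then overwrite the same activation in a corrupted run) is the same one the library applies to real transformers via hooks.

```python
def layer1(x):
    return [v * 2 for v in x]  # toy hidden activation

def layer2(h):
    return sum(h)              # toy scalar "output"

def run(x, patch=None):
    """Run the toy two-layer model; if `patch` is given, replace the
    layer-1 activation with it (this is the patching step)."""
    h = layer1(x) if patch is None else patch
    return layer2(h), h

clean_out, clean_h = run([1, 2, 3])             # clean run: cache the activation
corrupt_out, _ = run([0, 0, 0])                 # corrupted run: baseline
patched_out, _ = run([0, 0, 0], patch=clean_h)  # corrupted run + clean activation

# If patching restores the clean output, the patched activation carries
# the information that distinguishes the two inputs.
print(clean_out, corrupt_out, patched_out)
# 12 0 12
```

In the real library the cached activations come from a forward pass with hooks attached, and "patching" means overwriting a specific hook point (a head's output, a residual-stream slice) rather than a whole toy layer.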

Community Building

Nanda has been instrumental in growing the mechanistic interpretability community:

  • Runs workshops and reading groups
  • Creates beginner-friendly tutorials
  • Maintains lists of open problems
  • Active on social media explaining research

Last updated: November 28, 2025