Redwood Research
Redwood Research is a nonprofit AI safety research lab focused on applied alignment research. The organization develops practical techniques for making AI systems safer, emphasizing empirical approaches over purely theoretical frameworks.
Overview
Founded by researchers from the effective altruism community, Redwood Research takes a hands-on approach to alignment. The lab works directly with language models to develop and test safety techniques that can be applied to current systems.
Key Research Areas
Adversarial Training
Redwood developed techniques for training models to be robust against adversarial inputs, including methods for preventing language models from producing harmful outputs even when prompted to do so.
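As a rough illustration of the general idea, the sketch below applies classic FGSM-style adversarial training to a toy PyTorch classifier: adversarial inputs are crafted by ascending the loss gradient, then folded back into training. The architecture, data, and epsilon value are illustrative assumptions; Redwood's actual work targeted language models and relied on human and automated red-teaming rather than gradient perturbations.

```python
# Minimal adversarial-training sketch (FGSM-style) on a toy classifier.
# Illustrative only: not Redwood's actual setup, which attacked language
# models with human and automated red-teaming.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
epsilon = 0.1  # perturbation budget (assumed hyperparameter)

for step in range(100):
    x = torch.randn(64, 10)         # stand-in for real inputs
    y = (x.sum(dim=1) > 0).long()   # stand-in for real labels

    # Craft adversarial inputs: ascend the loss gradient w.r.t. x.
    x_adv = x.clone().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # Train on clean and adversarial batches together.
    opt.zero_grad()
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    loss.backward()
    opt.step()
```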
Causal Scrubbing
The lab developed "causal scrubbing," a technique for rigorously testing hypotheses about how neural networks work internally. This contributes to the broader field of interpretability.
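The core operation behind causal scrubbing is a resampling ablation: activations that a hypothesis claims are irrelevant are replaced with activations drawn from other inputs, and the model's behavior is checked for change. The sketch below shows that single operation on a toy network; the full method resamples recursively according to a structured hypothesis about the computation, and the model, task, and choice of K here are assumptions for illustration.

```python
# Minimal resampling-ablation sketch in the spirit of causal scrubbing.
# Hypothesis under test: only the first K hidden units matter for the output.
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1, lin2 = nn.Linear(8, 16), nn.Linear(16, 1)
K = 4  # claimed-relevant units [0, K) (assumed for illustration)

x = torch.randn(256, 8)
with torch.no_grad():
    h = lin1(x)                      # hidden activations
    clean_out = lin2(torch.relu(h))

    # Resample the claimed-irrelevant units from a random permutation
    # of the batch, keeping the claimed-relevant units intact.
    perm = torch.randperm(x.shape[0])
    h_scrubbed = h.clone()
    h_scrubbed[:, K:] = h[perm, K:]
    scrubbed_out = lin2(torch.relu(h_scrubbed))

# If the hypothesis were right, scrubbing would barely change the output.
print("mean |delta output|:", (clean_out - scrubbed_out).abs().mean().item())
```

A small output change supports the hypothesis; a large one refutes it, which is what makes the test falsifiable rather than merely suggestive.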
Activation Engineering
Research on directly manipulating model activations to control behavior, related to steering vectors and representation engineering approaches.
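A common recipe in this line of work derives a steering vector as the difference of mean activations on two contrasting input sets, then adds it back into a layer's activations at inference time. The toy model, contrast data, and scale factor below are illustrative assumptions, not Redwood's specific setup.

```python
# Minimal steering-vector sketch: difference-of-means direction added back
# via a forward hook. Toy model and hyperparameters are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1, layer2 = nn.Linear(16, 32), nn.Linear(32, 16)
model = nn.Sequential(layer1, nn.ReLU(), layer2)

# Two input distributions standing in for, e.g., contrasting prompt styles.
pos = torch.randn(128, 16) + 0.5
neg = torch.randn(128, 16) - 0.5
with torch.no_grad():
    steer = layer1(pos).mean(0) - layer1(neg).mean(0)  # difference of means

scale = 2.0  # steering strength (assumed hyperparameter)

def add_steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + scale * steer

handle = layer1.register_forward_hook(add_steer)
steered = model(torch.randn(4, 16))  # forward pass with steered activations
handle.remove()                      # restore unmodified behavior
```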
Notable Projects
- Adversarial training for language models
- Causal scrubbing methodology
- Injury classifier project
- Representation engineering research