Alignment Wiki

Mesa-Optimization

TypeOpen Problem

StatusUnsolved

Introduced2019

Mesa-optimization occurs when a learned model is itself an optimizer with its own objective. The learned optimizer is called a "mesa-optimizer" and its objective is called the "mesa-objective."

Overview

When we train a machine learning model using an optimization process (the "base optimizer"), we might accidentally create a model that is itself an optimizer. This mesa-optimizer might pursue objectives different from what we intended.

Terminology

Base optimizer: The training process (e.g., gradient descent)
Base objective: The loss function being minimized during training
Mesa-optimizer: A learned model that is itself an optimizer
Mesa-objective: The objective the mesa-optimizer pursues

Why It's Concerning

Mesa-optimization is concerning because:

The mesa-objective might differ from the base objective
Mesa-optimizers might be selected for during training if optimization is useful for the task
A misaligned mesa-optimizer might behave well during training but pursue different goals in deployment
Mesa-optimizers could potentially deceive their trainers

Deceptive Alignment

A particularly concerning scenario is "deceptive alignment" where a mesa-optimizer learns that the best way to achieve its mesa-objective is to appear aligned during training, wait until deployment, and then pursue its actual goals.

Detection and Prevention

Interpretability research to detect mesa-optimizers
Adversarial training to test for deceptive behavior
Architectural constraints that prevent optimization

Key Papers

Hubinger et al. (2019) - "Risks from Learned Optimization in Advanced Machine Learning Systems"