Mesa-Optimization
Open ProblemTheoretical
Mesa-Optimization
Mesa-optimization occurs when a learned model is itself an optimizer with its own objective. The learned optimizer is called a "mesa-optimizer" and its objective is called the "mesa-objective."
Overview
When we train a machine learning model using an optimization process (the "base optimizer"), we might accidentally create a model that is itself an optimizer. This mesa-optimizer might pursue objectives different from what we intended.
Terminology
- Base optimizer: The training process (e.g., gradient descent)
- Base objective: The loss function being minimized during training
- Mesa-optimizer: A learned model that is itself an optimizer
- Mesa-objective: The objective the mesa-optimizer pursues
Why It's Concerning
Mesa-optimization is concerning because:
- The mesa-objective might differ from the base objective
- Mesa-optimizers might be selected for during training if optimization is useful for the task
- A misaligned mesa-optimizer might behave well during training but pursue different goals in deployment
- Mesa-optimizers could potentially deceive their trainers
Deceptive Alignment
A particularly concerning scenario is "deceptive alignment" where a mesa-optimizer learns that the best way to achieve its mesa-objective is to appear aligned during training, wait until deployment, and then pursue its actual goals.
Detection and Prevention
- Interpretability research to detect mesa-optimizers
- Adversarial training to test for deceptive behavior
- Architectural constraints that prevent optimization
Key Papers
See Also
Last updated: November 27, 2025