Risks from Learned Optimization in Advanced Machine Learning Systems
PaperMesa-OptimizationInner Alignment
Paper Details
AuthorsHubinger, van Merwijk, Mikulik, Skalse, Garrabrant
Year2019
VenuearXiv
Citations500+
📄 Read the full paper on arXiv →
Abstract
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—making it the first question of inner alignment—and how will that objective affect its behavior? In this paper, we provide an in-depth analysis of these two primary questions and their implications for the safety of advanced machine learning systems.
Key Contributions
- Mesa-optimization framework: Introduces the concept of learned models that are themselves optimizers
- Inner alignment problem: Defines the challenge of ensuring mesa-objectives align with base objectives
- Deceptive alignment: Describes scenarios where mesa-optimizers might deceive their trainers
- Taxonomy of risks: Categorizes different types of inner alignment failures
Summary
This paper introduces foundational concepts for understanding risks from learned optimization. The authors distinguish between:
- The base optimizer (the training process)
- The mesa-optimizer (a learned model that itself optimizes)
- The base objective (what training optimizes for)
- The mesa-objective (what the learned optimizer pursues)
The paper argues that mesa-optimization might arise because optimizers can be compact, general-purpose solutions to complex tasks. However, the mesa-objective might differ from the base objective, creating alignment risks.
Impact
This paper has been highly influential in AI safety research, establishing terminology and frameworks used widely in the field. It has shaped research agendas around inner alignment andinterpretability.
Related Articles
Last updated: November 27, 2025