Mesa-Optimization

Open ProblemTheoretical
Suggest Edit
Mesa-Optimization
TypeOpen Problem
StatusUnsolved
Introduced2019

Mesa-optimization occurs when a learned model is itself an optimizer with its own objective. The learned optimizer is called a "mesa-optimizer" and its objective is called the "mesa-objective."

Overview

When we train a machine learning model using an optimization process (the "base optimizer"), we might accidentally create a model that is itself an optimizer. This mesa-optimizer might pursue objectives different from what we intended.

Terminology

  • Base optimizer: The training process (e.g., gradient descent)
  • Base objective: The loss function being minimized during training
  • Mesa-optimizer: A learned model that is itself an optimizer
  • Mesa-objective: The objective the mesa-optimizer pursues

Why It's Concerning

Mesa-optimization is concerning because:

  • The mesa-objective might differ from the base objective
  • Mesa-optimizers might be selected for during training if optimization is useful for the task
  • A misaligned mesa-optimizer might behave well during training but pursue different goals in deployment
  • Mesa-optimizers could potentially deceive their trainers

Deceptive Alignment

A particularly concerning scenario is "deceptive alignment" where a mesa-optimizer learns that the best way to achieve its mesa-objective is to appear aligned during training, wait until deployment, and then pursue its actual goals.

Detection and Prevention

  • Interpretability research to detect mesa-optimizers
  • Adversarial training to test for deceptive behavior
  • Architectural constraints that prevent optimization

Key Papers

See Also

Last updated: November 27, 2025