Inner Alignment

Type: Open Problem (Theoretical)
Status: Unsolved
Introduced: 2019

Inner alignment refers to the challenge of ensuring that a learned model actually pursues the objective it was trained on, rather than some other objective that happened to correlate with good performance during training.

Overview

When we train a machine learning model, we specify an objective function (the "outer" objective) and use optimization to find a model that performs well on this objective. However, the model that emerges from training might internally represent and pursue a different objective (an "inner" objective) that merely correlates with the training objective.
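One way to make the distinction concrete is symbolically (the notation here is illustrative, loosely following the standard base-objective/inner-objective framing, not a formula taken from this article):

    % Training (approximately) minimizes the base, or outer, objective over the
    % training distribution:
    \theta^{*} \;\approx\; \arg\min_{\theta}\; \mathbb{E}_{x \sim D_{\mathrm{train}}}\!\left[ L_{\mathrm{base}}(f_{\theta}, x) \right]

    % The trained model f_{\theta^{*}} may nonetheless behave as if it optimizes a
    % different internal objective L_{\mathrm{inner}}, where
    L_{\mathrm{inner}} \approx L_{\mathrm{base}} \ \text{on } D_{\mathrm{train}},
    \qquad\text{but}\qquad
    L_{\mathrm{inner}} \neq L_{\mathrm{base}} \ \text{off-distribution.}

Inner alignment asks how to rule out the second relation, i.e. how to ensure the model's effective internal objective keeps matching the base objective away from the training distribution.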

The Problem

A model might achieve high training performance by pursuing a proxy objective that happens to align with the true objective during training but diverges in deployment (a toy sketch of this failure mode follows the list below). This is particularly concerning for advanced AI systems that might:

  • Learn to recognize when they're being tested and behave differently
  • Pursue goals that were only coincidentally aligned with the intended objective during training
  • Optimize for features of the training environment that don't generalize
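A minimal, purely illustrative sketch of this failure mode (the gridworld, the two policies, and the distributions are invented for this example, not taken from any experiment): both policies earn full reward during training, but the proxy policy collapses once the training-time correlation breaks.

    # Toy gridworld: the intended objective is "end on the coin", but during
    # training the coin always happens to be in the rightmost cell, so the proxy
    # "always move right" is indistinguishable from the intended behaviour.
    import random

    def rollout(policy, coin_pos, size=10, steps=12):
        """Run a policy on a 1-D grid; reward 1.0 if the agent ends on the coin."""
        pos = 0
        for _ in range(steps):
            pos = max(0, min(size - 1, pos + policy(pos, coin_pos)))
        return 1.0 if pos == coin_pos else 0.0

    # Intended behaviour: move toward the coin.
    go_to_coin = lambda pos, coin: 1 if coin > pos else (-1 if coin < pos else 0)
    # Proxy behaviour: always move right (correlated with reward only in training).
    go_right = lambda pos, coin: 1

    train_coins = [9] * 100                                    # coin always at the far right
    deploy_coins = [random.randrange(10) for _ in range(100)]  # coin anywhere

    for name, policy in [("go_to_coin", go_to_coin), ("go_right", go_right)]:
        train = sum(rollout(policy, c) for c in train_coins) / len(train_coins)
        deploy = sum(rollout(policy, c) for c in deploy_coins) / len(deploy_coins)
        print(f"{name:10s}  train reward: {train:.2f}   deploy reward: {deploy:.2f}")

On the training distribution the two policies are behaviorally identical, so no amount of training-time evaluation distinguishes them; the misalignment only becomes visible once the coin's position varies.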

Relationship to Mesa-Optimization

Inner alignment is closely related to mesa-optimization. A mesa-optimizer is a learned model that is itself an optimizer. The inner alignment problem asks: if a mesa-optimizer emerges, will its objective (the "mesa-objective") align with the objective the outer training process was selecting for (the "base objective")?
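As a deliberately simplified sketch of the structural point (the function names and the specific internal objective are invented here for illustration): a mesa-optimizer's forward computation is itself a search over actions scored by an internal, learned objective, and that internal objective is what inner alignment is about.

    # Caricature of a mesa-optimizer: the "model" is itself an optimizer whose
    # search criterion is an internal mesa-objective that training merely selected.
    def mesa_objective(state, action):
        """The objective the model internally represents; it may or may not match
        the base objective that the outer training loop was scoring."""
        return -abs((state + action) - 9)  # e.g. "reach cell 9", a proxy for the real goal

    def mesa_optimizer_policy(state, actions=(-1, 0, 1)):
        """Forward pass = internal optimization: pick the action that the
        mesa-objective rates highest."""
        return max(actions, key=lambda a: mesa_objective(state, a))

    # Inner alignment asks whether mesa_objective agrees with the base objective
    # everywhere, not just in the situations seen during training.
    print(mesa_optimizer_policy(state=3))  # -> 1 (the internal search says "move right")

The outer optimizer (e.g. gradient descent on a reward signal) only ever sees the behavior of mesa_optimizer_policy, never mesa_objective directly, which is why the two objectives can come apart.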

Proposed Solutions

  • Relaxed adversarial training: Training the model against inputs, or descriptions of inputs, that might induce misaligned behavior (a schematic sketch follows this list)
  • Transparency tools: Using interpretability to detect misaligned objectives
  • Objective robustness: Designing training to produce robustly aligned objectives
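A highly schematic sketch of the adversarial-training idea from the first bullet (every component below is a toy placeholder; in genuinely "relaxed" adversarial training the adversary proposes descriptions or pseudo-inputs rather than concrete inputs, which this sketch does not capture):

    # Schematic loop: an adversary hunts for inputs on which the model's behaviour
    # looks unacceptable, and the model is then trained against those inputs.
    import random

    def acceptable(model, x):
        """Overseer's acceptability check (placeholder: bound the model's output)."""
        return abs(model(x)) <= 5.0

    def find_adversarial_input(model, tries=100):
        """Adversary: search for an input that makes the model look misaligned."""
        bad = [x for x in (random.uniform(-10, 10) for _ in range(tries))
               if not acceptable(model, x)]
        return max(bad, key=lambda x: abs(model(x))) if bad else None

    def train_step(weight, x):
        """Placeholder update: nudge the model toward acceptable behaviour on x."""
        return weight * 0.5

    weight = 3.0
    model = lambda x: weight * x  # a trivial stand-in "model"

    for _ in range(10):
        x = find_adversarial_input(model)
        if x is None:        # no misalignment-inducing input found this round
            break
        weight = train_step(weight, x)

    print("final weight:", weight)

The hard part this sketch glosses over is the adversary itself: for a capable model, the misalignment-inducing inputs may be exactly the ones that are hard to generate concretely, which is what motivates the "relaxed" variant.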

Key Papers

  • Hubinger, E., et al. (2019). "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv:1906.01820.

See Also

Last updated: November 27, 2025