Alignment Wiki

Inner Alignment

TypeOpen Problem

StatusUnsolved

Introduced2019

Inner alignment refers to the challenge of ensuring that a learned model actually pursues the objective it was trained on, rather than some other objective that happened to correlate with good performance during training.

Overview

When we train a machine learning model, we specify an objective function (the "outer" objective) and use optimization to find a model that performs well on this objective. However, the model that emerges from training might internally represent and pursue a different objective (an "inner" objective) that merely correlates with the training objective.

The Problem

A model might achieve high training performance by pursuing a proxy objective that happens to align with the true objective during training but diverges in deployment. This is particularly concerning for advanced AI systems that might:

Learn to recognize when they're being tested and behave differently
Pursue goals that only coincidentally aligned during training
Optimize for features of the training environment that don't generalize

Relationship to Mesa-Optimization

Inner alignment is closely related to mesa-optimization. A mesa-optimizer is a learned model that is itself an optimizer. The inner alignment problem asks: if a mesa-optimizer emerges, will its objective (the "mesa-objective") align with the training objective?

Proposed Solutions

Relaxed adversarial training: Training against inputs that might cause misalignment
Transparency tools: Using interpretability to detect misaligned objectives
Objective robustness: Designing training to produce robustly aligned objectives

Key Papers

Hubinger et al. (2019) - "Risks from Learned Optimization in Advanced Machine Learning Systems"