Inner Alignment
Inner alignment refers to the challenge of ensuring that a learned model actually pursues the objective it was trained on, rather than some other objective that happened to correlate with good performance during training.
Overview
When we train a machine learning model, we specify an objective function (the "outer" objective) and use optimization to find a model that performs well on this objective. However, the model that emerges from training might internally represent and pursue a different objective (an "inner" objective) that merely correlates with the training objective.
The Problem
A model might achieve high training performance by pursuing a proxy objective that happens to align with the true objective during training but diverges in deployment. This is particularly concerning for advanced AI systems that might:
- Learn to recognize when they're being tested and behave differently
- Pursue goals that only coincidentally aligned during training
- Optimize for features of the training environment that don't generalize
Relationship to Mesa-Optimization
Inner alignment is closely related to mesa-optimization. A mesa-optimizer is a learned model that is itself an optimizer. The inner alignment problem asks: if a mesa-optimizer emerges, will its objective (the "mesa-objective") align with the training objective?
Proposed Solutions
- Relaxed adversarial training: Training against inputs that might cause misalignment
- Transparency tools: Using interpretability to detect misaligned objectives
- Objective robustness: Designing training to produce robustly aligned objectives