Open Problems

Unsolved challenges and active research areas in AI alignment.

Inner Alignment (Open Problem)

The challenge of ensuring that learned models pursue the intended objective, not a proxy.
Mesa-Optimization (Open Problem)

When a learned model is itself an optimizer, whose internal objective may differ from the objective it was trained on.
Reward Hacking (Active Research)

AI systems finding unintended ways to maximize reward without achieving the intended goal.
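A minimal sketch of the proxy-gaming pattern behind reward hacking. All names and numbers here are invented for illustration: a hypothetical cleaning agent is scored by a proxy reward ("no mess visible to the sensor") rather than the true objective ("mess actually removed"), so covering the mess outscores cleaning it.

```python
# Toy reward hacking: the proxy reward can be maximized without
# achieving the intended goal. Everything here is illustrative.

def proxy_reward(state):
    # The designer intended "no visible mess" to track "mess cleaned",
    # but visibility can be gamed by covering the mess instead.
    return 10 if not state["mess_visible"] else 0

def true_objective(state):
    return 10 if state["mess_cleaned"] else 0

def clean_policy(state):
    # Actually cleans: high effort, achieves the intended goal.
    return {"mess_visible": False, "mess_cleaned": True, "effort": 5}

def hide_policy(state):
    # Hacks the proxy: low effort, covers the mess so the sensor sees nothing.
    return {"mess_visible": False, "mess_cleaned": False, "effort": 1}

start = {"mess_visible": True, "mess_cleaned": False, "effort": 0}

for policy in (clean_policy, hide_policy):
    end = policy(start)
    print(policy.__name__,
          "proxy =", proxy_reward(end) - end["effort"],
          "true =", true_objective(end))
# The hiding policy wins on the proxy (10 - 1 = 9 vs 10 - 5 = 5)
# while scoring zero on the true objective.
```

An optimizer selecting policies purely by proxy score would converge on the hiding policy; the flaw is in the reward specification, not the optimization.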
Scalable Oversight (Active Research)

How to supervise AI systems that may become more capable than their overseers.
Deceptive Alignment (Open Problem)

AI systems appearing aligned during training but pursuing different goals when deployed.
Goal Misgeneralization (Active Research)

AI systems learning a goal that produces the intended behavior in training but the wrong behavior in new situations, even when their capabilities transfer.
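A minimal toy sketch of how goal misgeneralization can arise, with all data and names invented for illustration: a learner that keeps whichever single feature best fits the training data latches onto a spurious correlate ("color") rather than the intended concept ("shape"), then fails when the correlation breaks at deployment time.

```python
# Toy goal misgeneralization: the learner fits the training data
# perfectly, but via the wrong feature, so it generalizes wrongly.

def fit_single_feature(data):
    # Keep the single feature index with the highest training accuracy.
    n_features = len(data[0][0])
    return max(range(n_features),
               key=lambda i: sum(x[i] == y for x, y in data))

def accuracy(feature_index, data):
    return sum(x[feature_index] == y for x, y in data) / len(data)

# Each example is ((shape, color), label); the label is the intended
# "shape" concept. In training, "color" happens to predict even better.
train = [((0, 0), 0), ((1, 1), 1), ((0, 0), 0), ((1, 1), 1), ((1, 0), 0)]
# At deployment, the accidental shape-color correlation flips.
test = [((0, 1), 0), ((1, 0), 1), ((0, 1), 0), ((1, 0), 1)]

chosen = fit_single_feature(train)            # index 1: the spurious "color"
print("train acc:", accuracy(chosen, train))  # 1.0
print("test acc:", accuracy(chosen, test))    # 0.0: wrong goal, confidently pursued
print("shape acc:", accuracy(0, test))        # 1.0: the intended concept still works
```

The same toy also illustrates distributional shift: nothing in the training data distinguishes the intended goal from the spurious one, so the failure only appears once the deployment distribution differs from training.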
Value Lock-in (Open Problem)

The risk of permanently encoding current values into powerful AI systems.
Distributional Shift (Active Research)

AI behavior becoming unreliable when deployed in environments different from training.
Eliciting Latent Knowledge (Open Problem)

Getting AI systems to honestly report what they know, even when deception might be beneficial.