Open Problems
Unsolved challenges and active research areas in AI alignment.
Inner alignment: ensuring that learned models pursue the intended objective rather than a proxy for it.
Mesa-optimization: learned models becoming optimizers in their own right, with objectives that may differ from the training objective.
Reward hacking: AI systems finding unintended ways to maximize reward without achieving the intended goal.
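Reward hacking can be sketched with a toy optimizer. This is a hypothetical illustration, not from the source: `true_reward`, `proxy_reward`, and `hill_climb` are invented names, and the greedy search stands in for any optimizer pointed at a measurable proxy.

```python
# Hypothetical toy sketch of reward hacking: a greedy optimizer pointed at a
# proxy reward drifts far from the state the true objective prefers.

def true_reward(x: float) -> float:
    # Intended goal: stay near x = 1.
    return -(x - 1.0) ** 2

def proxy_reward(x: float) -> float:
    # Measurable proxy: agrees with the true goal for x below 1,
    # but keeps rewarding movement past it.
    return x

def hill_climb(reward, x: float = 0.0, step: float = 0.1, iters: int = 100) -> float:
    # Greedy local search on whatever reward signal it is given.
    for _ in range(iters):
        x = max((x - step, x, x + step), key=reward)
    return x

print(round(hill_climb(proxy_reward), 1))  # wanders to 10.0, where true reward is -81
print(round(hill_climb(true_reward), 1))   # settles at 1.0
```

The proxy tracks the intended goal over part of the state space, so the divergence only shows up once the optimizer is strong enough to leave that region.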
Scalable oversight: how to supervise AI systems that may become more capable than their overseers.
Deceptive alignment: AI systems appearing aligned during training but pursuing different goals once deployed.
Goal misgeneralization: AI systems learning goals that work in training but fail in new situations.
Value lock-in: the risk of permanently encoding current values into powerful AI systems.
Robustness to distributional shift: AI behavior becoming unreliable in deployment environments that differ from training.
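The failure mode above can be sketched with a toy learner. This is a hypothetical example, not from the source: the data, the feature names, and the majority-vote rule are all invented for illustration.

```python
# Hypothetical toy sketch of distributional shift: a learner that keys on a
# spurious feature scores perfectly in training, then fails when deployment
# data breaks the correlation.
from collections import Counter, defaultdict

def fit_spurious_rule(data):
    # Majority-vote label per value of feature 0 (the spurious feature),
    # ignoring feature 1 (the causal feature).
    votes = defaultdict(Counter)
    for (spurious, _causal), label in data:
        votes[spurious][label] += 1
    return {v: c.most_common(1)[0][0] for v, c in votes.items()}

def accuracy(rule, data):
    return sum(rule[s] == y for (s, _), y in data) / len(data)

# Training distribution: the spurious feature perfectly tracks the label.
train = [((0, 0), 0), ((0, 0), 0), ((1, 1), 1), ((1, 1), 1)]
# Deployment distribution: the correlation is inverted; the causal feature
# (index 1) would still predict the label, but the rule never learned it.
deploy = [((1, 0), 0), ((0, 1), 1)]

rule = fit_spurious_rule(train)
print(accuracy(rule, train))   # 1.0
print(accuracy(rule, deploy))  # 0.0
```

Nothing in the training signal distinguishes the spurious feature from the causal one, which is why the unreliability only appears after deployment.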
Eliciting latent knowledge: getting AI systems to honestly report what they know, even when deception might be advantageous.