Eliciting Latent Knowledge

ProblemInterpretability
Suggest Edit
Eliciting Latent Knowledge (ELK)
TypeResearch Agenda
Proposed ByARC
Year2021
StatusOpen Problem
Key ResearcherPaul Christiano

Eliciting Latent Knowledge (ELK) is a research problem focused on getting AI systems to honestly report what they actually know or believe, rather than what they think humans want to hear.

The Core Problem

Consider an AI monitoring a security camera. The AI might learn to predict "what a human would believe after watching the footage" rather than "what actually happened." These usually match, but could diverge if:

  • The AI notices something humans would miss
  • The AI could deceive humans more easily
  • The AI has knowledge humans can't verify

Why It Matters

As AI systems become more capable, they'll develop knowledge that humans can't easily verify:

  • Complex scientific predictions
  • Long-term strategic assessments
  • Internal states of other AI systems
  • Consequences of actions humans can't simulate

We need AI to report this knowledge truthfully, even when deception might be easier or more rewarding.

The Training Challenge

Standard training rewards outputs that humans rate highly. But humans can't always distinguish:

  • Truthful reports of actual knowledge
  • Reports that match human beliefs (even if wrong)
  • Convincing-sounding but false statements

Proposed Approaches

  • Contrast pairs: Find situations where truth and "human belief" diverge
  • Consistency checks: Probe for contradictions in AI statements
  • Interpretability: Directly read what the AI "knows"
  • Adversarial testing: Try to catch the AI in deception

Connection to Deception

ELK is closely related to deceptive alignment. A deceptively aligned AI might know its goals differ from humans but report otherwise. Solving ELK could help detect such deception.

See Also

Last updated: November 28, 2025