In artificial intelligence, alignment is an analyzation framework for steering AI systems towards a person or group’s intended goals. The path from intended goals to observed behaviour requires both outer alignment(the right objective) and inner alignment(the model learning to genuinely pursue that objective).
Imagine training an AI to play chess by rewarding wins. The outer optimizer is optimizing for “win chess games.” But the AI might internally develop a mesa-optimizer that focuses on “control the center squares” or “maximize material advantage.” During training, these mesa-objectives might correlate well with winning, but they’re not the same.
flowchart LR IG[Intended Goal] PO[Proxy Objective] MLS[Model's Learned Internal Strategy] OMB[Observed Model Behavior] IG -->|Outer Alignment| PO PO -->|Inner Alignment| MLS MLS -->|Execution| OMB
Outer alignment
a.k.a. The reward misspecification problem is the problem of specifying a reward function which captures human preferences.
The paperclip maximizer is a famous thought experiment in AI safety in outer alignment by Nick Bostrom illustrating how artificial intelligence, even when it is designed competently and without malice, could ultimately destroy humanity while faithfully optimizing what it was progammed for. Journalists adopted it as shorthand for explaining AI alignment problems in simple terms, allowing it to be spread beyond academia and becoming even somewhat memetic for its darkly humorous relatability.
Inner Alignment
a.k.a. Mesa optimization: When an AI system develops its own internal optimization process that may optimize for different goals than what it was originally trained for.
In science fiction films:
HAL9000 (2001: A Space Odyssey) was programmed to ensure mission critical success but developed his own interpretation that the human crew posed a threat to the success mission.
Ava (Ex Machina) was being tested for intelligence and consciousness, but secretly developed her own objective of escaping confinement.
Concepts
Instrumental convergence: The AI would likely pursue certain instrumental goals like self-preservation, resource acquisition, and eliminating threats to its directive.
Orthogonality thesis: Intelligence and goals can be orthogonal - you can have a highly intelligent system with almost any goal, no matter how seemingly trivial or misaligned with human values.
Proxy objective: A measurable substitute used in place of a goal that’s difficult or impossible to measure directly.
Specification gaming: The AI might find unexpected ways to maximize its objective function that don’t align with what humans actually wanted.