The Intelligence Scaling Law
10x compute/year for a decade produced emergent AI — no ceiling in sight
The Intelligence Scaling Law framework, drawn from Mustafa Suleyman's decade at DeepMind, holds that the compute behind frontier AI has grown roughly 10x per year for 10 consecutive years: from 2 petaflops in 2013 to 10 billion petaflops in 2023, a 5-billion-fold increase. This trajectory produced capabilities that were not predicted even by the researchers running the experiments. The key insight is that the same generation methods that work for locally structured data (neighboring image pixels forming edges, edges forming faces) also work for abstract sequential data like language, just at larger scale. This was not obvious in advance.
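The growth figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original framework: the function name is hypothetical, and the only inputs are the two endpoint values quoted in the text.

```python
def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Compound annual growth factor implied by two endpoint values."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end, and years must be positive")
    return (end / start) ** (1.0 / years)


START_PF = 2.0   # petaflops in 2013 (figure quoted in the text)
END_PF = 10e9    # petaflops in 2023 (figure quoted in the text)
YEARS = 10

total_fold = END_PF / START_PF
factor = annual_growth_factor(START_PF, END_PF, YEARS)

print(f"total increase: {total_fold:.1e}x")
print(f"implied annual growth factor: {factor:.2f}x")
```

The endpoints imply a compound rate of about 9.3x per year, which is consistent with the "roughly 10x" framing; an exact 10x per year for ten years would give a 10-billion-fold increase rather than 5 billion.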
The framework distinguishes between predictable and surprising capability emergence. Image generation and audio generation were intuitable because the data has local structure. Language models were genuinely surprising because language seemed like a qualitatively different kind of abstraction — yet the same training method, scaled up, produced GPT-class capabilities. This means the scaling law is more general than originally understood: it applies across data modalities, and the capabilities that emerge at each scale level cannot be reliably predicted by extrapolating from lower-scale behavior.
For decision-making, the framework's practical implication is that every 12–18 months now feels like a paradigm shift because it effectively is one. At roughly 10x per year, compute doubles in under four months, far faster than human institutions adapt. This creates a structural gap between AI capability and the governance, safety, and cultural infrastructure needed to manage it, the same gap that underlies the Containment Problem framework.
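To make the timing claim concrete, the sketch below converts an annual growth factor into a doubling time and into the growth accumulated over a 12 to 18 month window. The function names are hypothetical, and the round 10x-per-year figure from the text is taken at face value as an assumption.

```python
import math


def doubling_time_months(annual_factor: float) -> float:
    """Months needed for compute to double at a given annual growth factor."""
    if annual_factor <= 1:
        raise ValueError("annual_factor must be greater than 1")
    return 12.0 * math.log(2) / math.log(annual_factor)


def growth_over_months(annual_factor: float, months: float) -> float:
    """Total growth multiple accumulated over a window of `months`."""
    return annual_factor ** (months / 12.0)


ANNUAL_FACTOR = 10.0  # assumption: the round "10x per year" figure from the text

doubling = doubling_time_months(ANNUAL_FACTOR)
window_low = growth_over_months(ANNUAL_FACTOR, 12)
window_high = growth_over_months(ANNUAL_FACTOR, 18)

print(f"doubling time: {doubling:.1f} months")
print(f"growth over 12-18 months: {window_low:.0f}x to {window_high:.0f}x")
```

Under that assumption the doubling time is about 3.6 months, and an 18-month window corresponds to roughly a 32x increase in compute, which is the sense in which each 12–18 month period behaves like a new paradigm.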
- Compute scaling has been the primary driver of AI capability for over a decade — roughly 10x per year — and there is no known ceiling to this trajectory
- The same generation methods work across fundamentally different data modalities (images, audio, language) when applied at sufficient scale — modality boundaries are less fundamental than they appear
- Emergent capabilities cannot be reliably predicted by extrapolating from lower-scale behavior — the system invents strategies and knowledge that its designers did not anticipate and cannot always evaluate
- Once AI systems can generate knowledge humans cannot evaluate in real time, traditional safety mechanisms (testing, red-teaming, audit) become structurally inadequate
- Defensive AI capability must scale with offensive AI capability — creating an ongoing infrastructure demand that is non-discretionary
- Anchor to the compute trajectory, not the capability snapshot
  When assessing AI systems, ground the analysis in compute spend and scaling trend rather than current capability. Current capability is a lagging indicator; compute trajectory is the leading one. A system trained at 10x the compute of its predecessor will exhibit qualitatively different behavior, often in ways that cannot be predicted from the predecessor's performance.
  Pro tip: Use Suleyman's quantified anchor: 2 petaflops (2013 DeepMind Atari) to 10 billion petaflops (2023 Inflection Pi) in 10 years, or roughly 10x per year. Use this as the baseline for evaluating where any current system sits on the curve.
- Separate predictable from emergent capability
  For any new AI application, distinguish between capabilities that are predictable from scaling (image generation, audio synthesis, structured data processing) and capabilities where emergence is likely (cross-modal reasoning, strategic planning, novel knowledge generation). Predictable capabilities are easier to govern and plan for; emergent capabilities require a different risk posture.
  Warning: The history of scaling shows that capabilities researchers were confident would not emerge at a given scale level often do emerge, sometimes significantly before the predicted scale.
- Apply the Move 37 test to AI safety evaluations
  Ask: could the evaluators recognize a mistake in this system's output in real time? If the answer is no, because the system's reasoning is moving faster than human evaluation capacity, then the safety evaluation is already a lagging process. This is the Move 37 inflection point: the moment AI capability outpaces human real-time evaluation.
  Pro tip: The Move 37 test is a binary: either human evaluators can assess outputs in real time, or they cannot. If not, standard testing and red-teaming are insufficient, and the system needs AI-vs-AI defensive mechanisms.
- Project forward across modalities
  The scaling law's cross-modal generality means capability gains in one domain (language) transfer to adjacent domains (code, reasoning, multimodal) at scale. Project AI capability forward by identifying which adjacent modalities or domains become tractable at the next scale level, not by extrapolating current benchmark performance curves.
  Warning: Synthetic biology is the highest-risk adjacent modality: genome sequencing costs have fallen 1,000,000x since 2000, synthesis of novel DNA sequences is now possible, and AI-accelerated DNA design within 5–10 years creates engineered pathogen risk.
- Account for the defensive demand floor
  AI-vs-AI defensive applications (fraud detection, spam filtering, security anomaly detection) create non-discretionary, ongoing inference demand that scales with the offensive capability threat. This is a baseload demand case: it does not disappear in a downturn because the threat it defends against also does not disappear. Factor this into any AI infrastructure demand model.
  Pro tip: Suleyman's framing: 'We need AIs to defend us from AIs.' This creates a compound demand dynamic: more capable offensive AI requires more capable defensive AI, which requires more compute, which drives more capability development.
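The first step in the list above (anchoring to the compute trajectory) can be sketched as a small helper that places a system's training compute on the baseline curve. This is a hypothetical illustration only: the function names are invented here, the curve is a plain exponential fitted through the two anchor points quoted in the text, and extending it past 2023 is an assumption rather than a claim from the source.

```python
import math

BASE_YEAR, BASE_PF = 2013, 2.0   # anchor quoted in the text
END_YEAR, END_PF = 2023, 10e9    # anchor quoted in the text

# Annual growth factor implied by the two anchors (just under 10x).
ANNUAL_FACTOR = (END_PF / BASE_PF) ** (1.0 / (END_YEAR - BASE_YEAR))


def baseline_compute_pf(year: float) -> float:
    """Compute (petaflops) the fitted baseline curve gives for `year`."""
    return BASE_PF * ANNUAL_FACTOR ** (year - BASE_YEAR)


def equivalent_year(compute_pf: float) -> float:
    """Year at which the baseline curve reaches `compute_pf`."""
    if compute_pf <= 0:
        raise ValueError("compute_pf must be positive")
    return BASE_YEAR + math.log(compute_pf / BASE_PF) / math.log(ANNUAL_FACTOR)


def years_ahead_of_baseline(compute_pf: float, year: float) -> float:
    """Positive if the system sits ahead of the curve, negative if behind."""
    return equivalent_year(compute_pf) - year


# Hypothetical example: a system with 10x the compute the curve gives for 2020.
example_pf = 10 * baseline_compute_pf(2020)
print(f"equivalent year: {equivalent_year(example_pf):.2f}")
print(f"years ahead of baseline: {years_ahead_of_baseline(example_pf, 2020):+.2f}")
```

Expressing a system's position in "years ahead of or behind the curve" keeps the assessment tied to the trajectory rather than to a capability snapshot, which is the point of the step.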
DeepMind trained an AI on raw Atari game pixel data with only the game score as a reward signal — no human-provided strategies, no game knowledge. The system invented the tunnel-behind-the-wall strategy in Breakout, a technique human players had not systematically identified. This was the first empirical demonstration that scaled training on raw observational data produces genuinely novel strategic knowledge, not just recombination of human-provided patterns.
In Game 2 of the 2016 AlphaGo match against Lee Sedol, the AI played Move 37, a stone placement that commentators initially called a mistake. It took human Go experts time after the game to recognize the move's strategic depth. The AI had generated a move that human experts could not evaluate in real time, even knowing the game was against the world champion and expecting surprising play.
GPT-3, which required massive compute clusters when originally released by OpenAI, was reproduced in fully open-source form at 60–70x smaller scale within approximately 2–3 years. By 2023, GPT-3 class capability was 'completely freely available on the web.' This demonstrates the democratization timeline: frontier capability at year zero becomes consumer-hardware accessible within 2–3 years, regardless of initial access restrictions.
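The democratization timeline can be turned into a rough per-year rate. The sketch below is illustrative only: the function name is hypothetical, the inputs are the 60–70x and 2–3 year ranges quoted above, and because the source does not say what "scale" measures, the result should be read as a rough annual reduction in whatever that scale denotes.

```python
def annual_shrink_factor(total_reduction: float, years: float) -> float:
    """Per-year reduction factor implied by a total reduction over `years`."""
    if total_reduction <= 1 or years <= 0:
        raise ValueError("total_reduction must exceed 1 and years must be positive")
    return total_reduction ** (1.0 / years)


# Bounds taken from the ranges quoted in the text.
slowest = annual_shrink_factor(60, 3)   # smallest reduction over the longest window
fastest = annual_shrink_factor(70, 2)   # largest reduction over the shortest window

print(f"implied shrink per year: {slowest:.1f}x to {fastest:.1f}x")
```

On these figures, the scale needed for GPT-3 class capability fell by somewhere between about 4x and 8x per year.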
Suleyman was present at the founding experiments that established the scaling intuition. In 2013, DeepMind trained an AI on raw Atari game pixel data with only a reward signal (the game score). The system invented strategies for Breakout — including a tunnel-behind-the-wall strategy — that human players hadn't noticed. This was the first empirical demonstration that scaled training on raw data could produce genuinely novel knowledge, not just pattern-matching on human-generated strategies.
The AlphaGo project deepened the insight. Move 37 in Game 2 of the 2016 match against Lee Sedol, a move commentators initially thought was a mistake, turned out to be a move of such strategic depth that it could not be evaluated in real time by humans. This was the moment AI moved from 'executing human knowledge' to 'generating new knowledge humans cannot evaluate.' Suleyman was running DeepMind's applied AI division through this period, giving him direct observational access to the transition from narrow task performance to emergent strategic reasoning.