INNOVATION · Ongoing practice · 88% confidence

The Intelligence Scaling Law

10x compute/year for a decade produced emergent AI — no ceiling in sight

Problem it solves

Why AI feels faster than it should, and what the compute-scaling trajectory implies for the next decade

Best for

Understanding why AI capability timelines are compressing; why predicting specific capabilities is hard; analysts, founders, and investors tracking the AI trajectory

Not ideal for

Making specific capability claims or predicting AGI timing with precision

Overview

Why this framework exists

The Intelligence Scaling Law framework, drawn from Mustafa Suleyman's decade at DeepMind, holds that the compute behind frontier AI systems has grown roughly 10x per year for 10 consecutive years: from 2 petaflops in 2013 to 10 billion petaflops in 2023, a 5-billion-fold increase. This trajectory produced capabilities that even the researchers running the experiments did not predict. The key insight is that the generative methods that learn structured local data (neighboring pixels forming edges, edges forming faces) also work for abstract sequential data like language, just at larger scale. This was not obvious in advance.
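As a quick arithmetic check on the figures above, the sketch below computes the compound annual growth factor implied by the two quoted endpoints. The variable names are illustrative, and the endpoint values are simply the ones quoted in the text, not independently verified.

```python
# Endpoints as quoted in the text (petaflops); not independently verified.
START_PF = 2.0   # 2013
END_PF = 10e9    # 2023, i.e. 10 billion
YEARS = 10

# Total growth between the two endpoints (the "5-billion-fold" figure).
total_growth = END_PF / START_PF

# Compound (geometric-mean) growth factor per year over the period.
annual_factor = total_growth ** (1 / YEARS)

print(f"total growth:  {total_growth:.3g}x")
print(f"annual factor: {annual_factor:.2f}x per year")
```

On these endpoints the factor comes out near 9.3x per year, which is why "roughly" 10x is the right wording; an exact 10x per year would compound to a 10-billion-fold increase over the decade.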

The framework distinguishes between predictable and surprising capability emergence. Image generation and audio generation were intuitable because the data has local structure. Language models were genuinely surprising because language seemed like a qualitatively different kind of abstraction — yet the same training method, scaled up, produced GPT-class capabilities. This means the scaling law is more general than originally understood: it applies across data modalities, and the capabilities that emerge at each scale level cannot be reliably predicted by extrapolating from lower-scale behavior.

For decision-making, the framework's practical implication is that every 12–18 months now feels like a paradigm shift because it effectively is one. The compute doubling time is faster than human institutional adaptation cycles. This creates a structural gap between AI capability and the governance, safety, and cultural infrastructure needed to manage it — the same gap that underlies the Containment Problem framework.
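To make the pace concrete, the following sketch converts the "roughly 10x per year" figure into a doubling time and counts how many doublings fit into a 12 to 18 month window. It assumes a constant growth rate, which is a simplification, and the function names are hypothetical.

```python
import math

ANNUAL_FACTOR = 10.0  # "roughly 10x per year", taken from the text


def doubling_time_months(annual_factor: float) -> float:
    """Months needed for compute to double at a constant annual growth factor."""
    return 12 * math.log(2) / math.log(annual_factor)


def doublings_in(months: float, annual_factor: float) -> float:
    """Number of doublings that fit into a window of the given length."""
    return months / doubling_time_months(annual_factor)


print(f"doubling time: {doubling_time_months(ANNUAL_FACTOR):.1f} months")
for window in (12, 18):
    print(f"{window} months: {doublings_in(window, ANNUAL_FACTOR):.1f} doublings")
```

Under this assumption compute doubles about every 3.6 months, so a single 12 to 18 month planning cycle spans roughly three to five doublings.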

Core principles

5 total
  1. Compute scaling has been the primary driver of AI capability for over a decade — roughly 10x per year — and there is no known ceiling to this trajectory
  2. The same generation methods work across fundamentally different data modalities (images, audio, language) when applied at sufficient scale — modality boundaries are less fundamental than they appear
  3. Emergent capabilities cannot be reliably predicted by extrapolating from lower-scale behavior — the system invents strategies and knowledge that its designers did not anticipate and cannot always evaluate
  4. Once AI systems can generate knowledge humans cannot evaluate in real time, traditional safety mechanisms (testing, red-teaming, audit) become structurally inadequate
  5. Defensive AI capability must scale with offensive AI capability — creating an ongoing infrastructure demand that is non-discretionary

Steps

5 steps
  1. Anchor to the compute trajectory, not the capability snapshot
    When assessing AI systems, ground the analysis in compute spend and scaling trend rather than current capability. Current capability is a lagging indicator; compute trajectory is the leading one. A system trained at 10x the compute of its predecessor will exhibit qualitatively different behavior, often in ways that cannot be predicted from the predecessor's performance.
    Pro tip: Use Suleyman's quantified anchor: 2 petaflops (2013, DeepMind Atari) to 10 billion petaflops (2023, Inflection Pi) in 10 years, or roughly 10x per year. Use this as the baseline for locating any current system on the curve.
  2. Separate predictable from emergent capability
    For any new AI application, distinguish between capabilities that are predictable from scaling (image generation, audio synthesis, structured data processing) and capabilities where emergence is likely (cross-modal reasoning, strategic planning, novel knowledge generation). Predictable capabilities are easier to govern and plan for; emergent capabilities require a different risk posture.
    Warning: The history of scaling shows that capabilities that researchers were confident would not emerge at a given scale often do emerge, sometimes well before the predicted scale.
  3. Apply the Move 37 test to AI safety evaluations
    Ask: could the evaluators recognize a mistake in this system's output in real time? If the answer is no — if the system's reasoning is moving faster than human evaluation capacity — then the safety evaluation is already a lagging process. This is the Move 37 inflection point: the moment AI capability outpaces human real-time evaluation.
    Pro tip: The Move 37 test is binary: either human evaluators can assess outputs in real time, or they cannot. If they cannot, standard testing and red-teaming are insufficient, and the system needs AI-vs-AI defensive mechanisms.
  4. Project forward across modalities
    The scaling law's cross-modal generality means capability gains in one domain (language) transfer to adjacent domains (code, reasoning, multimodal) at scale. Project AI capability forward by identifying which adjacent modalities or domains become tractable at the next scale level — not by extrapolating current benchmark performance curves.
    Warning: Synthetic biology is the highest-risk adjacent modality. Genome sequencing costs have fallen 1,000,000x since 2000, synthesis of novel DNA sequences is now possible, and AI-accelerated DNA design within 5–10 years creates engineered pathogen risk.
  5. Account for the defensive demand floor
    AI-vs-AI defensive applications (fraud detection, spam filtering, security anomaly detection) create non-discretionary, ongoing inference demand that scales with the offensive capability threat. This is a baseload demand case — it does not disappear in a downturn because the threat it defends against also does not disappear. Factor this into any AI infrastructure demand model.
    Pro tip: Suleyman's framing: 'We need AIs to defend us from AIs.' This creates a compound demand dynamic: more capable offensive AI requires more capable defensive AI, which requires more compute, which drives more capability development.
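Steps 1, 3, and 5 above can be sketched as three small helpers. This is a minimal illustration, not a method from the source: the function names, the timing proxy used for the Move 37 test, and the demand figures in the usage lines are all hypothetical. Only the 2013 anchor of 2 petaflops and the 10x-per-year factor come from the text.

```python
import math

# Anchor values quoted in the text; everything else here is hypothetical.
ANCHOR_YEAR = 2013
ANCHOR_PF = 2.0
ANNUAL_FACTOR = 10.0


def curve_year(compute_pf: float) -> float:
    """Step 1: year at which the 10x/year curve reaches this compute level."""
    return ANCHOR_YEAR + math.log(compute_pf / ANCHOR_PF) / math.log(ANNUAL_FACTOR)


def humans_can_evaluate(review_seconds: float, decision_window_seconds: float) -> bool:
    """Step 3: the Move 37 test as a binary gate.

    Assumed proxy: humans pass only if they can review an output
    within the time available before it takes effect.
    """
    return review_seconds <= decision_window_seconds


def infrastructure_demand(discretionary: float, defensive_floor: float,
                          downturn_cut: float) -> float:
    """Step 5: only discretionary demand shrinks in a downturn; the floor stays."""
    return discretionary * (1.0 - downturn_cut) + defensive_floor


# Hypothetical inputs, for illustration only.
print(f"curve position: {curve_year(2e6):.1f}")
print(f"human evaluation sufficient: {humans_can_evaluate(3600, 60)}")
print(f"demand in a 50% downturn: {infrastructure_demand(100.0, 30.0, 0.5)}")
```

Treating the defensive floor as a separate additive term is the design choice that matters here: it keeps baseload demand from being scaled down along with discretionary spend when modeling a downturn.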

Examples

3 cases
DeepMind Atari AI — the founding scaling insight (2013)

DeepMind trained an AI on raw Atari game pixel data with only the game score as a reward signal — no human-provided strategies, no game knowledge. The system invented the tunnel-behind-the-wall strategy in Breakout, a technique human players had not systematically identified. This was the first empirical demonstration that scaled training on raw observational data produces genuinely novel strategic knowledge, not just recombination of human-provided patterns.

Outcome: Established the core scaling insight: sufficient compute applied to raw sensory data produces emergent knowledge that exceeds the designer's anticipation. This founding experiment shaped DeepMind's decade-long compute scaling bet.
AlphaGo Move 37 — the Move 37 inflection (2016)

In Game 2 of AlphaGo's 2016 match against Lee Sedol, the AI played Move 37, a stone placement that commentators initially called a mistake. Human Go experts needed time after the game to recognize the move's strategic depth. The AI had generated a move that human experts could not evaluate in real time, even though they knew the game was against the world champion and were expecting surprising play.

Outcome: Marked the empirical moment where AI capability crossed the Move 37 threshold: generating knowledge that humans cannot evaluate in real time. This inflection point changes what safety evaluation can achieve, because you cannot test for strategies you cannot imagine.
Open-source GPT-3 reproduction — democratization timeline

GPT-3, which required massive compute clusters when originally released by OpenAI, was reproduced in fully open-source form at 60–70x smaller scale within approximately 2–3 years. By 2023, GPT-3 class capability was 'completely freely available on the web.' This demonstrates the democratization timeline: frontier capability at year zero becomes consumer-hardware accessible within 2–3 years, regardless of initial access restrictions.

Outcome: Established the proliferation timeline for AI capability: 2–3 years from frontier to freely downloadable. This timeline is faster than any regulatory response cycle, which is the core structural problem for the Containment Problem framework.

Common mistakes

4 traps
Predicting specific capability timelines from current benchmarks
Benchmark performance curves within a capability tier extrapolate reliably, but the emergence of qualitatively new capabilities at higher scale levels is not predictable from current benchmarks. The researchers who built the image generation systems did not predict that the same method would produce language models at scale. Confident specific capability predictions — 'AGI by [year]' — are not warranted by the scaling evidence.
Treating language models as a different kind of system from image generators
The intuition that language is 'a different abstract space of ideas' led researchers (including Suleyman) to expect that the image generation training method would not transfer to language. It did. The lesson: modality boundaries are less fundamental than they appear. Assuming a given AI method cannot transfer to a new domain because the domain seems qualitatively different is a recurrent error in scaling history.
Evaluating AI safety only with human evaluators
Once AI systems reach Move 37 capability — generating strategies and knowledge that human evaluators cannot assess in real time — human-only safety evaluation becomes a lagging process. Red teams and testers cannot find failure modes they cannot imagine. Safety evaluation at frontier scale requires AI-vs-AI mechanisms, not just human review.
Anchoring AI capability assessment to the current snapshot
Current AI capability is a lagging indicator of the compute trajectory. The system that exists today was trained at a compute level that is already 10x below the current frontier. Risk assessments, governance frameworks, and investment theses that are anchored to current capability, without accounting for the scaling trajectory, will be outdated within 12–18 months.
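The snapshot lag in the last trap can be put in numbers. The sketch below, assuming the 10x-per-year trend holds steadily, estimates how far frontier compute has moved past a system trained a given number of months ago. `frontier_gap` is a hypothetical helper name.

```python
def frontier_gap(months_since_training: float, annual_factor: float = 10.0) -> float:
    """Multiple by which frontier compute exceeds a system trained this long ago.

    Assumes a constant annual growth factor (10x per year, as in the text).
    """
    return annual_factor ** (months_since_training / 12)


for months in (6, 12, 18):
    print(f"{months:>2} months old: frontier is ~{frontier_gap(months):.1f}x ahead")
```

Under this assumption a one-year-old system sits 10x behind the frontier and an 18-month-old one roughly 32x behind, which is the scale of gap the 12–18 month warning refers to.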

Origin story

How this framework came to be

Suleyman was present at the founding experiments that established the scaling intuition. In 2013, DeepMind trained an AI on raw Atari game pixel data with only a reward signal (the game score). The system invented strategies for Breakout — including a tunnel-behind-the-wall strategy — that human players hadn't noticed. This was the first empirical demonstration that scaled training on raw data could produce genuinely novel knowledge, not just pattern-matching on human-generated strategies.

The AlphaGo project deepened the insight. Move 37 in Game 2 of the 2016 match against Lee Sedol, a move commentators initially thought was a mistake, turned out to have such strategic depth that humans could not evaluate it in real time. This was the moment AI moved from 'executing human knowledge' to 'generating new knowledge humans cannot evaluate.' Suleyman was running DeepMind's applied AI division through this period, giving him direct observational access to the transition from narrow task performance to emergent strategic reasoning.

Source

Traced to primary
Source · PODCAST
CEO Of Microsoft AI: AI Is Becoming More Dangerous And Threatening!
Mustafa Suleyman · 2023
