INNOVATION · Days to result

Test-Time Compute Reasoning

Let AI think longer before answering to dramatically improve results

Problem it solves

Stagnant innovation

Best for

Knowledge workers, researchers, and professionals who need AI to tackle complex analytical tasks requiring multi-step reasoning or expert-level judgment.

Not ideal for

Simple factual queries or tasks where speed matters more than depth. Test-time compute adds latency and cost unnecessary for straightforward questions.

Overview

Why this framework exists

Test-Time Compute Reasoning is an approach in which AI models spend time thinking, largely out of the user's view, before responding, mimicking the way people reason through a problem. Unlike standard models that begin producing their final answer immediately, these models allocate additional computational resources at inference time to work through complex problems step by step before producing an output.
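
To build intuition for why extra computation at inference time can improve answers, here is a toy model. It does not describe how o1 works internally, which the source does not cover. It swaps in a much simpler technique, majority voting over repeated independent attempts at a two-option question, and the function name and the 60% single-attempt accuracy are assumptions chosen for illustration.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_samples: int) -> float:
    """Chance that a strict majority of independent attempts is correct.

    Toy assumptions (not from the source): the question has two possible
    answers, each attempt is correct with probability p_single, and
    attempts are independent. n_samples must be odd to avoid ties.
    """
    if n_samples < 1 or n_samples % 2 == 0:
        raise ValueError("n_samples must be a positive odd number")
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    smallest_majority = n_samples // 2 + 1
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_samples, k) * p_single**k * (1 - p_single) ** (n_samples - k)
        for k in range(smallest_majority, n_samples + 1)
    )


if __name__ == "__main__":
    # More attempts means more inference-time compute per question.
    for n in (1, 3, 5, 11, 51):
        print(f"{n:>2} attempts: {majority_vote_accuracy(0.6, n):.3f}")
```

Under these assumptions accuracy rises with the number of attempts, but each attempt also costs time and money. This is the same trade-off that makes reasoning models a poor fit for simple lookups.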

This approach matters because it moves AI from a pattern-matching tool toward something closer to a deliberate reasoner. When tested on identifying errors in academic papers, o1 spotted a multiplication error on page seven that peer reviewers had missed: a paper claiming that black plastic utensils leached harmful compounds had miscalculated the dosage by a factor of ten. A Harvard and Stanford working paper concluded that o1-preview demonstrates superhuman performance in differential diagnosis and clinical reasoning.

The practical implication is that how you use AI matters as much as which model you use. Giving AI time and space to reason produces fundamentally different outputs than demanding instant responses, and knowing when to apply this approach versus when speed suffices becomes a critical skill.

Core principles

4 total
  1. Giving AI time to think before answering produces fundamentally better results than instant responses.
  2. Only domain experts can evaluate whether AI reasoning is correct.
  3. The model does not have to get proofs right to be useful; it just has to help us be better researchers.
  4. Test-time compute is most valuable for complex, multi-step problems where pattern matching fails.

Steps

3 steps
  1. Identify Reasoning-Intensive Tasks
    Audit your workflow to identify tasks that require multi-step reasoning, error detection, or synthesis across multiple sources of information. These are the tasks where test-time compute reasoning delivers the greatest advantage over standard AI. Examples include reviewing analyses for logical errors, diagnosing complex problems with multiple variables, or generating novel approaches to research questions.
    Pro tip: Tasks where you yourself need to think carefully before answering are the ones where reasoning models will help most.
  2. Structure Problems for Deep Reasoning
    Frame your prompts to invite deliberate reasoning rather than quick answers. Provide relevant context, specify the type of analysis needed, and explicitly ask the model to work through its reasoning before concluding. The quality of reasoning output depends heavily on how the problem is presented. Vague prompts produce vague reasoning, while well-structured problems with clear constraints produce rigorous analysis.
    Pro tip: Ask the model to identify potential errors or weaknesses in its own reasoning as a final step.
    Warning: Reasoning models cost more and take longer. Do not use them for simple lookups or straightforward tasks.
  3. Validate with Domain Expertise
    Always have qualified domain experts evaluate the AI reasoning output. As Mollick emphasizes, only experts can assess whether the reasoning is correct. The AI may generate interesting approaches that contain errors. These can still be valuable for spurring further research or identifying new angles, but they must be verified. Treat AI reasoning as a collaborator that proposes ideas, not an oracle that delivers truth.
    Pro tip: A Wharton colleague noted that the model does not have to get proofs right to be useful; it just has to help us be better researchers.
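
Step 2 can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed template: the class name `ReasoningPrompt`, its fields, and the exact wording of the instructions are all hypothetical. It assembles the elements the step calls for (context, the type of analysis, constraints, an explicit request to reason before concluding) and includes the self-critique from the pro tip.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningPrompt:
    """Hypothetical container for the parts of a well-structured prompt."""

    task: str
    context: str
    analysis_type: str
    constraints: list[str] = field(default_factory=list)
    self_check: bool = True  # ask the model to critique its own reasoning

    def render(self) -> str:
        """Assemble the parts into one plain-text prompt."""
        parts = [
            f"Context:\n{self.context}",
            f"Task:\n{self.task}",
            f"Type of analysis needed:\n{self.analysis_type}",
        ]
        if self.constraints:
            bullets = "\n".join(f"- {item}" for item in self.constraints)
            parts.append(f"Constraints:\n{bullets}")
        parts.append(
            "Work through your reasoning step by step before stating a conclusion."
        )
        if self.self_check:
            parts.append(
                "As a final step, list potential errors or weaknesses "
                "in your own reasoning."
            )
        return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = ReasoningPrompt(
        task="Check every calculation in the attached methods section.",
        context="A draft paper estimating daily exposure to a compound.",
        analysis_type="Error detection in arithmetic and unit conversions",
        constraints=["Quote the exact line where any error occurs."],
    )
    print(prompt.render())
```

The rendered text can be sent to whichever reasoning model you use. Per Step 3, a qualified domain expert should still review the output.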

Examples

2 cases
OpenAI o1 Catching Academic Paper Errors

When tested on identifying errors in a published academic paper about black plastic utensils leaching harmful compounds, OpenAI's o1 model spotted a multiplication error on page seven that professional peer reviewers had completely missed. The paper had miscalculated the dosage by a factor of ten, dramatically overstating the health risk.

Outcome: The AI identified a factor-of-ten mathematical error that passed through the entire peer review process undetected.
Ethan Mollick, One Useful Thing, 2024
Harvard-Stanford Clinical Reasoning Study

A working paper by researchers at Harvard and Stanford concluded that o1-preview demonstrated superhuman performance in differential diagnosis, diagnostic clinical reasoning, and management reasoning. While not replacing doctors, the study suggested AI as a powerful second opinion tool for medical professionals facing complex diagnostic challenges.

Outcome: According to the working paper, the AI achieved superhuman performance on clinical diagnostic reasoning benchmarks.
Harvard and Stanford working paper, cited by Ethan Mollick, 2024

Common mistakes

2 traps
Trusting AI Reasoning Without Expert Review
Reasoning models produce confident, well-structured arguments that can contain subtle errors. In one case, a request for a mathematical proof yielded interesting approaches that nonetheless contained errors. Without domain expertise to evaluate the output, you risk acting on plausible-sounding but incorrect reasoning, which is potentially dangerous in fields like medicine, law, or finance.
Using Reasoning Models for Simple Tasks
Test-time compute adds latency and cost. Using a reasoning model to answer a simple factual question is like hiring a team of consultants to pick a lunch restaurant. Match the tool to the task complexity and save reasoning models for problems that actually benefit from multi-step deliberation.
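
Matching the tool to the task can be expressed as a simple routing rule. The sketch below is one possible encoding, using assumed names throughout: the `Task` fields mirror the task traits this framework lists (multi-step reasoning, error detection, synthesis across sources), and the labels "reasoning" and "standard" are placeholders rather than real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a task, filled in by whoever triages it."""

    description: str
    multi_step: bool = False
    needs_error_detection: bool = False
    synthesizes_sources: bool = False
    latency_sensitive: bool = False  # speed matters more than depth


def choose_mode(task: Task) -> str:
    """Return "reasoning" only when deliberation is likely to pay off."""
    benefits_from_reasoning = (
        task.multi_step or task.needs_error_detection or task.synthesizes_sources
    )
    # Simple or speed-critical tasks stay on the cheaper, faster path.
    if not benefits_from_reasoning or task.latency_sensitive:
        return "standard"
    return "reasoning"


if __name__ == "__main__":
    lookup = Task("What year was the company founded?")
    review = Task(
        "Review this analysis for logical errors.",
        multi_step=True,
        needs_error_detection=True,
    )
    print(lookup.description, "->", choose_mode(lookup))
    print(review.description, "->", choose_mode(review))
```

A rule this coarse will misroute some tasks. Its purpose is to make the default explicit: use the fast path unless the task shows signs of needing deliberation.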

Origin story

How this framework came to be

The test-time compute approach emerged with OpenAI's o1 model series, released in late 2024, which departed from the standard approach of making models bigger or training them on more data. Instead, o1 models were designed to spend time thinking before answering, allocating additional computation during inference rather than only during training. Ethan Mollick documented how this enabled the model to catch a factor-of-ten mathematical error in an academic paper about plastic utensils that professional peer reviewers had missed, and how Harvard and Stanford researchers found that o1-preview demonstrated superhuman diagnostic reasoning capabilities.

Source

Traced to primary
Source · ESSAY
What Just Happened
Ethan Mollick · 2024