Weights vs. Context Decision Rule

Know when data belongs in model weights vs. retrieval to cut AI inference costs by up to 100x

AI architecture, LLM cost optimization, fine-tuning, RAG, inference, SLM

Problem it solves

Teams overpay on AI inference by routing stable historical knowledge through expensive context windows when it could be embedded in model weights at training time.

Best for

Technical founders, AI engineers, and CTOs designing AI pipelines who need a clear architectural decision rule for when to fine-tune vs. use retrieval-augmented generation.

Not ideal for

Teams whose data changes hourly or where virtually all knowledge is real-time, making periodic fine-tuning impractical regardless of cost.

Overview

Why this framework exists

This decision rule separates company knowledge into two buckets by mutability. Stable, historical knowledge—company history, policies, processes, past decisions—should be trained into model weights so it can be recalled instantly at near-zero token cost. Dynamic, real-time information—live prices, recent news, same-day events—should be fetched at inference via tool calls, web search, or RAG. Misrouting stable knowledge through context windows inflates token usage by up to 100x and adds retrieval latency on every query. Applying this rule to your top query patterns systematically reduces inference spend and speeds up agentic workflows without sacrificing accuracy on genuinely dynamic tasks.

Core principles

Stable historical knowledge belongs in model weights—not in context windows
Dynamic real-time information belongs in retrieval pipelines—not in model weights
Every token pulled into context costs money; knowledge in weights adds no context tokens at inference time
Routing decisions should be made at design time based on how frequently the knowledge changes
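
The principles above reduce to a single design-time comparison: how often does the knowledge change, relative to how often you retrain? A minimal sketch of that rule follows. The names `KnowledgeSource`, `change_interval_days`, and `training_cadence_days`, and all example values, are illustrative assumptions rather than part of the original framework.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    WEIGHTS = "weights"      # train into the model
    RETRIEVAL = "retrieval"  # fetch at inference (tool call, web search, RAG)


@dataclass(frozen=True)
class KnowledgeSource:
    name: str
    # Hypothetical field: typical number of days between meaningful updates.
    change_interval_days: float


def route(source: KnowledgeSource, training_cadence_days: float) -> Route:
    """Send knowledge to weights only if it changes more slowly than we retrain."""
    if source.change_interval_days > training_cadence_days:
        return Route.WEIGHTS
    return Route.RETRIEVAL


# Example values are assumptions chosen for illustration only.
sources = [
    KnowledgeSource("company history", change_interval_days=365),
    KnowledgeSource("internal policies", change_interval_days=180),
    KnowledgeSource("live prices", change_interval_days=0.01),
]
plan = {s.name: route(s, training_cadence_days=90) for s in sources}
```

With an assumed 90-day retraining cadence, company history and internal policies route to weights, while live prices stay in retrieval.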

Steps

List your most frequent AI query types
Document the top 10–20 queries or workflows your AI system handles, ranked by frequency and monthly token cost. This prioritized list is your starting point for the weights vs. retrieval decision.
Pro tip: Pull this directly from your API billing logs, sorted by token cost descending, so the highest-impact candidates surface first.
Classify each query as stable or dynamic
For each query type, ask: 'Would the correct answer change week to week?' Historical company data, internal policies, and past decisions are stable. News, live prices, and recent events are dynamic.
Pro tip: A useful heuristic: if the answer would have been the same six months ago, it is stable and belongs in weights.
Calculate retrieval cost for stable queries
For each stable-knowledge query currently answered via context retrieval, log the average token count per run and multiply by monthly frequency. This total is avoidable waste you can eliminate through fine-tuning.
Warning: Do not estimate. Measure actual token usage from your billing dashboard. Estimates are typically 3–5x off from reality.
Train stable knowledge into model weights
Prioritize fine-tuning or post-training your model on the stable-knowledge categories with the highest query frequency and token cost. After training, those queries no longer require context retrieval steps.
Pro tip: Start with the single highest-cost stable knowledge category. A focused fine-tuning run often pays for itself within the first week of deployment.
Maintain retrieval pipelines only for dynamic data
Keep tool calls, web search, and RAG pipelines active only for knowledge that changes faster than your training cadence—live market data, breaking news, and same-day events.
Warning: If you route dynamic data into weights, the model will answer confidently with outdated information, which quickly erodes user trust.
Benchmark token usage and latency after rerouting
Re-run your baseline queries after fine-tuning and compare token count, response latency, and accuracy against the pre-training baseline. Use results to identify the next batch of stable queries to migrate.
Pro tip: Expect 50–100x token reduction on formerly retrieval-heavy stable queries and 5–10x latency improvement on those query types.
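
The before/after comparison in the final step can be sketched as a small helper. This is an illustrative outline, not a prescribed benchmark harness: `RunStats`, `compare`, the `max_accuracy_drop` tolerance, and the sample numbers are all assumptions, and the inputs would come from your own measured runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    avg_tokens: float      # average tokens consumed per query
    avg_latency_s: float   # average response latency in seconds
    accuracy: float        # fraction correct on a fixed evaluation set (0.0 to 1.0)


def compare(before: RunStats, after: RunStats, max_accuracy_drop: float = 0.01) -> dict:
    """Summarize the change between the retrieval baseline and the rerouted run."""
    return {
        "token_reduction_x": before.avg_tokens / after.avg_tokens,
        "latency_improvement_x": before.avg_latency_s / after.avg_latency_s,
        "accuracy_delta": after.accuracy - before.accuracy,
        # Flag the migration only if accuracy stayed within the allowed tolerance.
        "accuracy_ok": after.accuracy >= before.accuracy - max_accuracy_drop,
    }


# Sample measurements are invented for illustration.
baseline = RunStats(avg_tokens=40_000, avg_latency_s=12.0, accuracy=0.92)
rerouted = RunStats(avg_tokens=500, avg_latency_s=1.5, accuracy=0.93)
report = compare(baseline, rerouted)
```

Checking accuracy alongside tokens and latency matters here: a large token reduction is not a win if the fine-tuned model answers the same queries less correctly.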

Checklist

List your top 10–20 most frequent AI query patterns
Classify each as stable/historical or dynamic/real-time
Measure current token cost for each stable-knowledge query type
Flag all stable-knowledge queries as candidates for weight-training
Retain retrieval pipelines only for genuinely dynamic information
Measure token usage and latency before and after rerouting to confirm improvement

Examples

Abraham Lincoln vs. Coachella: The Core Analogy

Asked about Abraham Lincoln, a well-trained model answers instantly because Lincoln's history is already in its weights—no search required. Asked about Madonna's live performance at Coachella last weekend, the same model must trigger a web search because that event post-dates its training data. The performance difference is stark: the weights-based answer is sub-second; the retrieval-based answer may take 30–60 seconds. This contrast defines precisely when each architectural layer should be used.

Outcome: Sub-second response for in-weights knowledge vs. 30–60 second retrieval for dynamic data, illustrating the full cost and speed gap between the two approaches.

Uber Growth History: $1,500 vs. Near-Zero Cost

If Uber's growth story from 2019 to 2024 is not in the model's weights, answering 'How did we grow from 2019 to 2024?' requires reading every quarterly report and earnings call from that period. This retrieval process can take over two hours and cost up to $1,500 in tokens per run. The same query on a model trained on that history resolves instantly at near-zero marginal token cost because the answer is already embedded in the weights.

Outcome: Estimated 99%+ cost reduction and near-instantaneous response time by moving historical company data from context retrieval into trained model weights.
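
To see how a figure like this translates into a payback period, here is a rough break-even sketch. Only the $1,500 per-run retrieval cost comes from the example above. The training cost and the per-run cost of the trained model are assumptions invented for illustration, as are the function names.

```python
import math


def break_even_runs(training_cost_usd: float,
                    retrieval_cost_per_run_usd: float,
                    weights_cost_per_run_usd: float) -> int:
    """Smallest number of runs whose cumulative savings cover the training cost."""
    saving_per_run = retrieval_cost_per_run_usd - weights_cost_per_run_usd
    if saving_per_run <= 0:
        raise ValueError("retrieval must cost more per run than the trained model")
    return math.ceil(training_cost_usd / saving_per_run)


def cost_reduction_pct(retrieval_cost_per_run_usd: float,
                       weights_cost_per_run_usd: float) -> float:
    """Per-run cost reduction, as a percentage of the retrieval cost."""
    return 100 * (1 - weights_cost_per_run_usd / retrieval_cost_per_run_usd)


RETRIEVAL_COST_USD = 1_500.0   # upper bound quoted in the example
WEIGHTS_COST_USD = 1.0         # assumption: "near-zero" marginal cost per run
TRAINING_COST_USD = 30_000.0   # assumption: one-off fine-tuning cost

runs = break_even_runs(TRAINING_COST_USD, RETRIEVAL_COST_USD, WEIGHTS_COST_USD)
reduction = cost_reduction_pct(RETRIEVAL_COST_USD, WEIGHTS_COST_USD)
```

Under these assumed numbers, training pays for itself after 21 runs of the query, and the per-run reduction is above 99%. Substitute your own measured costs before drawing conclusions.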

Common mistakes

Training real-time data into weights

Model weights cannot update fast enough for data that changes daily or hourly. Training live prices, news feeds, or same-day events into weights produces a model that answers with confidently wrong outdated information. Keep all real-time data in retrieval pipelines.

Defaulting all knowledge to RAG

Using retrieval-augmented generation for everything—including stable historical knowledge—wastes tokens and adds latency on every query. Historical knowledge with no freshness requirement should be trained into weights, not retrieved at runtime on every call.

Not measuring the cost baseline first

Without tracking token costs per query type before rerouting, teams cannot quantify savings, identify highest-impact targets, or make the business case to invest in fine-tuning. Always measure before you optimize.
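
One way to build that baseline is to aggregate logged token usage per query type and rank by cost. The sketch below assumes a simple log of `(query_type, tokens)` rows and a single blended token price. Both are hypothetical, and real billing exports will have a different shape and pricing structure.

```python
from collections import defaultdict

# Hypothetical log rows: (query_type, tokens_used_in_one_run)
usage_log = [
    ("company_history", 120_000),
    ("company_history", 100_000),
    ("policy_lookup", 8_000),
    ("live_pricing", 3_000),
    ("policy_lookup", 12_000),
]

PRICE_PER_1K_TOKENS_USD = 0.01  # assumed blended price, for illustration only


def baseline_by_query_type(rows, price_per_1k):
    """Aggregate runs, average tokens, and cost per query type, costliest first."""
    totals = defaultdict(lambda: {"runs": 0, "tokens": 0})
    for query_type, tokens in rows:
        totals[query_type]["runs"] += 1
        totals[query_type]["tokens"] += tokens

    report = [
        {
            "query_type": query_type,
            "runs": t["runs"],
            "avg_tokens": t["tokens"] / t["runs"],
            "cost_usd": t["tokens"] / 1000 * price_per_1k,
        }
        for query_type, t in totals.items()
    ]
    return sorted(report, key=lambda r: r["cost_usd"], reverse=True)


baseline = baseline_by_query_type(usage_log, PRICE_PER_1K_TOKENS_USD)
```

The first entries of `baseline` are the highest-cost query types; those classified as stable are the first candidates for fine-tuning.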

Origin story

How this framework came to be

Adapted from an analogy used by Josh Cerot (Aragon CEO) on This Week in Startups (E2278), contrasting Abraham Lincoln knowledge (in weights) with live Coachella updates (tool calls) to explain where different types of company knowledge belong architecturally.

Source

Source · VIDEO

Why Your Company Should Own Its AI Model | E2278 — This Week in Startups

This Week in Startups · 2026
