Weights vs. Context Decision Rule
Know when data belongs in model weights vs. retrieval to cut AI inference token costs by up to 100x
This decision rule separates company knowledge into two buckets by mutability. Stable, historical knowledge—company history, policies, processes, past decisions—should be trained into model weights so it can be recalled instantly at near-zero token cost. Dynamic, real-time information—live prices, recent news, same-day events—should be fetched at inference via tool calls, web search, or RAG. Misrouting stable knowledge through context windows inflates token usage by up to 100x and adds retrieval latency on every query. Applying this rule to your top query patterns systematically reduces inference spend and speeds up agentic workflows without sacrificing accuracy on genuinely dynamic tasks.
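The rule above can be sketched as a small design-time classifier. This is a minimal illustration, not a definitive implementation: the `KnowledgeSource` type, the `route` function, the 90-day training cadence, and the change intervals are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Route(Enum):
    WEIGHTS = "weights"      # train into the model
    RETRIEVAL = "retrieval"  # fetch at inference (tool call, web search, RAG)


@dataclass(frozen=True)
class KnowledgeSource:
    name: str
    change_interval: timedelta  # how often the correct answer typically changes


def route(source: KnowledgeSource, training_cadence: timedelta) -> Route:
    """Send knowledge that changes faster than the retraining cadence to retrieval."""
    if source.change_interval < training_cadence:
        return Route.RETRIEVAL
    return Route.WEIGHTS


# Hypothetical example values, not measurements.
cadence = timedelta(days=90)
sources = [
    KnowledgeSource("company history", timedelta(days=365 * 5)),
    KnowledgeSource("internal policies", timedelta(days=365)),
    KnowledgeSource("live prices", timedelta(minutes=1)),
    KnowledgeSource("same-day events", timedelta(hours=1)),
]
routing = {s.name: route(s, cadence) for s in sources}
for name, decision in routing.items():
    print(f"{name}: {decision.value}")
```

Comparing against the training cadence, rather than a fixed threshold, keeps the decision tied to how often the model can realistically be refreshed.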
- Stable historical knowledge belongs in model weights—not in context windows
- Dynamic real-time information belongs in retrieval pipelines—not in model weights
- Every token pulled into context costs money; knowledge already in weights adds near-zero marginal token cost at inference time
- Routing decisions should be made at design time based on how frequently the knowledge changes
- List your most frequent AI query types
  Document the top 10–20 queries or workflows your AI system handles, ranked by frequency and monthly token cost. This prioritized list is your starting point for the weights vs. retrieval decision.
  Pro tip: Pull this directly from your API billing logs, sorted by token cost descending, to surface the highest-impact candidates first.
- Classify each query as stable or dynamic
  For each query type, ask: 'Would the correct answer change week to week?' Historical company data, internal policies, and past decisions are stable. News, live prices, and recent events are dynamic.
  Pro tip: A useful heuristic: if the answer would have been the same six months ago, it is stable and belongs in weights.
- Calculate retrieval cost for stable queries
  For each stable-knowledge query currently answered via context retrieval, log the average token count per run and multiply by monthly frequency. This total is avoidable waste you can eliminate through fine-tuning.
  Warning: Do not estimate; measure actual token usage from your billing dashboard. Estimates are typically 3–5x off from reality.
- Train stable knowledge into model weights
  Prioritize fine-tuning or post-training your model on the stable-knowledge categories with the highest query frequency and token cost. After training, those queries no longer require context retrieval steps.
  Pro tip: Start with the single highest-cost stable knowledge category. A focused fine-tuning run often pays for itself within the first week of deployment.
- Maintain retrieval pipelines only for dynamic data
  Keep tool calls, web search, and RAG pipelines active only for knowledge that changes faster than your training cadence, such as live market data, breaking news, and same-day events.
  Warning: If you route dynamic data into weights, the model will answer with confidently wrong, outdated information, destroying user trust quickly.
- Benchmark token usage and latency after rerouting
  Re-run your baseline queries after fine-tuning and compare token count, response latency, and accuracy against the baseline recorded before fine-tuning. Use the results to identify the next batch of stable queries to migrate.
  Pro tip: Expect 50–100x token reduction on formerly retrieval-heavy stable queries and 5–10x latency improvement on those query types.
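The measurement steps above (list, classify, compute retrieval cost, benchmark) can be sketched as follows. This is a rough sketch under stated assumptions: the `QueryType` record, the helper names, and every number below are hypothetical placeholders; in practice the token counts and run frequencies would come from your own billing logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryType:
    name: str
    stable: bool             # would the answer be the same week to week?
    avg_context_tokens: int  # measured average retrieval tokens per run
    runs_per_month: int


def monthly_retrieval_tokens(q: QueryType) -> int:
    """Average tokens per run multiplied by monthly frequency."""
    return q.avg_context_tokens * q.runs_per_month


def migration_candidates(queries: list[QueryType]) -> list[QueryType]:
    """Stable queries only, sorted by avoidable monthly tokens, largest first."""
    stable = [q for q in queries if q.stable]
    return sorted(stable, key=monthly_retrieval_tokens, reverse=True)


def reduction_factor(tokens_before: float, tokens_after: float) -> float:
    """Token reduction measured by re-running the baseline after fine-tuning."""
    if tokens_after <= 0:
        raise ValueError("tokens_after must be positive")
    return tokens_before / tokens_after


# Hypothetical placeholder data, not real measurements.
queries = [
    QueryType("policy lookup", stable=True, avg_context_tokens=40_000, runs_per_month=500),
    QueryType("company history", stable=True, avg_context_tokens=120_000, runs_per_month=300),
    QueryType("live price check", stable=False, avg_context_tokens=2_000, runs_per_month=10_000),
]
candidates = migration_candidates(queries)
for q in candidates:
    print(f"{q.name}: {monthly_retrieval_tokens(q):,} avoidable tokens/month")
```

Dynamic queries are excluded from the ranking even when their token volume is high, because they stay on retrieval regardless of cost.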
Asked about Abraham Lincoln, a well-trained model answers instantly because Lincoln's history is already in its weights—no search required. Asked about Madonna's live performance at Coachella last weekend, the same model must trigger a web search because that event post-dates its training data. The performance difference is stark: the weights-based answer is sub-second; the retrieval-based answer may take 30–60 seconds. This contrast defines precisely when each architectural layer should be used.
If Uber's growth story from 2019 to 2024 is not in the model's weights, answering 'How did we grow from 2019 to 2024?' requires reading every quarterly report and earnings call from that period. This retrieval process can take over two hours and cost up to $1,500 in tokens per run. The same query on a model trained on that history resolves instantly at near-zero marginal token cost because the answer is already embedded in the weights.
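To make the trade-off concrete, a break-even calculation might look like the sketch below. Only the $1,500 per-run retrieval figure comes from the example above; the fine-tuning cost and the post-training per-run cost are hypothetical inputs you would replace with your own numbers.

```python
import math


def break_even_runs(fine_tune_cost: float,
                    cost_per_run_before: float,
                    cost_per_run_after: float) -> int:
    """Number of runs after which a one-time fine-tuning cost is recovered."""
    savings_per_run = cost_per_run_before - cost_per_run_after
    if savings_per_run <= 0:
        raise ValueError("fine-tuning does not reduce the per-run cost")
    return math.ceil(fine_tune_cost / savings_per_run)


# $1,500 per run is the upper bound quoted in the example above.
# The $30,000 fine-tuning cost and $0.50 post-training cost are hypothetical.
runs_needed = break_even_runs(
    fine_tune_cost=30_000,
    cost_per_run_before=1_500,
    cost_per_run_after=0.50,
)
print(f"Fine-tuning pays for itself after {runs_needed} runs")
```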
Extracted from This Week in Startups (E2278), from an analogy used by Josh Cerot (Aragon CEO) contrasting Abraham Lincoln knowledge (in weights) with live Coachella updates (tool calls) to explain where different types of company knowledge belong architecturally.