The pattern is consistent across AI-native businesses I have worked with or advised. Year one, the AI bill is a rounding error and nobody pays it much attention. Year two, the bill has grown by an order of magnitude and finance starts asking pointed questions. Year three, the AI line is the second-largest OpEx item after payroll and the conversation about cost discipline is happening at the executive level, six months later than it should have.
The bill grows for the same reason any infrastructure bill grows — usage scales faster than expected and the team did not instrument cost discipline until the cost became impossible to ignore. The discipline itself is not complicated, but it has to be built into the architecture, not bolted on afterwards. This piece is the working operator's framework I have built across several stacks and refined the hard way. None of it is theoretical; all of it is the product of looking at bills that were larger than expected and figuring out what to do about them.
Cost-per-task is the only metric that matters
The popular cost metric for AI work is cost-per-token, and it is misleading at the application layer. Tokens are a unit the inference vendor cares about; the team running the application cares about delivering useful outcomes. A task that solved the user's problem at one cost is not comparable to a task that delivered a worse outcome at half the cost; the right comparison is cost-per-acceptable-task, and that requires defining what acceptable means.
What separates teams that control their costs from teams that get surprised by their bill is the cost-per-task accounting layer. Every customer-meaningful outcome carries a task identifier; every model call, tool invocation, retry, validator, and re-prompt logs against that identifier; the rollup at end-of-day is by task identifier, not by call. The first time a team does this honestly, the conclusion is consistently the same: somewhere between thirty and sixty percent of model spend is on retries, validators, and re-prompts that the original architecture did not account for. The cost-per-call view hides this; the cost-per-task view exposes it.
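A minimal sketch of such an accounting layer, assuming an in-memory ledger and a hypothetical set of event kinds (`model_call`, `tool`, `retry`, `validator`, `re_prompt`). The class and field names are illustrative, not taken from any particular stack, and the dollar figures are made up for the demonstration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Event kinds treated as overhead rather than first-pass work.
# This split is an assumption of the sketch; adjust it to your orchestration layer.
OVERHEAD_KINDS = {"retry", "validator", "re_prompt"}


@dataclass(frozen=True)
class CostEvent:
    task_id: str     # identifier of the customer-meaningful outcome
    kind: str        # e.g. "model_call", "tool", "retry", "validator", "re_prompt"
    cost_usd: float  # cost attributed to this single event


class TaskLedger:
    """Collects cost events and rolls them up by task, not by call."""

    def __init__(self) -> None:
        self._events: list[CostEvent] = []

    def log(self, task_id: str, kind: str, cost_usd: float) -> None:
        self._events.append(CostEvent(task_id, kind, cost_usd))

    def rollup(self) -> dict[str, dict[str, float]]:
        """Return total cost and overhead cost per task identifier."""
        totals: dict[str, dict[str, float]] = defaultdict(
            lambda: {"total": 0.0, "overhead": 0.0}
        )
        for event in self._events:
            totals[event.task_id]["total"] += event.cost_usd
            if event.kind in OVERHEAD_KINDS:
                totals[event.task_id]["overhead"] += event.cost_usd
        return dict(totals)

    def overhead_share(self) -> float:
        """Fraction of all spend that went to retries, validators, and re-prompts."""
        total = sum(e.cost_usd for e in self._events)
        overhead = sum(e.cost_usd for e in self._events if e.kind in OVERHEAD_KINDS)
        return overhead / total if total else 0.0


# Illustrative events with made-up costs.
ledger = TaskLedger()
ledger.log("task-001", "model_call", 0.004)
ledger.log("task-001", "validator", 0.001)
ledger.log("task-001", "retry", 0.004)
ledger.log("task-002", "model_call", 0.004)

print(ledger.rollup())
print(f"overhead share: {ledger.overhead_share():.0%}")
```

In this toy data, a per-call view would show four inexpensive calls. The per-task view shows that more than half of the spend on `task-001` went to the validator and the retry.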
The architectural implication is that cost optimisation is mostly an architecture problem, not a prompt problem. The expensive failures live in the orchestration layer — tools that get called with bad arguments, retrievals that return wrong documents and trigger retries, validators that reject outputs that have to be regenerated. Fixing the architecture removes whole categories of cost; tuning the prompt at best optimises within a category.
Routing logic — the highest-leverage cost lever
The single most consequential cost decision in any AI stack is the routing layer. Where does each task land? On a small local model with tools? On an open-weights mid-size model with a hosted endpoint? On a mid-tier closed model? On a frontier closed model? The answer dominates everything else. The cost gap between lanes is between one and two orders of magnitude on routine tasks, and the dispatch decision is what allocates traffic across lanes.
The discipline that survives is task-type-aware routing with explicit lanes and explicit dispatch rules. Most production agent traffic is dominated by a small number of task types — intent classification, structured extraction, short-form generation, retrieval-grounded answer generation — and each of these has a right lane that is rarely the frontier closed model. The frontier closed model is the right answer on a narrow class of work: genuinely novel reasoning, very-long-context coherence, multimodal cross-reasoning. Outside those classes, the open-weights or local lane is the cost-correct answer and the quality difference is small to invisible.
| Task type | Right lane | Cost-per-1K-tasks (right lane) | Cost if defaulted to frontier |
|---|---|---|---|
| Intent classification | Local 7B with tools | $0.05 | $8.50 |
| Structured JSON extraction | Local 7B with grammar | $0.15 | $14.00 |
| Short reply drafting | Local 13B | $0.30 | $22.00 |
| Retrieval-grounded Q&A | Mid-size open-weights | $0.95 | $28.00 |
| Long-form summarisation | Mid-size open-weights | $3.20 | $45.00 |
| Multi-step research | Mid-size open-weights w/ tools | $9.50 | $140.00 |
| Genuinely novel reasoning | Frontier closed | $220.00 | $220.00 |
The pattern in the table is the pattern in the bill. A team that defaults to the frontier closed model for everything is paying a multiplier of roughly 14x to 170x on the routine rows above, where the right lane was much cheaper. That multiplier compounds over a year of traffic into bills that are large enough to have strategic consequences.
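As a rough illustration, task-type-aware routing can start as a plain lookup table. The sketch below copies the per-1K-task costs from the table above and computes the multiplier implied by each row. The task-type keys and lane names are hypothetical identifiers, and sending unknown task types to the frontier lane is an assumption of this sketch, not a rule from the text.

```python
# Per-1K-task costs in USD, copied from the table above.
# task type -> (right lane, right-lane cost, cost if defaulted to frontier)
ROUTES: dict[str, tuple[str, float, float]] = {
    "intent_classification": ("local-7b-tools", 0.05, 8.50),
    "json_extraction": ("local-7b-grammar", 0.15, 14.00),
    "short_reply": ("local-13b", 0.30, 22.00),
    "retrieval_qa": ("mid-open-weights", 0.95, 28.00),
    "long_summarisation": ("mid-open-weights", 3.20, 45.00),
    "multi_step_research": ("mid-open-weights-tools", 9.50, 140.00),
    "novel_reasoning": ("frontier-closed", 220.00, 220.00),
}

FRONTIER_LANE = "frontier-closed"


def route(task_type: str) -> str:
    """Pick the lane for a task type.

    Unknown task types escalate to the frontier lane. That fallback is an
    assumption of this sketch, chosen to protect quality over cost.
    """
    entry = ROUTES.get(task_type)
    return entry[0] if entry else FRONTIER_LANE


def frontier_multiplier(task_type: str) -> float:
    """How many times more a task type costs when defaulted to frontier."""
    _, right_cost, frontier_cost = ROUTES[task_type]
    return frontier_cost / right_cost


for task_type in ROUTES:
    print(
        f"{task_type:24s} -> {route(task_type):24s} "
        f"{frontier_multiplier(task_type):6.1f}x"
    )
```

Keeping the dispatch rules in data rather than in scattered conditionals makes the routing decision reviewable, and it gives the telemetry layer one place to read lane assignments from.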
The small-model-with-tools finding
The empirical result that has shifted the cost-discipline conversation more than any other in the last twelve months is this: a small open-weights model with structured tool use beats a large closed model used in isolation, on most production tasks, at a fraction of the cost. The mechanism is straightforward. The small model is a controller; the tools — a search index, a database query, a code executor, a structured-extraction subroutine — do the heavy lifting. The controller does not need frontier reasoning to decide which tool to invoke; it needs reliable function-calling and obedient structured output. Both are properties that 7B-13B open-weights models hit at production quality in 2026.
The cost implication is dramatic. The same task that costs $0.10 routed to a small local model with tools costs $5-15 routed to a frontier closed model with the same prompt. The quality difference, on the kinds of routine business workflows that dominate agent traffic, is small to invisible. Sometimes the small-model-with-tools answer is better, because the tools enforce structure that the frontier model would have to be reminded of in the prompt.
This shifts where the engineering effort should sit. Build the tools. Wire them carefully. Treat the small model as a dispatcher whose job is to route, not to reason. The reasoning lives in the tools and in the deterministic glue around them. Every dollar invested in tool quality compounds as the team migrates more workloads to the small-model-with-tools pattern.
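The controller pattern can be sketched as follows, under heavy assumptions. `fake_controller` stands in for the small model and simply emits a JSON tool call, and `lookup_order` and `search_docs` are hypothetical tools with canned results. The part that carries over to a real stack is the deterministic glue: parse the controller's output, validate it against the tool registry, and only then execute.

```python
import inspect
import json
from typing import Callable


# Hypothetical tools; real ones would query a database or a search index.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}


def search_docs(query: str) -> dict:
    return {"query": query, "hits": ["returns-policy.md"]}


TOOLS: dict[str, Callable[..., dict]] = {
    "lookup_order": lookup_order,
    "search_docs": search_docs,
}


def fake_controller(user_message: str) -> str:
    """Stand-in for the small model: returns a JSON tool call as text."""
    if "order" in user_message.lower():
        return json.dumps({"tool": "lookup_order", "args": {"order_id": "A-1001"}})
    return json.dumps({"tool": "search_docs", "args": {"query": user_message}})


def dispatch(raw_call: str) -> dict:
    """Validate the controller's output, then run the named tool."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        raise ValueError(f"controller output is not JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("controller output must be a JSON object")

    name = call.get("tool")
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")

    args = call.get("args", {})
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    tool = TOOLS[name]
    try:
        # Reject bad arguments before executing anything.
        inspect.signature(tool).bind(**args)
    except TypeError as exc:
        raise ValueError(f"bad arguments for {name}: {exc}") from exc
    return tool(**args)


result = dispatch(fake_controller("Where is my order?"))
print(result)
```

Rejecting a malformed call before execution is cheaper than discovering it through a failed tool run and a retry, which is where much of the orchestration overhead described earlier comes from.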
Caching — the discipline most teams skip
The most under-utilised cost lever in production AI stacks is caching. The pattern is consistent: teams cache HTTP responses, database queries, and computed results, but somehow do not cache model calls. The result is that every call to the model — even calls with identical inputs minutes apart — pays full inference cost.
The caching disciplines that matter, in rough order of impact:
- Response caching with input hashing. A model call with identical prompt, system prompt, and parameters returns the cached response. Sounds trivial; the hit rate on real production traffic is consistently 15-40% because real traffic has more repetition than teams realise.
- Prefix caching. When the system prompt and early context are identical across calls, the inference engine can cache the computed attention state and skip recomputation. Hosted endpoints expose this as automatic prompt caching; self-hosted inference engines like vLLM support it natively.
- Embedding caching. For a given model and version, embeddings are effectively deterministic; the same text yields the same vector. Caching embeddings prevents the team from re-embedding the same documents repeatedly during corpus operations.
- Tool-result caching. Many tool calls (search queries, knowledge-graph lookups, deterministic computations) produce the same result for the same input. Caching tool results avoids paying again for the same lookup each time the controller invokes the tool with the same arguments.
The combined effect of these four caching disciplines, on a representative production stack, is a 25-50% reduction in inference cost without any change to the application logic or quality. The implementation cost is modest. It is the lowest-hanging fruit in cost discipline, and the discipline most teams have not yet implemented.
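Response caching with input hashing, the first item in the list above, might look like the sketch below. It assumes an in-memory dictionary as the store and a hypothetical `call_model` callable standing in for the real inference client. Including the model name in the key is an extra assumption beyond the prompt, system prompt, and parameters named in the text.

```python
import hashlib
import json
from typing import Callable

# Signature of the hypothetical inference client:
# (model, system_prompt, prompt, params) -> response text
ModelFn = Callable[[str, str, str, dict], str]


def cache_key(model: str, system_prompt: str, prompt: str, params: dict) -> str:
    """Hash every input that can change the response into one stable key."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "prompt": prompt, "params": params},
        sort_keys=True,  # keeps the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Wraps a model client and serves repeated identical calls from memory."""

    def __init__(self, call_model: ModelFn) -> None:
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, system_prompt: str, prompt: str, params: dict) -> str:
        key = cache_key(model, system_prompt, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, system_prompt, prompt, params)
        self._store[key] = response
        return response


# Stand-in client that records how often it is actually invoked.
calls: list[str] = []


def fake_model(model: str, system_prompt: str, prompt: str, params: dict) -> str:
    calls.append(prompt)
    return f"answer to: {prompt}"


cache = ResponseCache(fake_model)
for question in ["reset password", "reset password", "billing date", "reset password"]:
    cache.complete("small-local", "You are a support agent.", question, {"temperature": 0})

print(f"hits={cache.hits} misses={cache.misses} paid_calls={len(calls)}")
```

A production version would likely use a shared store with expiry instead of a process-local dictionary. Reusing a response also only makes sense where serving the same answer twice is acceptable, for example with deterministic sampling settings.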
Prompt compression — the diminishing-returns lever
Prompt compression — making prompts shorter while preserving the behaviour they elicit — is the lever that most teams reach for first and that produces the smallest impact relative to architecture changes. The cost gain from cutting a prompt from 800 tokens to 400 tokens is real but bounded; the cost gain from routing the task to a different lane is an order of magnitude larger. Prompt compression is worth doing, but it is the polish, not the renovation.
That said, the disciplines that matter at the prompt-compression layer:
- Move static instructions to the system prompt where prefix caching can absorb them.
- Use structured output formats (JSON schemas, grammars) instead of long prose instructions describing the format.
- Strip examples that the model already handles correctly without them; keep only the examples that demonstrably change behaviour.
- Compress retrieved documents before they enter context — a summary often serves the answer better than the full passage, particularly when the answer space is bounded.
None of these are dramatic. Together, applied consistently, they trim 20-40% off prompt token usage on a typical stack. Combined with caching and routing, the cumulative effect is a stack where the cost per task is meaningfully lower than the headline rates would suggest.
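One way to act on the third item, stripping examples that do not change behaviour, is a greedy ablation against an evaluation score. The sketch below assumes a hypothetical `score` function that runs the prompt with a given set of examples against a held-out test set and returns a pass rate. Here it is faked so the block runs on its own.

```python
from typing import Callable, Sequence


def prune_examples(
    examples: Sequence[str],
    score: Callable[[Sequence[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Greedily drop few-shot examples whose removal does not lower the score."""
    kept = list(examples)
    baseline = score(kept)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if score(candidate) >= baseline - tolerance:
            kept = candidate  # example i was not changing behaviour; drop it
        else:
            i += 1  # removing it hurt the score; keep it and move on
    return kept


def fake_score(examples: Sequence[str]) -> float:
    """Stand-in eval: pretend only the two edge-case examples affect pass rate."""
    needed = {"edge: empty input", "edge: nested quotes"}
    return len(needed & set(examples)) / len(needed)


few_shot = [
    "basic: greeting",
    "edge: empty input",
    "basic: short question",
    "edge: nested quotes",
]
print(prune_examples(few_shot, fake_score))
```

Each candidate costs one evaluation run, so this is something to run occasionally against a stable test set rather than on every prompt change.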
Telemetry that actually changes behaviour
The cost telemetry that matters is the telemetry that gets surfaced to the people making engineering decisions. A daily cost dashboard that nobody looks at is not telemetry; it is a graveyard. The disciplines that compound:
- Cost per task type, surfaced weekly. The team should know, without having to ask, what each task type costs and how that has trended over the last quarter. Anomalies become visible early, before they become surprises.
- Cost per lane, surfaced weekly. If frontier-closed traffic is growing faster than open-weights traffic, that is a signal something has shifted in the routing layer or in the workload mix. Surface it before it surprises finance.
- Cost per failure mode. Retrieval misalignment that triggers re-prompts has a cost; hallucinated tool arguments that trigger replays have a cost; schema drift that triggers regeneration has a cost. Surface them separately so the engineering work targets the modes that dominate spend, not the modes that are easiest to fix.
- Cost rollup tied to outcome, not call. A pricey task that delivered the right answer is fine; a cheap task that delivered the wrong answer is not. Surface cost-per-acceptable-outcome, not cost-per-call, and the optimisation work targets the right metric.
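The last item can be made concrete with a small rollup. The sketch below assumes each finished task has already been marked acceptable or not, for instance by a validator or a human review step. The record fields, task-type names, and costs are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskRecord:
    task_type: str
    cost_usd: float   # full cost of the task, including retries and validators
    acceptable: bool  # whether the outcome met the team's definition of acceptable


def cost_per_acceptable_outcome(records: Iterable[TaskRecord]) -> dict[str, float]:
    """Total spend per task type divided by the number of acceptable outcomes."""
    spend: dict[str, float] = defaultdict(float)
    accepted: dict[str, int] = defaultdict(int)
    for record in records:
        spend[record.task_type] += record.cost_usd  # failed tasks still cost money
        accepted[record.task_type] += int(record.acceptable)
    return {
        task_type: (spend[task_type] / accepted[task_type])
        if accepted[task_type]
        else float("inf")  # money spent, nothing acceptable delivered
        for task_type in spend
    }


# Made-up data: one path is cheap per call but rarely acceptable.
records = [
    TaskRecord("draft_reply_cheap", 0.01, False),
    TaskRecord("draft_reply_cheap", 0.01, False),
    TaskRecord("draft_reply_cheap", 0.01, False),
    TaskRecord("draft_reply_cheap", 0.01, True),
    TaskRecord("draft_reply_pricey", 0.03, True),
    TaskRecord("draft_reply_pricey", 0.03, True),
]
print(cost_per_acceptable_outcome(records))
```

In this made-up data the path that looks three times cheaper per task turns out more expensive per acceptable outcome, which is the distinction a cost-per-call view hides.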
The chart below shows the consistent shape of where AI spend ends up landing in well-instrumented stacks at scale. The bar lengths are illustrative but the relative ranking is the pattern across teams I have compared notes with.
The shape is the lesson. Routine traffic dominates, retries and validators are the second-biggest line and the most addressable through architecture, tool overhead is meaningful and often invisible, and frontier reasoning is a small share of total spend even though it is the most expensive per call. Optimisation effort should follow the shape.
The OpEx trajectory and what to do about it
The honest projection for AI-native businesses is that the AI line will become the second-largest OpEx item after payroll within three to five years of the business reaching meaningful scale. This is not a doom prediction; it is the natural consequence of AI capability becoming cheaper per task while the number of tasks the business runs grows faster. The total spend grows even as the unit cost falls, which is the same pattern every infrastructure cost has followed in the cloud era.
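The arithmetic behind that pattern is simple compounding. The sketch below uses entirely hypothetical rates (unit cost falling 40% a year while task volume quadruples) to show total spend rising as cost per task drops. None of these figures come from real bills.

```python
def project_spend(
    unit_cost: float,
    volume: float,
    unit_cost_change: float,
    volume_growth: float,
    years: int,
) -> list[float]:
    """Yearly spend when unit cost and task volume compound at different rates."""
    spend = []
    for _ in range(years):
        spend.append(unit_cost * volume)
        unit_cost *= 1 + unit_cost_change  # e.g. -0.40 means 40% cheaper each year
        volume *= 1 + volume_growth        # e.g. 3.0 means volume quadruples each year
    return spend


# Hypothetical starting point: $0.01 per task, one million tasks in year one.
trajectory = project_spend(
    unit_cost=0.01,
    volume=1_000_000,
    unit_cost_change=-0.40,
    volume_growth=3.0,
    years=3,
)
for year, total in enumerate(trajectory, start=1):
    print(f"year {year}: ${total:,.0f}")
```

Under these assumed rates, spend multiplies by 0.6 × 4 = 2.4 each year. Total spend grows whenever volume growth outpaces the fall in unit cost.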
The teams that survive this comfortably are the ones that built the cost-discipline architecture early — the routing layer, the caching disciplines, the cost-per-task accounting, the telemetry that surfaces decisions to engineers. The teams that did not are the ones running emergency cost-reduction projects in year three, with all the disruption to feature development that implies.
The right mental model is that AI spend is going to be a permanent part of the OpEx structure of any business that is using it well, and that structure deserves the same architectural discipline as any other large infrastructure line. Treat it accordingly from the start. The cost compounds slowly and then suddenly; the discipline only works if it was there from the beginning.
AI cost discipline is not a finance function bolted on after the bills arrive; it is an architectural discipline that has to be designed in at the start. The high-leverage levers are routing, tools, and caching; the low-leverage levers are prompt compression and individual call tuning. The teams that get this right end up with bills that are predictable and that scale gracefully with traffic. The teams that get it wrong end up with bills that surprise their finance teams, often enough that the conversation about AI shifts from "how do we capture more value?" to "how do we stop the bleeding?" That is the wrong conversation to be having three years into a build.
The right first investment is the cost-per-task accounting layer. Without it, every other optimisation is guesswork. With it, the engineering priorities reorder themselves automatically — routing first, caching second, architecture changes third, prompt tuning a distant fourth. The teams running disciplined AI stacks in 2026 are not the ones with the cleverest prompts; they are the ones who knew, at every point, what each task cost and why.