Why per-token pricing misleads
Per-token pricing is a technically honest way for the provider to bill, because tokens are what they consume. It is a structurally misleading way for an operator to plan, because it conflates three different things into one number.
- Input tokens are heavily influenced by the size of the context the prompt carries — the system prompt, the few-shot examples, the retrieved knowledge. A clever team can cut input tokens by half with retrieval discipline. A careless team can bloat them by a factor of ten without noticing.
- Output tokens are heavily influenced by the structure of the response. A free-form answer is longer than a constrained schema-fill. A model that rambles is more expensive than one that does not.
- Per-task call count is the elephant in the room. A model that gets the right answer in one call is fundamentally different from a model that needs three retries. One initial call plus three retries is four calls per task, so the first model can charge triple the per-token rate of the second and still be cheaper per task, provided the token counts per call are similar.
The headline number does not capture any of this. Cost per task does.
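To make the arithmetic concrete, here is a minimal sketch that combines the three drivers into one cost-per-task figure. The `ModelProfile` type, the rates, and the token counts are all hypothetical, chosen only to illustrate the triple-rate scenario above.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures; every number used below is illustrative."""
    input_rate: float      # currency per 1,000 input tokens
    output_rate: float     # currency per 1,000 output tokens
    input_tokens: int      # tokens sent per call
    output_tokens: int     # tokens returned per call
    calls_per_task: float  # average calls needed to finish one task


def cost_per_task(m: ModelProfile) -> float:
    """Cost of one call multiplied by the average number of calls per task."""
    per_call = (m.input_tokens / 1000) * m.input_rate \
        + (m.output_tokens / 1000) * m.output_rate
    return per_call * m.calls_per_task


# A model at triple the per-token rate that finishes in a single call...
premium = ModelProfile(0.03, 0.06, 4000, 500, calls_per_task=1)
# ...versus a cheaper model that needs the first call plus three retries.
budget = ModelProfile(0.01, 0.02, 4000, 500, calls_per_task=4)

print(f"premium: {cost_per_task(premium):.3f}")
print(f"budget:  {cost_per_task(budget):.3f}")
```

With these made-up figures the premium model comes out at 0.150 per task and the budget model at 0.200, even though the budget model's headline rate is a third of the premium one.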
How we measure cost per task
Every workflow we run writes a measurement to the time-series store at the end of each execution. The measurement includes total tokens in, total tokens out, total currency cost across all model calls in the workflow, and a flag for whether the task succeeded. The time-series store knows how to aggregate. We can ask: over the last thirty days, what is the average currency cost of one successful execution of this workflow?
That single number — currency per successful task — is the unit on which we make decisions. It absorbs the call count, the retry behaviour, the input bloat, the output verbosity, and the failure rate into one comparable measure.
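As a rough sketch of that aggregation, the snippet below computes currency per successful task from a list of execution records. The `Execution` type and its field names are assumptions for illustration, not the actual schema of the time-series store. It also assumes that spend on failed executions is charged against the successful ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Execution:
    """One workflow run; field names are hypothetical."""
    tokens_in: int
    tokens_out: int
    cost: float       # total currency cost across all model calls in the run
    succeeded: bool


def cost_per_successful_task(executions: list[Execution]) -> Optional[float]:
    """Total spend, including failed runs, divided by the number of successes."""
    successes = sum(1 for e in executions if e.succeeded)
    if successes == 0:
        return None  # undefined when nothing succeeded
    return sum(e.cost for e in executions) / successes


# Illustrative data: four successes and one failure that cost twice as much.
runs = [
    Execution(1200, 300, 0.02, True),
    Execution(1200, 300, 0.02, True),
    Execution(2400, 600, 0.04, False),
    Execution(1200, 300, 0.02, True),
    Execution(1200, 300, 0.02, True),
]
print(cost_per_successful_task(runs))
```

In a real deployment the same division would more likely be expressed as a query in the time-series store; the point here is only which numerator and denominator to use.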
When we evaluate a new model or a new provider for an existing workflow, we run a controlled comparison: the same task, ideally the same hundred examples, run against both the incumbent and the candidate. We measure cost per successful task on each side and compare. The headline rate is irrelevant. The aggregate is the answer.
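A comparison of this kind can be sketched as a small harness that feeds identical examples to both sides. Everything here is a stand-in: the `Runner` signature, the two stub runners, and their costs and failure patterns are invented for illustration, and a real runner would call the actual workflow.

```python
from typing import Callable, Iterable, Optional, Tuple

# A runner takes one example and returns (currency cost, succeeded).
Runner = Callable[[str], Tuple[float, bool]]


def evaluate(runner: Runner, examples: Iterable[str]) -> Optional[float]:
    """Cost per successful task over a fixed example set."""
    total_cost, successes = 0.0, 0
    for example in examples:
        cost, ok = runner(example)
        total_cost += cost  # failed attempts still count towards spend
        if ok:
            successes += 1
    return total_cost / successes if successes else None


# Stub runners with made-up behaviour, for illustration only.
def incumbent(example: str) -> Tuple[float, bool]:
    index = int(example.split("-")[1])
    return 0.010, index % 5 != 0  # cheaper per attempt, fails one in five


def candidate(example: str) -> Tuple[float, bool]:
    return 0.011, True            # dearer per attempt, always succeeds


examples = [f"example-{i}" for i in range(100)]
print("incumbent:", evaluate(incumbent, examples))
print("candidate:", evaluate(candidate, examples))
```

In this toy setup the candidate costs ten percent more per attempt but comes out cheaper per successful task, which is the comparison that matters.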
The non-obvious places cost hides
Once you start measuring cost per task, you discover the non-obvious places where it hides.
Retries on flaky structured output. If a model returns malformed JSON ten percent of the time and the workflow retries until it succeeds, each successful task takes 1 / 0.9 ≈ 1.11 calls on average, so the effective cost per successful task is about eleven percent higher than the per-call cost suggests (assuming retries fail independently). Switching to a model with stricter structured-output adherence can cut cost per task even if the headline rate is higher.
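The retry overhead can be estimated with a one-line formula. The sketch below assumes each attempt fails independently with the same probability, so the number of calls until the first success follows a geometric distribution with mean 1 / (1 - p). The per-call costs and failure rates are hypothetical.

```python
def expected_calls(failure_rate: float) -> float:
    """Mean number of calls when retrying until the first success.

    Assumes independent attempts with a constant failure rate.
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1 / (1 - failure_rate)


# Hypothetical per-call costs and malformed-output rates for two models.
flaky = 1.00 * expected_calls(0.10)   # cheaper per call, 10% malformed JSON
strict = 1.05 * expected_calls(0.01)  # 5% dearer per call, 1% malformed JSON

print(f"flaky:  {flaky:.3f} per successful task")
print(f"strict: {strict:.3f} per successful task")
```

Under these assumptions the stricter model is cheaper per successful task despite the higher per-call cost.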
Retrieval bloat. A workflow that pulls fifty thousand tokens of context and only uses two thousand of them is paying forty-eight thousand tokens of input tax per call. Tightening the retrieval pipeline often saves more cost than switching providers.
Hidden chains. A workflow that looks like one task in the dashboard often involves three or four model calls under the hood — a classifier, a generator, a critic, a refiner. The cost per task is the sum across the chain, not the cost of any single call.
Failed tasks. Tokens spent on tasks that did not succeed are still tokens spent. If your success rate is eighty percent, your effective cost per successful task is twenty-five percent higher than the cost per attempted task.
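The last two effects compound. As an illustration, the sketch below sums a four-call chain and then divides by the success rate. The stage names follow the example above, and the per-call costs are invented.

```python
# Hypothetical cost per call for each stage of one "single" task.
chain = {
    "classifier": 0.001,
    "generator": 0.012,
    "critic": 0.004,
    "refiner": 0.008,
}

# The cost of one attempt is the sum across the chain, not any single call.
cost_per_attempt = sum(chain.values())

# Spend on failed attempts is carried by the attempts that succeed.
success_rate = 0.80
cost_per_success = cost_per_attempt / success_rate

print(f"per attempt: {cost_per_attempt:.5f}")
print(f"per success: {cost_per_success:.5f}")
print(f"uplift:      {cost_per_success / cost_per_attempt - 1:.0%}")
```

The uplift line prints 25%, matching the eighty percent success rate described above.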
Designing for cost-per-task discipline
Once you measure cost per task, the architectural moves that reduce it become obvious.
- Local first. The cheapest model that can credibly do the task is the one running on your own hardware at effectively zero marginal cost. Route to it first; escalate only on failure.
- Schema-constrained output. A model writing JSON to a declared schema produces fewer tokens, fewer retries, and fewer downstream parse failures than the same model writing free-form prose.
- Tighter retrieval. Retrieve candidates, rerank them, and keep only a small top-k; this beats passing a large, loosely filtered top-k to the model. Less context, better focus, lower cost.
- Critic with cheaper models. The critic-and-refiner pattern uses two calls per task but often produces a higher success rate at lower aggregate cost than one call with a frontier model.
- Cache the deterministic part. Many workflows have a deterministic preamble that does not change between calls. Caching that part — at the prompt level or the application level — eliminates redundant token spend.
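The first two moves combine naturally: try the local model, accept its answer only if it parses against the expected shape, and escalate otherwise. The following is a minimal sketch under stated assumptions. The tier tuple layout, the stub models, their costs, and the key-presence check standing in for real schema validation are all hypothetical.

```python
import json
from typing import Callable, List, Optional, Set, Tuple

# Each tier is (name, cost per call, callable returning raw model text).
Tier = Tuple[str, float, Callable[[str], str]]


def is_valid(raw: str, required_keys: Set[str]) -> bool:
    """Accept only a JSON object that contains every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data)


def run_with_escalation(
    prompt: str, tiers: List[Tier], required_keys: Set[str]
) -> Tuple[Optional[dict], float, Optional[str]]:
    """Try tiers in order; return (answer, total cost, name of tier used)."""
    total = 0.0
    for name, cost, call in tiers:
        total += cost  # a rejected attempt still costs money
        raw = call(prompt)
        if is_valid(raw, required_keys):
            return json.loads(raw), total, name
    return None, total, None


# Stub models: the local one returns prose instead of JSON on "hard" prompts.
def local_model(prompt: str) -> str:
    return "I am not sure." if "hard" in prompt else '{"label": "invoice"}'


def cloud_model(prompt: str) -> str:
    return '{"label": "contract"}'


tiers = [("local", 0.0, local_model), ("cloud", 0.02, cloud_model)]
print(run_with_escalation("easy document", tiers, {"label"}))
print(run_with_escalation("hard document", tiers, {"label"}))
```

Returning the accumulated cost alongside the answer keeps the measurement honest: an escalated task is charged for every tier it touched, not just the one that succeeded.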
The numbers we see in practice
For our own production workflows, cost per task ranges from effectively zero (entirely local, no cloud calls) to a few pence (light cloud assist) to a small number of pounds (frontier-class single-shot tasks like long-document analysis). The distribution is heavily skewed: the median workflow costs under a penny per task, the ninety-fifth percentile under twenty pence, and the long tail of high-cost frontier work accounts for most of the absolute spend.
The implication for cost management is that the high-leverage move is almost never to renegotiate the per-token rate on the median workflow. It is to identify the small number of expensive workflows and reduce their cost per task — usually by routing more of them to local models, tightening retrieval, or restructuring them as multi-call chains using cheaper components rather than single-shot frontier calls.
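Finding those few expensive workflows is a matter of ranking by absolute spend rather than by cost per task alone. The sketch below uses invented workflow names, costs (in pounds), and monthly volumes purely to show the shape of the calculation.

```python
# name: (cost per task in pounds, tasks per month); all figures hypothetical.
workflows = {
    "ticket-triage": (0.004, 200_000),
    "email-draft": (0.03, 20_000),
    "contract-review": (2.50, 3_000),
    "invoice-extract": (0.0, 50_000),  # entirely local, no cloud calls
}

# Absolute monthly spend per workflow, and the overall total.
spend = {name: cost * volume for name, (cost, volume) in workflows.items()}
total = sum(spend.values())

# Rank by spend, highest first, with each workflow's share of the total.
for name, amount in sorted(spend.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {amount:9.2f} {amount / total:6.1%}")
```

In this made-up portfolio the lowest-volume workflow dominates the bill, which is the pattern that makes per-workflow cost reduction a higher-leverage move than a blanket rate negotiation.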
The takeaway
Per-token pricing is the unit the provider sells in. Cost per task is the unit the business operates in. Translating between them is the job of the operator, and the gap between the two is where the most consequential AI architecture decisions get made.
If you do not yet measure cost per task on every workflow you ship, that is the first instrumentation to add. The measurement is cheap. The decisions it enables are not.
Working on this?
For operators evaluating sovereign-infrastructure architecture for a business of meaningful scale, we run a quarterly cohort of stack-design engagements.