JW · Josh Weir

Cost per task, not cost per token — the right unit for AI economics

The AI industry has trained operators to think about cost in tokens. The marketing material is in tokens, the dashboards are in tokens, the comparison tables between providers are in tokens. This is a category error. Tokens are an implementation detail. The unit that matters to a business is cost per useful task completed, and it is almost never what the per-token math suggests.

I have watched too many teams compare two providers by their headline per-million-token rate, choose the cheaper one, and ship an architecture they will regret in six months because the cheaper provider needed three calls to do what the more expensive one did in one. The correct unit collapses that confusion. This piece is the framework I use to do the calculation honestly.

Why per-token pricing misleads

Per-token pricing is a technically honest way for the provider to bill, because tokens are what they consume. It is a structurally misleading way for an operator to plan, because the headline rate hides three separate cost drivers behind one number.

  • Input tokens are heavily influenced by the size of the context the prompt carries — the system prompt, the few-shot examples, the retrieved knowledge. A clever team can cut input tokens by half with retrieval discipline. A careless team can bloat them by a factor of ten without noticing.
  • Output tokens are heavily influenced by the structure of the response. A free-form answer is longer than a constrained schema-fill. A model that rambles is more expensive than one that does not.
  • Per-task call count is the elephant in the room. A model that gets the right answer in one call is fundamentally different from a model that needs three retries, which is four calls in total. Assuming similar token counts per call, the per-token rate of the first model can be triple the rate of the second, and it will still be cheaper per task.

The headline number does not capture any of this. Cost per task does.
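The interaction between the three factors above can be sketched in a few lines. This is a minimal illustration, not a pricing tool: the `ProviderProfile` type, the rates, the token counts, and the call counts are all hypothetical and are not taken from any real provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderProfile:
    """Illustrative numbers only; not taken from any real price list."""
    input_rate_per_million: float   # currency per million input tokens
    output_rate_per_million: float  # currency per million output tokens
    input_tokens_per_call: int
    output_tokens_per_call: int
    calls_per_task: float           # average calls needed to complete one task


def cost_per_call(p: ProviderProfile) -> float:
    """Currency cost of a single call at the profile's token counts and rates."""
    input_cost = p.input_tokens_per_call * p.input_rate_per_million
    output_cost = p.output_tokens_per_call * p.output_rate_per_million
    return (input_cost + output_cost) / 1_000_000


def cost_per_task(p: ProviderProfile) -> float:
    """Per-call cost multiplied by the average number of calls per task."""
    return cost_per_call(p) * p.calls_per_task


# Hypothetical comparison: the premium provider charges triple the rate
# but finishes in one call; the budget provider needs four calls.
premium = ProviderProfile(3.00, 15.00, 2_000, 500, calls_per_task=1.0)
budget = ProviderProfile(1.00, 5.00, 2_000, 500, calls_per_task=4.0)

if __name__ == "__main__":
    print(f"premium: {cost_per_task(premium):.4f} per task")
    print(f"budget:  {cost_per_task(budget):.4f} per task")
```

With these made-up numbers the premium provider costs 0.0135 per task and the budget provider 0.0180, even though the budget provider's headline rate is a third of the premium one.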

How we measure cost per task

Every workflow we run writes a measurement to the time-series store at the end of each execution. The measurement includes total tokens in, total tokens out, total currency cost across all model calls in the workflow, and a flag for whether the task succeeded. The time-series store knows how to aggregate. We can ask: over the last thirty days, what is the average currency cost of one successful execution of this workflow?

That single number — currency per successful task — is the unit on which we make decisions. It absorbs the call count, the retry behaviour, the input bloat, the output verbosity, and the failure rate into one comparable measure.
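As a rough sketch of that aggregation, the snippet below computes currency per successful task from a list of execution records. The `ExecutionRecord` type and its field names are assumptions made for illustration; the article does not specify the schema or the time-series store. It also assumes that spend on failed executions is charged against the successful ones, which is what lets the number absorb the failure rate.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExecutionRecord:
    """One measurement per workflow execution (hypothetical field names)."""
    workflow: str
    tokens_in: int
    tokens_out: int
    cost: float       # total currency cost across all model calls in the run
    succeeded: bool


def cost_per_successful_task(
    records: Iterable[ExecutionRecord], workflow: str
) -> Optional[float]:
    """Total spend on a workflow divided by its number of successful runs.

    Returns None when the workflow has no successful runs, because the
    ratio is undefined in that case.
    """
    total_cost = 0.0
    successes = 0
    for record in records:
        if record.workflow != workflow:
            continue
        total_cost += record.cost      # failed runs still cost money
        successes += int(record.succeeded)
    return total_cost / successes if successes else None


if __name__ == "__main__":
    sample = [
        ExecutionRecord("summarise", 1200, 300, 0.02, True),
        ExecutionRecord("summarise", 1500, 0, 0.03, False),
        ExecutionRecord("summarise", 900, 250, 0.01, True),
    ]
    print(cost_per_successful_task(sample, "summarise"))
```

In a real deployment the same ratio would more likely be expressed as a query in the time-series store, with a time window such as the last thirty days applied before aggregating.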

When we evaluate a new model or a new provider for an existing workflow, we run a controlled comparison: the same task, ideally the same hundred examples, run against both the incumbent and the candidate. We measure cost per successful task on each side. We compare. The headline rate is irrelevant. The aggregate is the answer.
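A controlled comparison of this kind might look like the sketch below. It assumes each side can be wrapped in a function that takes one example and returns a `(cost, succeeded)` pair. The `Runner` type and both stub runners are hypothetical stand-ins for real workflow executions, and their costs and failure patterns are invented.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# A runner executes the workflow on one example and reports (cost, succeeded).
Runner = Callable[[str], Tuple[float, bool]]


def cost_per_success(runner: Runner, examples: Sequence[str]) -> Optional[float]:
    """Total spend across all examples divided by the number of successes."""
    total_cost = 0.0
    successes = 0
    for example in examples:
        cost, succeeded = runner(example)
        total_cost += cost
        successes += int(succeeded)
    return total_cost / successes if successes else None


def compare(
    incumbent: Runner, candidate: Runner, examples: Sequence[str]
) -> Dict[str, Optional[float]]:
    """Run both sides on the same examples and report cost per success."""
    return {
        "incumbent": cost_per_success(incumbent, examples),
        "candidate": cost_per_success(candidate, examples),
    }


# Stub runners with invented behaviour, for demonstration only.
def incumbent_stub(example: str) -> Tuple[float, bool]:
    return 0.010, True  # pricier per run, always succeeds


def candidate_stub(example: str) -> Tuple[float, bool]:
    index = int(example.split("-")[1])
    return 0.004, index % 4 == 0  # cheaper per run, succeeds on 3 of 10


if __name__ == "__main__":
    examples = [f"example-{i}" for i in range(10)]
    print(compare(incumbent_stub, candidate_stub, examples))
```

In this contrived run the candidate is 60 percent cheaper per execution yet about a third more expensive per successful task, which is the pattern the headline rate cannot show.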

The non-obvious places cost hides

Once you start measuring cost per task, you discover the non-obvious places where it hides.

Retries on flaky structured output. If a model returns malformed JSON ten percent of the time and the workflow retries until it succeeds, each success takes an average of 1/0.9 ≈ 1.11 calls, so the effective cost per successful task is about eleven percent higher than the per-call cost suggests. Switching to a model with stricter structured-output adherence can cut cost per task even if the headline rate is higher.
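The retry arithmetic can be checked with a short sketch. It assumes every call fails independently with the same probability and that the workflow retries without a cap, so the number of calls per success follows a geometric distribution with mean 1 / (1 - p). Both function names are illustrative.

```python
def expected_calls_per_success(failure_rate: float) -> float:
    """Mean number of calls needed to get one success.

    Assumes independent failures and unlimited retries, so the call
    count is geometrically distributed with mean 1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def cost_uplift(failure_rate: float) -> float:
    """Fractional increase in cost per success relative to a single call."""
    return expected_calls_per_success(failure_rate) - 1.0


if __name__ == "__main__":
    for rate in (0.05, 0.10, 0.20):
        print(f"failure rate {rate:.0%}: uplift {cost_uplift(rate):.1%}")
```

A ten percent failure rate works out to roughly 1.11 calls per success, and the uplift grows faster than the failure rate itself as reliability drops.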

Retrieval bloat. A workflow that pulls fifty thousand tokens of context and only uses two thousand of them is paying forty-eight thousand tokens of input tax per call. Tightening the retrieval pipeline often saves more cost than switching providers.

Hidden chains. A workflow that looks like one task in the dashboard often involves three or four model calls under the hood — a classifier, a generator, a critic, a refiner. The cost per task is the sum across the chain, not the cost of any single call.
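One way to make a hidden chain visible is to record every model call against the task it belongs to and sum them. The sketch below uses the four stage names from the paragraph above; the `ModelCall` type and all costs are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ModelCall:
    """One model call made while executing a task (hypothetical shape)."""
    stage: str
    cost: float


def task_cost(calls: Iterable[ModelCall]) -> float:
    """Cost of the task is the sum over every call in the chain."""
    return sum(call.cost for call in calls)


def cost_by_stage(calls: Iterable[ModelCall]) -> Dict[str, float]:
    """Break the task cost down by stage, summing repeated stages."""
    totals: Dict[str, float] = {}
    for call in calls:
        totals[call.stage] = totals.get(call.stage, 0.0) + call.cost
    return totals


if __name__ == "__main__":
    chain = [
        ModelCall("classifier", 0.001),
        ModelCall("generator", 0.012),
        ModelCall("critic", 0.003),
        ModelCall("refiner", 0.008),
    ]
    print(f"task cost: {task_cost(chain):.3f}")
    print(cost_by_stage(chain))
```

The per-stage breakdown is useful on its own: it shows which link in the chain dominates, and that is usually where a cheaper component pays off first.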

Failed tasks. Tokens spent on tasks that did not succeed are still tokens spent. If your success rate is eighty percent, your effective cost per successful task is twenty-five percent higher than the cost per attempted task.

Designing for cost-per-task discipline

Once you measure cost per task, the architectural moves that reduce it become obvious.

  • Local first. The cheapest model that can credibly do the task is the one running on your own hardware at near-zero marginal cost. Route to it first; escalate only on failure.
  • Schema-constrained output. A model writing JSON to a declared schema produces fewer tokens, fewer retries, and fewer downstream parse failures than the same model writing free-form prose.
  • Tighter retrieval. Top-k cut hard, with reranking on the candidates, beats top-k loose. Less context, better focus, lower cost.
  • Critic with cheaper models. The critic-and-refiner pattern uses two calls per task but often produces a higher success rate at lower aggregate cost than one call with a frontier model.
  • Cache the deterministic part. Many workflows have a deterministic preamble that does not change between calls. Caching that part — at the prompt level or the application level — eliminates redundant token spend.
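The local-first rule in the list above can be sketched as a simple escalation loop. This is an illustration under stated assumptions: each tier is a function returning a result (or `None` on failure) plus its cost, and both stub models, their costs, and the length-based failure rule are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A model attempt returns (result or None on failure, cost of the attempt).
Attempt = Tuple[Optional[str], float]
Model = Callable[[str], Attempt]


def route(task: str, tiers: Sequence[Model]) -> Attempt:
    """Try each tier in order, cheapest first, and stop at the first success.

    The returned cost includes every failed attempt along the way, so it
    is the true cost of the task rather than the cost of the last call.
    """
    total_cost = 0.0
    for model in tiers:
        result, cost = model(task)
        total_cost += cost
        if result is not None:
            return result, total_cost
    return None, total_cost


# Stub tiers with invented behaviour, for demonstration only.
def local_model(task: str) -> Attempt:
    # Pretend the local model only copes with short tasks.
    return (task.upper(), 0.0) if len(task) <= 20 else (None, 0.0)


def cloud_model(task: str) -> Attempt:
    return task.upper(), 0.02


if __name__ == "__main__":
    print(route("short task", [local_model, cloud_model]))
    print(route("a much longer task that needs escalation", [local_model, cloud_model]))
```

In practice the success check would be something like schema validation or a critic verdict rather than a `None` return, but the accounting principle is the same: the cost of a failed local attempt belongs to the task.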

The numbers we see in practice

For our own production workflows, cost per task ranges from effectively zero (entirely local, no cloud calls) to a few pence (light cloud assist) to a small number of pounds (frontier-class single-shot tasks like long-document analysis). The distribution is heavily skewed: the median workflow costs under a penny per task, the ninety-fifth percentile under twenty pence, and the long tail of high-cost frontier work accounts for most of the absolute spend.

The implication for cost management is that the high-leverage move is almost never to renegotiate the per-token rate on the median workflow. It is to identify the small number of expensive workflows and reduce their cost per task — usually by routing more of them to local models, tightening retrieval, or restructuring them as multi-call chains using cheaper components rather than single-shot frontier calls.
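Finding that small number of expensive workflows is a ranking exercise over total spend. The sketch below sorts workflows by spend and reports each one's share and the cumulative share. The workflow names and figures are invented, not the production numbers described above.

```python
from typing import Dict, List, Tuple


def spend_ranking(spend_by_workflow: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Rank workflows by total spend, highest first.

    Returns (workflow, share of total spend, cumulative share) tuples.
    An empty or all-zero input yields an empty list.
    """
    total = sum(spend_by_workflow.values())
    if total <= 0:
        return []
    ranked = sorted(spend_by_workflow.items(), key=lambda item: item[1], reverse=True)
    rows: List[Tuple[str, float, float]] = []
    cumulative = 0.0
    for workflow, spend in ranked:
        share = spend / total
        cumulative += share
        rows.append((workflow, share, cumulative))
    return rows


if __name__ == "__main__":
    # Hypothetical monthly spend per workflow, in arbitrary currency units.
    monthly_spend = {
        "ticket-triage": 5.0,
        "long-document-analysis": 80.0,
        "email-tagging": 3.0,
        "contract-review": 12.0,
    }
    for workflow, share, cumulative in spend_ranking(monthly_spend):
        print(f"{workflow:<24} {share:6.1%}  cumulative {cumulative:6.1%}")
```

Total spend per workflow is cost per task multiplied by task volume, so a workflow can top this list through either factor. Cost per task tells you which lever to pull once the ranking has told you where to look.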

The takeaway

Per-token pricing is the unit the provider sells in. Cost per task is the unit the business operates in. Translating between them is the job of the operator, and the gap between the two is where the most consequential AI architecture decisions get made.

If you do not yet measure cost per task on every workflow you ship, that is the first instrumentation to add. The measurement is cheap. The decisions it enables are not.

