The breathless 2024 framing of AI agents has matured into something more useful and considerably duller: a loosely coupled chain of model calls, tool invocations, and deterministic glue, doing tasks a junior person used to do, at a fraction of the cost. The problem is that most teams have not done the cost arithmetic, and the spreadsheet they think they're running is not the spreadsheet they're actually running.
I run an orchestration layer that handles tens of thousands of agent tasks a month across a hybrid local/cloud substrate. I have been forced to be honest about the per-task cost because the bills add up, the latency adds up, and the failure modes especially add up. This piece is the version of the unit-economics conversation I wish someone had handed me eighteen months ago — written down, with numbers, with the parts that don't survive contact with reality flagged as such.
What an agent task actually is
Before we can talk about unit cost we have to agree on what we're costing. An agent task, for the purposes of this piece, is one user-visible outcome that requires one or more model calls, possibly with tool invocations between them. "Summarise this email and decide whether it needs a human reply" is one task. "Plan a multi-step research report" is one task even if it cascades into thirty subordinate calls underneath. The unit we care about is the customer-meaningful outcome, not the model call.
This matters because the popular metric, cost per token, is misleading at the task layer. A task that requires three large-context calls on a closed frontier model can cost twenty times as much as one that solves the same problem with five small-context calls on a 7B local model with a tool-use harness. Cost per token does not capture that structural choice. Cost per task does, and cost per task at acceptable quality even more so.
The first discipline of running an agent stack at any scale is logging cost-per-task into a time-series database alongside the latency, the model identity, and the success indicator. Without that, you are flying blind, and the bill will surprise you the way bills always do.
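As a minimal sketch of what one such log record might look like: the `TaskRecord` class, its field names, and the JSON-lines output below are assumptions made for illustration, not the schema of any particular time-series database.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class TaskRecord:
    """One row per customer-meaningful outcome (hypothetical schema)."""
    task_id: str
    task_type: str      # e.g. "email_triage"
    model: str          # model identity, e.g. "local-7b"
    cost_usd: float     # total cost attributed to this task
    latency_ms: float   # wall-clock time for the whole task
    success: bool       # the success indicator
    ts: float           # Unix timestamp, used as the time-series index


def to_log_line(record: TaskRecord) -> str:
    """Serialise a record as one JSON line, ready to ship to a metrics store."""
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
record = TaskRecord(
    task_id="t-001",
    task_type="email_triage",
    model="local-7b",
    cost_usd=0.0003,
    latency_ms=420.0,
    success=True,
    ts=time.time(),
)
line = to_log_line(record)
print(line)
```

Keeping cost, latency, model identity, and success on the same row is the point: it lets you ask for cost per successful task by model without joining across separate logs.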
Token economics in 2026 — the headline numbers
Frontier closed models continue to set the upper end of the cost curve. Mid-tier closed models, the workhorse APIs from the major frontier labs, sit somewhere between two and ten dollars per million output tokens depending on tier and provider. Smaller distilled closed models drop below a dollar per million output tokens. Open-weights models (Llama 3.3 70B, Mistral-class mid-size models, smaller Llama variants) cost essentially nothing per token if you are running them on hardware you already own and amortising correctly. They cost a few cents per million tokens if you rent them from a hosted open-weights endpoint.
The orders-of-magnitude gap between local and closed-frontier per-token cost is the single most important number in agent economics. It means the choice of where a sub-call lands is far more consequential than the choice of what prompt sits inside it. Get the dispatch right and the prompt becomes a tuning detail. Get the dispatch wrong and no amount of prompt engineering will save the bill.
The chart below shows representative cost per 1,000 routine extraction-and-classification tasks across four lanes. These are the lanes that dominate operational agent traffic — the long tail of routing, structured-output and short-form judgement that adds up to most of the work.
The cost-per-1K-tasks reality, by task type
The right framing is task-type-aware. The same lane is not the right answer for every task type. Below is the cost-per-1K-tasks table I run, with rough numbers that hold up across our actual traffic. They are not benchmark numbers; they are operating numbers.
| Task type | Local 7B (own hw) | Open-weights 70B (rented) | Mid-tier closed | Frontier closed |
|---|---|---|---|---|
| Intent classification (short) | $0.05 | $0.40 | $1.10 | $8.50 |
| Structured JSON extraction | $0.15 | $0.80 | $2.40 | $14.00 |
| Email triage and short reply draft | $0.30 | $1.40 | $3.80 | $22.00 |
| Long-form summarisation (2K context) | $0.90 | $3.20 | $8.00 | $45.00 |
| Multi-step research with tool use | $2.40 | $9.50 | $26.00 | $140.00 |
| Genuinely novel reasoning (long-context) | fails too often | borderline | $48.00 | $220.00 |
Two things jump out. First, the gap between lanes is large enough that the dispatch decision dominates everything else. Second, the bottom row — genuinely novel reasoning — is the only row where a frontier closed model is unambiguously the right choice. Everything above it has a local or open-weights answer that is good enough for production, often better for production, because the latency and the audit story are cleaner.
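To make the size of the gap concrete, here is a small sketch that computes how many times more each lane costs than the local lane, using the figures from the table above. The dictionary keys and the `lane_multiplier` helper are hypothetical names; the bottom row is left out because the local lane has no usable number there.

```python
# Cost per 1,000 tasks in USD, copied from the table above.
COST_PER_1K = {
    "Intent classification (short)":
        {"local_7b": 0.05, "open_70b": 0.40, "mid_closed": 1.10, "frontier": 8.50},
    "Structured JSON extraction":
        {"local_7b": 0.15, "open_70b": 0.80, "mid_closed": 2.40, "frontier": 14.00},
    "Email triage and short reply draft":
        {"local_7b": 0.30, "open_70b": 1.40, "mid_closed": 3.80, "frontier": 22.00},
    "Long-form summarisation (2K context)":
        {"local_7b": 0.90, "open_70b": 3.20, "mid_closed": 8.00, "frontier": 45.00},
    "Multi-step research with tool use":
        {"local_7b": 2.40, "open_70b": 9.50, "mid_closed": 26.00, "frontier": 140.00},
}


def lane_multiplier(task_type: str, lane: str, baseline: str = "local_7b") -> float:
    """Return how many times more `lane` costs than `baseline` for a task type."""
    row = COST_PER_1K[task_type]
    return row[lane] / row[baseline]


for task_type in COST_PER_1K:
    ratio = lane_multiplier(task_type, "frontier")
    print(f"{task_type}: frontier costs about {ratio:.0f}x the local lane")
```

On these figures the frontier lane costs roughly 50 to 170 times the local lane, depending on the row.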
Latency: the second axis nobody costs
Per-task cost is not the only operational variable. Latency is the other one, and most teams underweight it because it does not appear on a bill. A frontier closed model adds anywhere from 800ms to several seconds of round-trip time before the first token, on top of provider-side queueing under load. A local 7B model returns first tokens in tens of milliseconds and sustains north of 50 tokens per second on competent hardware. For synchronous user-facing workflows this is not a tuning question; it is the difference between feeling responsive and feeling broken.
The discipline I run by: synchronous user-facing tasks land local by default and only escalate to a closed-frontier endpoint when the local lane has explicitly failed an evaluation gate. Asynchronous batch work — overnight digest, briefing assembly, scheduled analysis — has a wider tolerance and can take whichever lane gives the best cost-per-quality. The two regimes have different binding constraints; do not let one architecture serve both badly.
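The two regimes can be expressed as one small dispatch rule. This is a sketch under stated assumptions: the lane names, the `min_quality` threshold, and the quality scores in the example are invented for illustration, and "best cost-per-quality" is read here as "cheapest lane that clears the quality bar", which is only one reasonable interpretation.

```python
from typing import Dict


def choose_lane(
    synchronous: bool,
    local_passed_gate: bool,
    batch_lanes: Dict[str, Dict[str, float]],
    min_quality: float = 0.9,
) -> str:
    """Pick a lane for a task.

    Synchronous tasks go local unless the local lane failed its eval gate.
    Asynchronous tasks take the cheapest lane whose quality clears the bar.
    """
    if synchronous:
        return "local" if local_passed_gate else "frontier_closed"

    eligible = {
        name: stats for name, stats in batch_lanes.items()
        if stats["quality"] >= min_quality
    }
    if not eligible:
        raise ValueError("no lane meets the quality bar")
    return min(eligible, key=lambda name: eligible[name]["cost"])


# Costs per 1K tasks are taken from the email-triage row of the table;
# the quality scores are hypothetical eval results.
batch_lanes = {
    "local": {"cost": 0.30, "quality": 0.86},
    "open_weights_70b": {"cost": 1.40, "quality": 0.93},
    "frontier_closed": {"cost": 22.00, "quality": 0.97},
}

print(choose_lane(True, True, batch_lanes))    # synchronous, gate passed
print(choose_lane(False, True, batch_lanes))   # asynchronous batch work
```

Keeping the synchronous branch separate from the batch branch mirrors the point above: the binding constraint is latency in one regime and cost-per-quality in the other.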
The small-model-with-tools finding
The empirical result that has shifted my architecture more than any other in the last twelve months is this: a small open-weights model with structured tool use beats a large closed model used in isolation, on most production agent tasks. The mechanism is straightforward. The small model is a controller; the tools — a search index, a database query, a code executor, a structured-extraction subroutine — do the heavy lifting. The controller does not need frontier reasoning to decide which tool to invoke; it needs reliable function-calling and obedient structured output. Both are properties that 7B–13B open-weights models hit at production quality in 2026.
The implication is that the right move on most agent workloads is not to throw a bigger model at the problem; it is to give a smaller model better tools. A model that can query a knowledge graph, run a calculator, and check a structured schema will outperform a frontier model running on prompt engineering alone, on the long tail of business workflows. It will also cost between one and two orders of magnitude less and run between five and twenty times faster.
This shifts where the engineering effort should sit. Build the tools. Wire them carefully. Treat the small model as a dispatcher whose job is to route, not to reason. The reasoning lives in the tools and in the deterministic glue around them.
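A minimal sketch of the controller-plus-tools shape, assuming the small model emits a JSON object naming a tool and its arguments. The tool names, the registry, and the output format are all hypothetical; no real model call is made here, and the string passed to `dispatch` stands in for the model's structured output.

```python
import json
from typing import Any, Callable, Dict

# Registry of deterministic tools the controller is allowed to invoke.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("add")
def add(a: float, b: float) -> float:
    return a + b


@tool("lookup_customer")
def lookup_customer(customer_id: str) -> dict:
    # Stand-in for a real database query.
    fake_db = {"c-42": {"name": "Example Ltd", "tier": "gold"}}
    return fake_db.get(customer_id, {})


def dispatch(model_output: str) -> Any:
    """Validate the controller's structured output, then run the named tool."""
    call = json.loads(model_output)
    name = call.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    return TOOLS[name](**args)


# The strings below play the role of the small model's function-calling output.
print(dispatch('{"tool": "add", "arguments": {"a": 2, "b": 3}}'))
print(dispatch('{"tool": "lookup_customer", "arguments": {"customer_id": "c-42"}}'))
```

The controller only has to produce a valid tool name and arguments; the validation and the actual work sit in deterministic code that can be tested without a model in the loop.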
When the closed frontier still wins
Three categories where I deliberately route to a closed frontier model, even at the cost premium:
- Genuinely novel reasoning over a domain the open-weights ecosystem has not yet absorbed. The standard example is reading a long, dense legal document and inferring the unstated obligations. Open-weights models are closing the gap fast, but the frontier is still the frontier on this kind of work.
- Long-context coherence over 100K-token-plus inputs where chunked retrieval would compromise the answer. Frontier closed models hold this together better than the open-weights alternatives at present.
- Multimodal cross-reasoning — combining vision, audio, and text in a single inference step at frontier quality. The open-weights gap on individual modalities is small; the cross-modal frontier is wider and changes slowly.
For everything outside those categories I try the open-weights lane first and only escalate when the eval data forces it. This is the opposite of how most teams operate. Most teams default to the frontier, and only consider local when the bill becomes impossible to ignore. Reverse the default and the architecture stays sane.
The hidden cost: orchestration overhead
Per-call cost is the visible part. The invisible part is orchestration overhead: retries on transient failures, tool-call round-trips, validator passes, schema-checking re-prompts, the second model call you make to verify the first. A naive implementation can multiply the visible cost by three or four without anyone noticing, because the wrapper code is not under the same observability as the call itself.
The discipline that catches this is end-to-end task accounting. Every customer-meaningful outcome carries a task ID; every model call, tool invocation, retry, and validation pass logs against that ID. At end-of-day the cost rollup is by task ID, not by call. The first time a team does this honestly, the conclusion is usually the same: somewhere between thirty and sixty percent of model spend is on retries, validations, and re-prompts that the original architecture did not account for. Fix the architecture before fixing the prompt.
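A rollup by task ID can be sketched in a few lines. The event tuples, the `kind` labels, and the costs below are made up for illustration; the one assumption that matters is that every event not labelled `"primary"` counts as orchestration overhead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (task_id, kind, cost_usd); hypothetical events for two tasks.
Event = Tuple[str, str, float]

events: List[Event] = [
    ("t1", "primary", 0.010),
    ("t1", "retry", 0.010),
    ("t1", "validation", 0.004),
    ("t2", "primary", 0.020),
]


def rollup(events: List[Event]) -> Dict[str, Dict[str, float]]:
    """Sum total and overhead cost per task ID rather than per call."""
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"total": 0.0, "overhead": 0.0}
    )
    for task_id, kind, cost in events:
        totals[task_id]["total"] += cost
        if kind != "primary":
            totals[task_id]["overhead"] += cost
    return dict(totals)


def overhead_share(events: List[Event]) -> float:
    """Fraction of all spend that went on non-primary calls."""
    total = sum(cost for _, _, cost in events)
    overhead = sum(cost for _, kind, cost in events if kind != "primary")
    return overhead / total if total else 0.0


per_task = rollup(events)
print(per_task)
print(f"overhead share of spend: {overhead_share(events):.0%}")
```

Per-call logs would show four unremarkable line items here; the per-task view shows that more than half of what `t1` cost was spent on the retry and the validation pass.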
The amortisation question on hardware
The cost-per-1K-tasks numbers above for the local lane assume hardware you already own. The honest answer to "what does the local lane really cost" depends on amortisation. A capable Apple Silicon inference node, adequate for a small team's full agent traffic on small and mid-size models, costs in the low thousands of pounds, has a useful life of three to five years, and consumes hundreds of watts under load. Spread across a year of traffic at tens of thousands of tasks per day, the per-task hardware amortisation rounds to fractions of a penny. Below that scale, the amortisation case is weaker and the rented open-weights lane often wins.
The crossover point in our experience: somewhere around 5,000–10,000 agent tasks per day of routine work. Below it, rented open-weights endpoints are easier to justify. Above it, owning the hardware pays off within months and the marginal cost of additional traffic falls toward zero. The answer is workload-dependent and the only honest way to find it is to log the data and run the comparison.
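The comparison itself is simple arithmetic once the inputs are logged. The sketch below assumes a deliberately simplified cost model: straight-line amortisation over the hardware's life, a flat daily running cost, and a constant per-task price on each lane, all in one currency. The function name and every input value are hypothetical, and it ignores operations labour and idle capacity.

```python
def breakeven_tasks_per_day(
    hardware_cost: float,
    life_years: float,
    running_cost_per_day: float,
    rented_cost_per_task: float,
    owned_marginal_cost_per_task: float,
) -> float:
    """Daily task volume above which owning hardware beats renting.

    Owning has a fixed daily cost (amortisation plus running costs); each
    task run locally saves the difference between the rented per-task price
    and the local marginal per-task cost.
    """
    saving_per_task = rented_cost_per_task - owned_marginal_cost_per_task
    if saving_per_task <= 0:
        return float("inf")  # renting is never more expensive per task
    fixed_per_day = hardware_cost / (life_years * 365) + running_cost_per_day
    return fixed_per_day / saving_per_task


# Illustrative inputs only; substitute your own logged figures.
volume = breakeven_tasks_per_day(
    hardware_cost=3000.0,
    life_years=4.0,
    running_cost_per_day=2.0,
    rented_cost_per_task=0.0014,
    owned_marginal_cost_per_task=0.0003,
)
print(f"break-even at roughly {volume:,.0f} tasks per day")
```

The result is sensitive to the per-task saving, which is why the task mix matters: a workload dominated by cheap classification breaks even at a much higher volume than one dominated by multi-step research.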
The architectural shape that survives
The orchestration architecture I see surviving across teams that have done this for two years or more has a few invariants:
- A model router as a first-class component, with explicit lanes and explicit dispatch rules, observable on every call.
- Tool-first design — most agent capability lives in the tools the controller invokes, not in the controller's prompt.
- Per-task accounting at the task ID level, not the call level — so retries, validations, and tool round-trips roll up into the right bucket.
- Quality gates per lane — automated evals that gate which workloads can run on which lane, so capability erosion gets caught before it ships.
- An explicit fallback chain — if the local lane fails, where does the call go? Codify it; do not let it default to the most expensive lane silently.
None of these are exotic. All of them are disciplines that compound. The teams that get this right end up with an orchestration layer that gets cheaper to run as the open-weights ecosystem improves, because the router migrates workloads down the cost curve without touching application code. The teams that get it wrong end up locked to whichever endpoint they bolted on first, watching their bill scale with their traffic.
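The explicit fallback chain from the list above might be codified along these lines. The lane names, the `Handler` signature, and the `max_lane` cap are assumptions for illustration; the handlers here are stubs standing in for real model calls followed by a quality gate.

```python
from typing import Callable, Dict, List, Tuple

# A handler runs the task on one lane and reports (passed_gate, output).
Handler = Callable[[str], Tuple[bool, str]]

# Cheapest lane first; escalation order is explicit, never implicit.
FALLBACK_CHAIN: List[str] = [
    "local_7b",
    "open_weights_70b",
    "mid_tier_closed",
    "frontier_closed",
]


def run_with_fallback(
    task: str,
    handlers: Dict[str, Handler],
    chain: List[str],
    max_lane: str,
) -> Tuple[str, str, List[str]]:
    """Try lanes in order, stopping at `max_lane` so cost cannot escalate silently."""
    allowed = chain[: chain.index(max_lane) + 1]
    attempted: List[str] = []
    for lane in allowed:
        attempted.append(lane)
        passed, output = handlers[lane](task)
        if passed:
            return lane, output, attempted
    raise RuntimeError(f"all lanes up to {max_lane!r} failed: {attempted}")


# Stub handlers: the local lane fails its gate, the rented lane passes.
handlers: Dict[str, Handler] = {
    "local_7b": lambda task: (False, ""),
    "open_weights_70b": lambda task: (True, f"handled: {task}"),
    "mid_tier_closed": lambda task: (True, f"handled: {task}"),
    "frontier_closed": lambda task: (True, f"handled: {task}"),
}

lane, output, attempted = run_with_fallback(
    "classify ticket", handlers, FALLBACK_CHAIN, max_lane="mid_tier_closed"
)
print(lane, attempted)
```

Returning the list of attempted lanes alongside the result means each escalation can be logged against the task ID, so the cost of falling back shows up in the per-task rollup.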
The unit economics of agent orchestration are not subtle once you actually log the data. The cost gap between lanes is large; the dispatch decision dominates the prompt-engineering decision; the small-model-with-tools pattern beats the big-model-alone pattern on most production work; and the hidden cost of retries and validations is bigger than the visible cost of any individual call. None of this is contentious. It is just that very few teams have built the observability to see it.
If you are running an agent stack and you cannot tell me, in real time, the cost-per-task by task type, the lane each task landed in, and the retry-and-validation overhead, you are not running an agent stack — you are running a science project that occasionally produces useful output. The first piece of work, ahead of any model upgrade or any prompt revision, is the accounting. Everything else compounds from there.