The cleanest question I get from technically-fluent founders is also the most consequential: where should our model calls actually happen? Not which model. Not which prompt framework. Not which agent pattern. Just: who owns the silicon when the request lands. That single decision propagates through the rest of the system. It dictates unit economics, regulatory posture, latency, your supplier lock-in and the kind of failure modes you'll inherit for the lifetime of the product.
I run a hybrid setup — local inference on a small cluster for everything I can keep there, hyperscaler endpoints only when the workload genuinely requires frontier capability. Over the last eighteen months that split has hardened into a framework I now apply to every new project before I write a line of code. This piece is that framework, written down. It is opinionated and it is honest about where each side breaks.
Why the question matters more than people pretend
Most teams start with cloud AI because it is frictionless. You sign up, paste in a key, ship a feature. The bill is thirty pounds the first month and nobody panics. Six months later the bill is four thousand pounds, the model you depended on has been deprecated, your traffic has tripled in line with a use case you didn't predict, and the legal team is asking which jurisdiction the inference is happening in for the GDPR audit.
The reason this gets ignored at the start is that early-stage product teams treat the model API as if it were the database. It is not. A database holds your state; a model holds someone else's capability, leased to you on a per-call meter. When that capability changes, breaks, gets repriced or gets restricted in your region, you do not own anything that survives the change. You own the prompt. You do not own the answer.
Sovereign AI starts from the opposite premise: the inference layer is part of your product, not an external service. You may not run frontier-scale models on day one — almost nobody does — but you treat the inference position as strategic, not tactical. The question is not whether to be hybrid. The question is which workloads sit on which side and why.
The four-axis framework
I score every workload on four axes before deciding. Each one runs from 0 to 5. The axes pull in different directions, so it is the pattern across the four, rather than a single sum, that tells you where to land:
- Sensitivity. How comfortable are you sending this data to a third party that may train on it, log it, or be subject to subpoena in a jurisdiction you don't operate in? Client deal data, health information, contracts under negotiation — these score 5. A Wikipedia summary scores 0.
- Capability ceiling. Does the workload actually require frontier reasoning? Honest answer required. Most workloads — extraction, classification, structured output, summarisation, sentiment, draft-quality writing — do not. They score low. Genuine multi-step reasoning over novel domains scores 5.
- Volume. How many calls per day, scaling out to 12 months? At 100 calls/day you don't care about per-call cost. At 100,000 calls/day you care about almost nothing else.
- Latency budget. Is this a synchronous user-facing call (tight budget, low score) or an asynchronous batch (loose budget, high score)? Local inference avoids the network round trip; hyperscaler inference typically adds at least 200–600ms, with jitter on top.
If a workload scores high on sensitivity and volume but low on capability ceiling, it is a textbook local candidate. If it scores low on sensitivity, low on volume, and high on capability ceiling, the cloud is fine. The interesting ones are in the middle, and that is where most of the engineering time goes.
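The two textbook cases above can be written down as a small scoring helper. This is a minimal sketch, not the scoring tool I actually run: the names `WorkloadScore` and `suggest_lane` are hypothetical, and the thresholds (4 and above counts as "high", 1 and below as "low") are assumptions picked for illustration. Latency budget is recorded but not used in the decision, because the two rules stated above do not depend on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadScore:
    """Scores on the four axes, each from 0 to 5."""
    sensitivity: int
    capability: int
    volume: int
    latency_budget: int  # assumed direction: 0 = tight synchronous, 5 = relaxed batch

    def __post_init__(self):
        # Reject anything outside the 0 to 5 scale early.
        for name in ("sensitivity", "capability", "volume", "latency_budget"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be between 0 and 5, got {value}")


def suggest_lane(score, high=4, low=1):
    """Return 'local', 'cloud' or 'review' using illustrative thresholds."""
    # High sensitivity and volume, low capability ceiling: textbook local candidate.
    if score.sensitivity >= high and score.volume >= high and score.capability <= low:
        return "local"
    # Low sensitivity and volume, high capability ceiling: the cloud is fine.
    if score.sensitivity <= low and score.volume <= low and score.capability >= high:
        return "cloud"
    # Everything in the middle needs a human decision.
    return "review"
```

Anything that comes back as `review` is one of the interesting middle cases, which is exactly where a lookup table stops being useful and engineering judgement takes over.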
The local inference layer in practice
In our setup the local inference cluster runs on Apple Silicon — a single M-class machine doing the bulk of the work, with overflow into a small CPU-based fallback. The model lineup is conservative: a small instruction-tuned model from the Llama family for classification and extraction, a mid-size model from the Mistral family for summarisation and structured output, and a coding-tuned model for anything that touches our automation engine. We use GGUF quantisations — typically Q4_K_M or Q5_K_M depending on how aggressive we want to be on memory.
The thing to internalise is that most production AI work in 2026 does not need a frontier model. Sentiment analysis, intent classification, named-entity extraction, structured-JSON output for downstream automation, simple summarisation: a 7B-parameter model running locally clears these tasks at a quality that is, for practical purposes, indistinguishable from the cloud, at a marginal cost close to zero once the hardware is paid for. The unit economics flip almost immediately on volume.
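To make the volume argument concrete, here is a rough break-even calculation. It is a sketch under stated assumptions: the function names are hypothetical, and every figure you feed in (per-call price, hardware cost, running cost) is yours to supply. None of the numbers come from a real price list.

```python
def monthly_cloud_cost(calls_per_day, cost_per_call, days=30):
    """Metered spend for one month at a flat per-call price."""
    return calls_per_day * cost_per_call * days


def breakeven_months(hardware_cost, calls_per_day, cost_per_call,
                     local_monthly_running_cost=0.0):
    """Months until local hardware pays for itself, or None if it never does."""
    monthly_saving = (
        monthly_cloud_cost(calls_per_day, cost_per_call) - local_monthly_running_cost
    )
    if monthly_saving <= 0:
        # Cloud is cheaper than just keeping the local box running.
        return None
    return hardware_cost / monthly_saving
```

With a hypothetical price of 0.001 per call and a 3,000 hardware outlay, 100,000 calls a day pays the machine off in about a month, while 100 calls a day against a 50-a-month running cost never does. The shape matters more than the exact figures.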
What you give up is consistency at the edges. Local models hallucinate more at long context. They are weaker at chain-of-thought. They are noticeably worse at tasks that need world knowledge well outside their training window. You design around that by routing those workloads explicitly to the cloud, and you build that routing logic into a router rather than into your prompts.
Where the cloud still earns its keep
I am not a sovereign-AI ideologue. There are workloads where the cloud is correct and arguing otherwise is just performance.
- Frontier reasoning over genuinely novel material. If you need a model to read a 200-page contract, infer the unstated obligations, and produce a redline — you are paying for capability that doesn't yet exist in any open-weights model. Pay it.
- Long-context summarisation where the document exceeds what you can fit in your local model's effective working context. There are ways around this with chunking and retrieval, but a single 200K-token call on a frontier model is sometimes just cheaper in engineering time.
- Multimodal work at high quality — image, audio, video understanding. The open-weights ecosystem is closing the gap fast on individual modalities (Whisper-class transcription is already commodity locally), but cross-modal frontier work is still a hyperscaler advantage.
- Genuinely bursty traffic where capital expenditure on hardware to handle a peak you only see twice a year would be ruinous. Cloud absorbs the spike for the price of a meter-reading.
The trick is to build the boundary cleanly so the cloud can be the cloud and the local layer can be the local layer, neither pretending to be the other.
The model router as a first-class citizen
The single most important architectural component in a sovereign-AI setup is the model router: a small piece of infrastructure that takes a request and decides where it goes. Ours is opinionated. It looks at the task type (classification, summarisation, generation, reasoning, coding), expected input length, sensitivity flag and an explicit cost budget, then dispatches the call to one of: local quantised model, regional open-weights endpoint we control, hyperscaler endpoint with enterprise data terms, or hyperscaler endpoint without data terms (only used when sensitivity is zero and the tradeoff genuinely makes sense).
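A dispatch decision of that shape might look like the sketch below. It is not our production router. The context limits, the budget floor, the rule that only `reasoning` tasks count as frontier work, and the treatment of sensitivity 4 and above as "must stay on infrastructure we control" are all assumptions made for illustration, and `Lane`, `Request` and `choose_lane` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    LOCAL = "local quantised model"
    REGIONAL = "regional open-weights endpoint"
    CLOUD_WITH_TERMS = "hyperscaler endpoint with enterprise data terms"
    CLOUD_NO_TERMS = "hyperscaler endpoint without data terms"


@dataclass(frozen=True)
class Request:
    task_type: str      # classification, summarisation, generation, reasoning, coding
    input_tokens: int   # expected input length
    sensitivity: int    # 0 to 5
    cost_budget: float  # maximum spend per call

# Hypothetical limits; tune these against your own models and pricing.
LOCAL_CONTEXT_LIMIT = 8_000
REGIONAL_CONTEXT_LIMIT = 32_000
MIN_FRONTIER_BUDGET = 0.01


def choose_lane(req):
    """Pick a lane from task type, input length, sensitivity and cost budget."""
    needs_frontier = req.task_type == "reasoning"  # assumed proxy for frontier work
    if req.sensitivity >= 4:
        # Sensitive data stays on infrastructure we control; oversized inputs
        # are the caller's problem to chunk.
        return Lane.LOCAL if req.input_tokens <= LOCAL_CONTEXT_LIMIT else Lane.REGIONAL
    if not needs_frontier and req.input_tokens <= LOCAL_CONTEXT_LIMIT:
        return Lane.LOCAL
    if not needs_frontier and req.input_tokens <= REGIONAL_CONTEXT_LIMIT:
        return Lane.REGIONAL
    if req.cost_budget < MIN_FRONTIER_BUDGET:
        # Cannot afford a frontier call, so take the best lane we control.
        return Lane.REGIONAL
    return Lane.CLOUD_NO_TERMS if req.sensitivity == 0 else Lane.CLOUD_WITH_TERMS
```

Keeping the policy in one pure function means it can be unit-tested and changed without touching the code that actually talks to the endpoints.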
The router has a fallback chain. If the local cluster is saturated, it spills into the regional endpoint. If that fails, it falls through to a paid endpoint. The whole thing logs cost-per-call and latency-per-call into a time-series database, and we run dashboards on the ratio. This matters because the assumptions you made when you chose a workload's lane will change as the model landscape moves. A workload that needed cloud in 2024 might fit a 7B local model in 2026. The router lets you migrate without touching application code.
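The fallback chain and the per-call logging can be sketched in a few lines. Assumptions, stated loudly: `LaneUnavailable` and `call_with_fallback` are hypothetical names, each lane is represented by a plain function, the per-call costs are made up, and a Python list stands in for the time-series database.

```python
import time


class LaneUnavailable(Exception):
    """Raised by a lane handler when it is saturated or failing."""


def call_with_fallback(prompt, chain, log):
    """Try each (name, handler, cost_per_call) in order and log every attempt."""
    for name, handler, cost_per_call in chain:
        started = time.perf_counter()
        try:
            result = handler(prompt)
        except LaneUnavailable:
            # Record the failed attempt, then spill to the next lane.
            log.append({"lane": name, "ok": False, "cost": 0.0,
                        "latency_s": time.perf_counter() - started})
            continue
        log.append({"lane": name, "ok": True, "cost": cost_per_call,
                    "latency_s": time.perf_counter() - started})
        return result
    raise RuntimeError("every lane in the fallback chain failed")


# Stub handlers standing in for real endpoints.
def saturated_local(prompt):
    raise LaneUnavailable("local cluster saturated")


def regional(prompt):
    return f"regional:{prompt}"


records = []  # stand-in for the time-series store
chain = [("local", saturated_local, 0.0), ("regional", regional, 0.002)]
answer = call_with_fallback("classify this", chain, records)
```

Because the application only ever calls `call_with_fallback`, moving a workload to a different lane is a change to `chain`, not to application code.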
If you're building this from scratch, my advice is to start with the router, even if version one routes everything to the same endpoint. The architectural shape you set on day one tends to survive the next five years.
Regulatory and data-sovereignty reality check
For UK and EU operators, the regulatory pressure on cloud AI is moving in one direction: tightening. Even when the data terms are commercially favourable, you are still in a position where a model provider's policy change, a sub-processor change or a jurisdiction-of-record change can land in your inbox with thirty days' notice. Local inference removes that whole class of risk. The data never leaves the boundary.
I work on commodity-trade verification, integrated eco-development project research, and content for HNWI clients. In all three, the discretion question is non-negotiable: certain documents simply cannot leave the room. That alone makes the local layer non-negotiable. Whether or not it's cheaper, it has to exist for those workloads. Once it exists, the unit economics make it the default for everything else that fits.
Common failure modes in sovereign AI rollouts
Three patterns I've watched teams blow up on:
- Overestimating model capability. A 13B local model is not a frontier model. If you build a workflow that assumes frontier reasoning and quietly downgrade to local, you will ship hallucinations into production. The fix is to write evals against each lane individually and gate workloads on those results, not on vibes.
- Treating quantisation as free. Q4 quantisation is fine for many use cases, but in my experience it can noticeably degrade structured-output reliability. If you depend on JSON schema adherence, test every quant level. Sometimes the right answer is Q5 with smaller context.
- Underbuilding the observability. If you cannot tell me, in real time, which workloads went where, what they cost, and whether they succeeded, you do not have a sovereign-AI stack — you have a science project. Time-series logging on every call is the entry-level requirement.
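The first two fixes, per-lane evals and testing every quant level, reduce to the same mechanical check: collect raw outputs from each lane and gate on a measured pass rate. A minimal sketch follows, assuming "adherence" means the output parses as a JSON object containing a set of required keys. The function names and the 0.99 default threshold are hypothetical, and a real eval would validate against a full schema.

```python
import json


def is_valid_output(raw, required_keys):
    """True if raw parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(required_keys).issubset(parsed)


def adherence_rate(outputs, required_keys):
    """Fraction of raw model outputs that pass the structural check."""
    if not outputs:
        raise ValueError("need at least one output to score")
    valid = sum(is_valid_output(raw, required_keys) for raw in outputs)
    return valid / len(outputs)


def gate(outputs_by_lane, required_keys, threshold=0.99):
    """Map each lane or quant level to whether it clears the threshold."""
    return {
        lane: adherence_rate(outputs, required_keys) >= threshold
        for lane, outputs in outputs_by_lane.items()
    }
```

Feed it the same prompts run through each quant level and each lane, and only promote a workload to a lane whose entry comes back `True`.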
What this looks like when it works
In our environment, roughly 80% of inference calls hit the local layer. The 20% that go to the cloud are either deliberately frontier work or sensitivity-zero overflow. Per-month cost on the cloud side has been flat for nine months while the volume of AI-touched workflows has grown by an order of magnitude. The local cluster is paid off. The model router has saved more in deprecated-endpoint migrations than it cost to build.
None of this is exotic. The pattern is not novel. What is novel — and what most teams fail at — is the discipline to actually build the boundary rather than letting cloud convenience eat the architecture by default. The framework above is the version of that discipline I run by.
Sovereign AI is not a religion. It is a unit-economics decision wrapped in a regulatory decision wrapped in an architectural one. The teams that get it right treat the inference layer as part of their product surface — first-class, observable, with explicit lanes for explicit workloads. The teams that get it wrong wake up tied to a vendor whose pricing they don't control, whose roadmap they can't see, and whose deprecation cycles they can't predict.
If you build the model router on day one, the rest is mechanical. If you don't, you'll rebuild it under load eighteen months later, while in production. I've watched both. The first costs less, and the resulting system is the kind of thing that compounds quietly across years rather than degrading visibly across months.