TL;DRTwo years ago, running a local large language model on a desktop machine was a curiosity. Today, with the right hardware and a careful model lineup, it is a viable production substrate for a meaningful slice of any AI wo…

Two years ago, running a local large language model on a desktop machine was a curiosity. Today, with the right hardware and a careful model lineup, it is a viable production substrate for a meaningful slice of any AI workload. I run mine on Apple Silicon — specifically an M-class machine with shared unified memory acting as the inference backbone for an entire automation stack. It is not a toy. It serves real traffic. The cost-per-call is, for all practical purposes, zero.

This piece is the version of the setup guide I would have liked to read before I started: opinionated, specific where specificity matters, honest about the limits, and rooted in 2026 reality rather than the breathless 2024 reality where running a 70B model felt like a magic trick. The magic trick is over. What's left is engineering.

Why Apple Silicon is genuinely good at this

The architectural fact that makes Apple Silicon work for local LLM inference is unified memory. On a traditional GPU rig the model weights live in VRAM, separate from system RAM, and you are bottlenecked on VRAM size and PCIe bandwidth. On Apple Silicon the GPU and CPU share a single pool of high-bandwidth memory. A 64GB machine has 64GB available to the inference runtime. A 96GB machine has 96GB. There is no copy across a bus.

Combined with the Neural Engine and the Metal-accelerated math primitives, this means a mid-range M-class chip can hold and serve models that would require a four-figure GPU on the equivalent x86 path. For a single-developer or small-team setup running quantised 7B–13B models all day, the cost-performance curve is genuinely flattering. For 70B models, you need to step up the spec — and at that point the comparison gets more nuanced — but the substrate scales further than most people assume.

The trade-off is throughput at scale. A datacentre GPU will out-tokens-per-second a Mac at high concurrency. For a single inference at a time, or a handful of concurrent users, the gap is much smaller than the marketing suggests. Most teams' local workloads are exactly that: handful-of-concurrent.

Hardware: the sizing question

The honest framing of hardware sizing is: memory dictates which models you can serve, memory bandwidth dictates how fast they run. Compute throughput matters but is rarely the binding constraint at the scales most teams operate at.

My own rule of thumb for sizing:

  • 16GB — fine for a single 7B model at Q4 quantisation, with room for a mid-size embedding model and some context. Comfortable for small workloads.
  • 32GB — sweet spot for development and small production. Run a 13B model at Q4 plus an embedding model plus your application stack. Most people landing in this bracket are happy.
  • 64GB — comfortably hosts a 13B at Q5 or two 7Bs simultaneously, and gives you the headroom to load a 70B at aggressive quantisation if you need to test workflows. This is roughly where my main inference node lives.
  • 96GB+ — opens up 70B-class models at Q4 or Q5 with real working context. If your use case genuinely needs that capability ceiling, the spec is justified. Most don't.

Memory bandwidth varies between Apple Silicon generations and tiers — the higher tiers (Pro, Max, Ultra) ship more bandwidth than the base chip. For sustained inference, this is the single number on the spec sheet that matters most after RAM.

The model lineup we actually run

Our production lineup, as of right now, breaks down roughly like this:

  1. A small Llama-family instruction-tuned model at Q4 — used for classification, intent recognition, sentiment scoring, and short-form structured output. Fits comfortably into 8GB of working memory. Tokens-per-second comfortably above 50 on our setup; well into real-time territory.
  2. A mid-size Mistral-family model at Q5 — the workhorse for summarisation, longer-form structured generation, JSON-with-schema output, and most agent reasoning steps. Fits into ~14GB. Tokens-per-second around 30, which is fine for synchronous use up to a paragraph or two and ideal for asynchronous workloads.
  3. A coding-tuned model at Q4 — for our automation engine when it generates code-like artifacts (structured queries, scripts, configuration). Lighter capability than a frontier coder but adequate for templated tasks.
  4. An embedding model — small, lightweight, fast, used for retrieval against our knowledge graph. The embedding model is the unsung hero of any local stack: cheap, fast, and constantly running.
  5. An audio transcription model from the Whisper family — for voice notes, podcast prep, and any inbound voice content from the messaging stack. Faster-than-realtime transcription on Apple Silicon.

Notice what's not on this list: a 70B-class model. We deliberately don't run one as part of the core production lineup. When we need that capability, we route to a hosted open-weights endpoint or to a frontier API. The decision is: keep the local layer fast, lean and specialised; pay for capability on the edges.

Quantisation: the actual tradeoff

Quantisation is the technique of representing model weights at lower precision to save memory and increase throughput. GGUF — the file format we standardise on — supports a series of named quantisation levels. The ones that matter in practice:

  • Q4_K_M — 4-bit, mixed precision. Default we use for high-volume utility models. Quality drop versus full precision is small but visible at the edges (long-context retrieval, structured output adherence).
  • Q5_K_M — 5-bit, mixed precision. The quality-cost sweet spot for a workhorse model. Negligible quality drop on most tasks. Marginally slower and bigger than Q4.
  • Q6_K — 6-bit. Closer-to-full quality, larger memory footprint. We use this when a workflow has demonstrated structured-output instability at Q4/Q5.
  • Q8_0 — 8-bit. Almost indistinguishable from full precision but with twice the memory cost of Q4. Reserved for evaluation and edge cases where every percent of quality matters.

The most common mistake I see teams make is to default to Q4 because it's smaller and faster, then quietly accept the quality regression because they never measured the alternative. Run your evals against Q4, Q5 and Q6 of the same model. Pick based on the data, not the file size.

The runtime layer

For the runtime, the open-source ecosystem has converged on a small set of well-engineered options. We run a Metal-accelerated GGUF runtime that exposes a standard chat-completions HTTP API on a local port. Everything else in our stack — the orchestration layer, the agent runtime, the model router — talks to that API in the same shape it talks to any cloud endpoint. This is the architectural move that pays the most dividends: treat the local layer as just another endpoint. Application code should not care whether the model running behind POST /v1/chat/completions is on the desk or in a datacentre.

The runtime takes care of context-window management, KV-cache reuse for streaming, batched inference if your workload supports it, and graceful degradation when memory pressure hits. The thing it does not do well, on most stacks I've used, is multi-tenancy at the runtime layer. If you need to serve multiple distinct users with isolation, build that at the application layer above the runtime — don't rely on the runtime to enforce it.

Throughput numbers we actually see

To make this concrete, here is the rough throughput envelope on our main inference node — a single high-spec M-class machine with abundant RAM:

  • Small instruction-tuned model at Q4 — sustained ~60 tokens/second single-stream, comfortably hits real-time chat throughput.
  • Mid-size Mistral-family at Q5 — sustained ~30 tokens/second single-stream, still real-time for most chat use cases, generous for batch.
  • Coding-tuned at Q4 — sustained ~40 tokens/second single-stream.
  • Whisper-family transcription — comfortably 5–10× realtime on the same hardware for medium models.
  • Embedding throughput — thousands of small chunks per second.

These are not benchmark-suite numbers. They are what we observe in production with everything else on the box doing its day job. Your numbers will differ; the shape will not.

What still goes to the cloud

Even with this setup, a meaningful slice of work still goes to a hosted endpoint. The pattern: anything requiring genuine frontier reasoning, anything multimodal at high quality, anything that needs context windows beyond what we can comfortably batch locally. The model router (covered in the sovereign-AI piece) handles the dispatch, and we log cost-and-latency on both sides. The split has held at roughly 80/20 for nine months.

This is the right shape. The local layer is the default; the cloud is the specialist. Don't invert that without a reason.

Operational gotchas you'll hit

  1. Thermal throttling under sustained load. Apple Silicon will throttle if you pin it at 100% for hours. For batch workloads, build in cooldown windows.
  2. Memory pressure surprises. Loading a second large model on top of an already-loaded one without unloading the first will swap, hard. Be explicit about model lifecycle in the runtime.
  3. Context window != working context. Models advertised at 32K context often degrade noticeably above 8K–16K when quantised. Test at your actual context lengths.
  4. System updates can break runtimes. A major OS version change has, in my experience, broken Metal acceleration twice. Pin your runtime versions and don't update lightly.

The current state of local LLM inference on Apple Silicon is, frankly, better than most people think. Not as a hobbyist novelty — as a serious production substrate for the slice of your AI workload that fits the capability envelope. The hardware-cost-per-capability has crossed a threshold that makes it the obvious default for anything sensitive, high-volume or low-budget. The cloud still earns its keep at the frontier. Both lanes coexist; the architecture rewards taking both seriously.

If you've been watching this space and waiting for the right moment to build the local layer, the moment is now and the moment was probably also six months ago. The longer you wait, the more your application code calcifies around someone else's endpoint.

Build the local layer Sizing the right hardware, picking the model lineup, wiring the runtime into your existing stack — book a consultation and we'll design the local inference layer for your workloads. Book a sovereign-infrastructure consultation →