The pipeline, end to end
The local voice assistant is a pipeline of six stages.
- Wake word detection. A small model running continuously on a low-power microphone array, listening for a specific phrase. Only when the wake word fires does the rest of the pipeline activate.
- Audio capture. A short window of audio after the wake word, with voice-activity detection to determine when the speaker has finished.
- Speech-to-text. A Whisper-class model on the workhorse, transcribing the captured audio faster than real time. The transcript is the input to the rest of the pipeline.
- Intent and routing. A small local model classifies the transcript: is this a query, a command, a conversation, a request for music? The classification routes the request to the appropriate handler.
- Response generation. The handler executes — querying a database, running a workflow, calling a tool, generating a response with a language model.
- Voice synthesis. The response is rendered as audio by a local synthesis model, with a chosen voice profile, and played back through the appropriate speaker.
The entire round trip happens on hardware in the house. Nothing about the audio, the transcript, or the response touches a third party.
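The six stages above can be sketched as a chain of plain function calls. This is a minimal illustration, not the production code: `Stages`, `on_wake_word`, and every field name are hypothetical, and each callable stands in for a real local component.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stages:
    """Hypothetical stage callables; each stands in for a real local component."""
    capture: Callable[[], bytes]               # audio capture with VAD endpointing
    transcribe: Callable[[bytes], str]         # speech-to-text
    route: Callable[[str], str]                # transcript -> handler name
    handlers: dict[str, Callable[[str], str]]  # handler name -> response text
    synthesise: Callable[[str], bytes]         # response text -> audio


def on_wake_word(stages: Stages) -> bytes:
    """Run stages two to six once. Call this only after the wake word fires."""
    audio = stages.capture()
    transcript = stages.transcribe(audio)
    intent = stages.route(transcript)
    response = stages.handlers[intent](transcript)
    return stages.synthesise(response)
```

Nothing in this chain runs until the wake-word stage calls `on_wake_word`, which mirrors the rule that the rest of the pipeline stays idle by default.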
Wake word: the part that has to be tiny
Wake word detection runs continuously, which means it has to be small, fast, and power-efficient. The Apple Silicon workhorse is overkill for this; the wake-word component runs on a small dedicated edge device with a microphone array and enough compute to run a few-megabyte model.
The model is trained on a single phrase. It is robust to background noise, accents, and varying distance from the microphone. False positives are rare; false negatives are slightly more common but recoverable by repeating the phrase.
The architectural commitment is that the rest of the pipeline does not run unless the wake word fires. The microphone is not streaming audio anywhere by default. This is the meaningful privacy difference between a local voice assistant and a cloud one: the device is genuinely listening only for the wake word, and everything else is opt-in by activation.
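One way to picture the gate is as a generator that discards every frame until the detector fires. This is a sketch under simplifying assumptions: `gated_frames` and `wake_fired` are hypothetical names, the detector is treated as a per-frame predicate, and the capture window is a fixed number of frames rather than being ended by voice-activity detection.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def gated_frames(
    frames: Iterable[bytes],
    wake_fired: Callable[[bytes], bool],
    window_frames: int,
) -> Iterator[list[bytes]]:
    """Yield one capture window per wake-word activation.

    Frames outside a capture window are discarded, never buffered or
    forwarded, so nothing downstream sees audio by default.
    """
    it = iter(frames)
    for frame in it:
        if not wake_fired(frame):
            continue  # idle: drop the frame
        # Activated: take the next window_frames frames (fewer if audio ends).
        yield list(islice(it, window_frames))
```

Because the window is taken from the same iterator, frames that arrive after a window closes are dropped again until the next activation.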
Speech-to-text on a workstation
Transcription was the hardest part of the stack until recently and is now easy. The smaller Whisper-class models run faster than real time on Apple Silicon, with quality good enough for household commands across common speaking volumes and accents.
We host the transcription model behind a small HTTP service on the workhorse. Every other component that needs transcription — the voice assistant, voice notes, podcast processing, meeting transcription — calls the same service. Centralising means we maintain one model, one configuration, one set of language packs, and we can upgrade in one place.
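A shared transcription service of this kind could look roughly like the following, using only the Python standard library. The `/transcribe` path, port 8765, the JSON response shape, and the `transcribe` placeholder are all assumptions made for illustration; the real service and model call are not described here.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def transcribe(audio: bytes) -> str:
    """Placeholder for the real model call (hypothetical)."""
    return f"<{len(audio)} bytes of audio>"


class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        if self.path != "/transcribe":
            self.send_error(404)
            return
        # Read exactly the number of bytes the client declared.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": transcribe(audio)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the sketch quiet; a real service would log requests


def make_server(host: str = "127.0.0.1", port: int = 8765) -> ThreadingHTTPServer:
    """Build the server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), TranscribeHandler)
```

Every consumer (assistant, voice notes, podcast processing) would then POST raw audio to the same endpoint, so swapping the model means changing only `transcribe`.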
Transcription quality determines whether the assistant feels intelligent or stupid: a misheard command is the most common failure mode in voice assistants. The local stack has no inherent quality disadvantage versus the cloud, and it has the advantage that we can choose model size and configuration based on our actual usage rather than a provider's cost optimisation.
Intent routing without an LLM (mostly)
Most household voice commands are not free-form. “Lights off in the kitchen.” “What is the weather tomorrow?” “Set a timer for fifteen minutes.” A small classifier with a few dozen training examples covers the vast majority of these.
For commands the classifier recognises with high confidence, the routing is direct: the command is parsed, the parameters are extracted, the appropriate tool is called, the response is templated. No language model is involved on the response path. Latency is low and the behaviour is deterministic.
For commands the classifier does not recognise, or for genuinely free-form questions, the request is routed to a local language model with the household's tool surface available. The model can call tools to retrieve information, run workflows, or generate a response. This path is slower and more variable, and it is reserved for the cases where the deterministic path does not apply.
The split between deterministic routing and language-model fallback is the architectural move that makes the local voice assistant feel fast and responsive. The cheap path is the common path; the expensive path is the fallback.
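The split can be sketched as a single routing function with a confidence threshold. Everything named here is hypothetical: `Prediction`, `route`, the 0.85 threshold, and the shapes of `tools` and `templates` are illustrative assumptions, and `classify` stands in for the small local classifier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Prediction:
    intent: str
    confidence: float
    params: dict[str, str] = field(default_factory=dict)


def route(
    transcript: str,
    classify: Callable[[str], Prediction],
    tools: dict[str, Callable[..., str]],
    templates: dict[str, str],
    llm_fallback: Callable[[str], str],
    threshold: float = 0.85,  # assumed value; tune against real traffic
) -> str:
    """Use the deterministic path when confident, else the language model."""
    pred = classify(transcript)
    if pred.confidence >= threshold and pred.intent in tools:
        # Cheap path: call the tool and fill a fixed template. No LLM involved.
        result = tools[pred.intent](**pred.params)
        return templates[pred.intent].format(result=result, **pred.params)
    # Expensive path: unrecognised or free-form requests.
    return llm_fallback(transcript)
```

Raising the threshold sends more requests to the slower fallback; lowering it risks executing a misclassified command, so the value is a trade-off rather than a constant.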
Voice synthesis that does not sound rented
Voice synthesis is the component that has improved most over the last twelve months. Open voice-cloning frameworks running on Apple Silicon can render speech in a chosen voice profile from a few minutes of reference audio, with quality that is pleasant to listen to for extended periods.
For the household assistant we use a neutral synthesis voice that is consistent across sessions. For long-form content — listening to articles, hearing a daily briefing read aloud — we use a cloned voice, which makes the experience meaningfully better. The technical lift for voice cloning is a one-time training run; the ongoing cost is approximately zero.
Synthesis latency on the workhorse is a few seconds per response, which is acceptable for short replies but a noticeable wait for long ones. The architecture handles this by streaming audio chunk by chunk: the first words start playing while the later words are still being synthesised. By the time the listener hears the response begin, the rest is already being synthesised and queued behind it.
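Chunked streaming can be sketched with a producer thread and a bounded queue. This is an illustration under assumptions: `stream_speech` and `split_sentences` are hypothetical names, chunks are split at sentence boundaries with a simple regular expression, and `synthesise` stands in for the real synthesis model.

```python
import queue
import re
import threading
from typing import Callable, Iterator


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break on whitespace after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stream_speech(
    text: str,
    synthesise: Callable[[str], bytes],
    max_queued: int = 4,
) -> Iterator[bytes]:
    """Yield audio chunks in order while later chunks are still rendering."""
    chunks: queue.Queue = queue.Queue(maxsize=max_queued)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        try:
            for sentence in split_sentences(text):
                chunks.put(synthesise(sentence))
        finally:
            chunks.put(done)  # always unblock the consumer, even on error

    threading.Thread(target=producer, daemon=True).start()
    while (item := chunks.get()) is not done:
        yield item
```

The caller plays each chunk as it arrives, so perceived latency is roughly the synthesis time of the first sentence rather than of the whole response. The sketch assumes the caller consumes the stream to the end.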
The takeaway
An end-to-end local voice assistant is a serious project but no longer an exotic one. The components are mature, the quality is good, and the privacy benefit is real rather than rhetorical. The household conversations that matter (what the family is doing, what is happening at home, what the assistant is being asked to do) never leave the building.
If you have a Big Tech voice assistant in your house and you have ever felt uncomfortable about what it might be hearing, the local replacement is more buildable than it looks. The architecture above is the version that works in production, and the components are open source and getting better quickly.