Home Assistant + local AI: replacing Big Tech voice assistants in a way that doesn't degrade

The voice assistants from the major Big Tech vendors got worse, not better, between 2022 and 2025. They became less capable, more advertising-laden, and more aggressively cloud-dependent. Anyone running a serious smart-home setup either worked around them or switched away. We switched away. The replacement, built on open-source home automation and a local model stack, is faster, smarter, and never asks me whether I want to upgrade my music subscription mid-sentence.

This is a guide to that replacement. It is not theoretical. It runs in our house, every day, controlling lighting, climate, security, media, and the small daily intelligence layer of "is the laundry done", "what's the weather doing later", and "remind me at 6 to put the dinner on". It does so without sending a single phrase out to a cloud transcription service. The privacy story is real, and the experience is better.

What "degrade" actually means here

The trap with replacing a polished commercial voice assistant is that early-stage open-source replacements tend to feel like a downgrade in three specific ways: they get wake-word detection wrong, their speech-to-text is slow, and their response generation is robotic. Any of those breaks the trust loop. Once a household member tries the new system twice and it doesn't work, they will not try a third time.

The brief I set was: equal or better wake-word reliability, sub-1.5s end-to-end response latency for common queries, and natural-sounding speech output. If we couldn't hit that, we'd stay on the commercial stack and be unhappy. Hitting it required treating the voice loop as an end-to-end engineering problem, not a sum of independently-chosen components.

The architecture, in five components

The home-automation engine — the open-source platform that owns device state, automations, and the canonical model of "what is true in this house". This is the brain stem. Everything else hangs off it.
The wake-word + voice front-end — small voice-PE devices distributed around the house. These do wake-word detection on-device and stream audio only after the wake word fires. Off-the-shelf hardware is fine; we use small open-firmware microphone arrays.
The transcription layer — a Whisper-family model running locally, configured for low-latency streaming. Sub-second transcription for short queries.
The intent and orchestration layer — a small local language model classifies the user's intent, extracts parameters, and dispatches the right action. For complex queries it composes a longer-form response.
The text-to-speech layer — local neural TTS that produces speech indistinguishable from cloud incumbents at conversational length. We run a multi-voice setup so different rooms can have different voices.

None of this is exotic. All of it is open-source. The work is in the integration and in the latency budget — every layer has to do its job inside its slice of the budget or the conversational feel breaks.

The latency budget

For a voice assistant to feel conversational rather than transactional, the perceived gap between the end of the user's utterance and the start of the assistant's response should be under one second for simple queries. That sounds aggressive until you decompose it:

Wake-word detection — already fired, on-device, ~0ms perceived.
End-of-speech detection — requires a small grace period, typically 300–500ms.
Final transcription — locally, on a streaming Whisper model, 100–250ms after end-of-speech.
Intent classification + action dispatch — small local model, 100–200ms.
TTS first-syllable latency — 150–300ms with a fast neural TTS model.

Summing those ranges gives roughly 650ms to 1.25s end-to-end for an in-budget query. The fast end sits under the one-second conversational threshold, and the slow end still clears the sub-1.5s brief. The trick is that nothing leaves the box, so there is no internet round trip, no jitter from a remote API, no rate-limit risk. Once you've got it working, it feels qualitatively different from cloud assistants.
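To make the decomposition concrete, here is a minimal sketch that adds up the per-stage figures from the list above and flags any stage that overruns its slice. The stage names and the `over_budget` helper are my own labels for illustration, not part of any tool. The listed ranges sum to 650 ms at best and 1,250 ms at worst; any glue overhead between stages comes on top of that.

```python
# Stage ranges in milliseconds, copied from the budget list above.
# The dictionary keys are illustrative labels, not names from any library.
BUDGET_MS = {
    "wake_word": (0, 0),
    "end_of_speech": (300, 500),
    "transcription": (100, 250),
    "intent_and_dispatch": (100, 200),
    "tts_first_syllable": (150, 300),
}


def budget_total(budget):
    """Return (best_case_ms, worst_case_ms) by summing each stage's range."""
    best = sum(low for low, _ in budget.values())
    worst = sum(high for _, high in budget.values())
    return best, worst


def over_budget(measured_ms, budget):
    """List the stages whose measured latency exceeds their upper bound."""
    return [stage for stage, ms in measured_ms.items() if ms > budget[stage][1]]


best, worst = budget_total(BUDGET_MS)
print(f"best case {best} ms, worst case {worst} ms")
```

Feeding real per-stage timings into `over_budget` tells you which layer to tune first, rather than guessing from the end-to-end number.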

The intent layer is where the art lives

The traditional smart-home approach to voice intents is a dictionary of phrases mapped to actions: "turn off the kitchen light" maps to a specific automation. This breaks the moment a real human speaks naturally. "Kill the kitchen lights", "sort out the kitchen lighting", "can you do the kitchen for me" — none of these match a literal phrase. Big Tech voice assistants solved this with cloud-scale natural-language understanding. We solve it locally with a small LLM doing intent classification.

The pattern is: the local model receives the raw transcription, a list of available intents (turn-off, turn-on, set-temperature, query-state, run-scene, etc.) and a list of available entities (rooms, devices, scenes, etc.). It returns a structured JSON describing intent and parameters. The home-automation engine takes that JSON and dispatches.
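A minimal sketch of that pattern, under stated assumptions: the JSON shape (`intent`, `entity`, `params`), the prompt wording, and the function names are all hypothetical, and `model` stands in for whatever callable wraps your local LLM. The part worth copying is the validation step, since a small model will occasionally return something that is not valid JSON or names an intent that does not exist.

```python
import json

# Example lists only; in practice these come from the engine's live state.
INTENTS = ["turn-on", "turn-off", "set-temperature", "query-state", "run-scene"]
ENTITIES = ["kitchen lights", "living room lights", "hallway thermostat"]


def build_prompt(transcript, intents, entities):
    """Compose the classification prompt from the transcript and the lists."""
    return (
        "Classify the request. Reply with JSON only, shaped as "
        '{"intent": ..., "entity": ..., "params": {...}}.\n'
        f"Allowed intents: {json.dumps(intents)}\n"
        f"Allowed entities: {json.dumps(entities)}\n"
        f"Request: {transcript}"
    )


def parse_intent(raw_reply, intents, entities):
    """Validate the model's reply; return a clean dict, or None if unusable."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("intent") not in intents or data.get("entity") not in entities:
        return None
    params = data.get("params", {})
    if not isinstance(params, dict):
        return None
    return {"intent": data["intent"], "entity": data["entity"], "params": params}


def classify(transcript, model, intents=INTENTS, entities=ENTITIES):
    """`model` is any callable str -> str wrapping the local LLM (assumed)."""
    reply = model(build_prompt(transcript, intents, entities))
    return parse_intent(reply, intents, entities)
```

When `classify` returns `None`, the sensible behaviour is to ask the user to repeat rather than to guess; a wrong action costs more trust than a polite "sorry, say that again".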

The model is small enough to respond inside the latency budget. The intents and entities are derived dynamically from the home-automation engine's state, so adding a new device doesn't require retraining anything. New intent? Add a row. New device? It auto-registers. The system grows without rework.
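As a rough illustration of the dynamic part, the sketch below assumes the engine is Home Assistant and uses its REST API, where `GET /api/states` with a bearer token returns a list of state objects carrying `entity_id` and `attributes`. The domain allow-list, the function names, and the choice to key on lower-cased friendly names are my assumptions, not something the platform prescribes.

```python
import json
import urllib.request

# Hypothetical allow-list: only expose domains that make sense by voice.
VOICE_DOMAINS = {"light", "switch", "climate", "scene", "media_player"}


def fetch_states(base_url, token):
    """Fetch all entity states over the REST API using a long-lived token."""
    request = urllib.request.Request(
        f"{base_url}/api/states",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def entities_for_prompt(states, domains=VOICE_DOMAINS):
    """Map spoken-friendly names to entity IDs for the allowed domains."""
    names = {}
    for item in states:
        entity_id = item["entity_id"]
        domain = entity_id.split(".", 1)[0]
        if domain not in domains:
            continue
        # Fall back to the raw entity ID when no friendly name is set.
        friendly = item.get("attributes", {}).get("friendly_name", entity_id)
        names[friendly.lower()] = entity_id
    return names
```

Rebuilding this mapping on a timer, or whenever the engine reports a registry change, is what lets a newly added device show up in the prompt without touching the model.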

The complex query case

For simple commands the architecture above is enough. The trickier case is when a household member says something like "what's on the calendar tomorrow" or "summarise my morning briefing" — queries that require composing information rather than dispatching an action. For these we route to a slightly larger local model (still local — never the cloud for a voice query in this house) and let it produce a short paragraph response.

The latency budget loosens because spoken responses to longer queries naturally take longer; nobody minds a half-second pause before a 30-second answer if the answer is good. The trick is that the larger model only spins up for queries the intent classifier flags as "compose-response" rather than "dispatch-action". The simple stuff stays fast.
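One way to sketch that routing, assuming the classifier's output carries a `route` field set to either "dispatch-action" or "compose-response" (the field name, `LazyModel`, and `route_query` are illustrative, not the names used in our setup). The larger model is wrapped so that it loads on first use only.

```python
from typing import Callable, Optional


class LazyModel:
    """Defer loading the larger model until a query actually needs it."""

    def __init__(self, loader: Callable[[], Callable[[str], str]]):
        self._loader = loader
        self._model: Optional[Callable[[str], str]] = None
        self.loads = 0  # counts loader calls, handy when testing

    def __call__(self, prompt: str) -> str:
        if self._model is None:
            self._model = self._loader()
            self.loads += 1
        return self._model(prompt)


def route_query(
    classified: dict,
    transcript: str,
    dispatch: Callable[[dict], str],
    composer: Callable[[str], str],
) -> str:
    """Send the query down the fast path or the compose path."""
    kind = classified.get("route")
    if kind == "dispatch-action":
        # Fast path: never touches the larger model.
        return dispatch(classified)
    if kind == "compose-response":
        return composer(transcript)
    raise ValueError(f"unknown route: {kind!r}")
```

Whether you keep the larger model warm after its first use or unload it on idle is a memory-versus-latency choice that depends on the central node.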

What you give up and what you don't

Honest list of tradeoffs:

You give up the always-listening microphones inside Big Tech device fleets. Some people see that as a feature.
You give up the integration with the Big Tech ecosystem of "skills" — calendars, music, shopping. Most of those have local equivalents that work via the home-automation engine. Calendars — yes. Music — yes, via local media servers. Shopping — frankly, you didn't want voice shopping anyway.
You give up the polished onboarding experience. The first hour of setup is fiddly. Once it works, it works.
You don't give up wake-word reliability — modern open wake-word models are at parity for common phrases.
You don't give up speech quality — modern local TTS is conversational and natural.
You don't give up speed — as discussed above, you actually gain it.
You don't give up smart-home integration depth — this is where the open ecosystem actually exceeds Big Tech, because the home-automation engine talks to far more devices than any single commercial voice assistant.

The privacy posture, properly stated

The privacy claim is specific and worth saying carefully. With this setup, once configured:

Wake-word detection is on-device. No audio leaves the microphone array until the wake word fires.
Transcription is local. No audio leaves the home network.
Intent classification and response generation are local.
TTS is local.
Smart-home actions are dispatched locally.

The system reaches the internet for things that genuinely require the internet — fetching the weather, fetching the news headlines, calendar synchronisation — but voice queries themselves are end-to-end local. That's a meaningful distinction from any cloud-first architecture, and it's the architectural property that motivated the rebuild in the first place.

Build sequence I'd recommend

Get the home-automation engine running with all your devices integrated. This is the foundation. Don't skip steps; if your device coverage is thin, the voice layer will be unsatisfying.
Add local transcription. Test it on pre-recorded voice notes, checking the text output, before you wire up microphones.
Stand up a small intent-classification model. Define a starting set of 10–20 intents. Test the round trip.
Wire in the voice front-end hardware. Tune the wake-word and end-of-speech timing.
Add local TTS. Pick a voice you can stand listening to a hundred times a day.
Iterate the intent set as you discover the gaps. The gaps are the actual product.
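The end-of-speech timing in step four is the knob people most often get wrong, so here is a toy sketch of what the grace period does. It assumes a plain energy threshold over fixed-length audio frames; real front-ends typically use a trained voice-activity detector instead, and every name and default value below is hypothetical.

```python
def find_end_of_speech(frame_energies, frame_ms=20, threshold=0.02, grace_ms=400):
    """Return the index of the frame at which the utterance is declared over,
    or None if the speaker has not yet been silent for the full grace period.

    frame_energies: per-frame loudness values (for example RMS), oldest first.
    grace_ms: how long silence must last before we stop listening.
    """
    frames_needed = grace_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for index, energy in enumerate(frame_energies):
        if energy >= threshold:
            # Speech (or noise) resets the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Only count silence that follows actual speech.
            silent_run += 1
            if silent_run >= frames_needed:
                return index
    return None
```

A shorter `grace_ms` shaves latency directly off the budget but cuts off people who pause mid-sentence; a longer one does the reverse. Tune it against recordings of the household's actual speech rather than your own.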

The hardware sizing question for the smart home

Households over-provision the central node and under-provision the voice front-ends, every time. The right shape is the opposite. The central inference node, where Whisper, the intent model, and TTS run, needs enough memory to hold the models warm and enough CPU to render TTS in real time, but it does not need to be exotic. A single mid-range machine in the Apple Silicon Mac mini class handles the entire voice loop for a six-room household with headroom. The voice front-ends, by contrast, want decent microphone arrays: multi-mic beamforming makes a noticeable difference in a noisy room. A cheap microphone with no array will frustrate the user no matter how good every other component is.

Network-wise, all of this should be on a private VLAN with the home-automation engine. Voice traffic should never traverse the internet. If you can't enforce that as a network property, you have not yet built a privacy-preserving system; you have built one that depends on configuration discipline that will eventually fail.
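Enforcement belongs in the firewall rules for that VLAN, but a small configuration check catches the obvious mistakes before they ship. The sketch below assumes you keep the voice pipeline's endpoints in a simple name-to-host mapping (the mapping, the addresses, and the function name are hypothetical) and flags anything that is not a private or loopback IP address. It uses only Python's standard `ipaddress` module.

```python
import ipaddress


def non_local_endpoints(endpoints):
    """Return the names of endpoints whose host is not a private or loopback IP.

    Hostnames are flagged too, because this check cannot tell where they
    resolve; pin them to an address or resolve them yourself first.
    """
    offenders = []
    for name, host in endpoints.items():
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            offenders.append(name)
            continue
        if not (address.is_private or address.is_loopback):
            offenders.append(name)
    return offenders


# Illustrative configuration only.
VOICE_ENDPOINTS = {
    "transcription": "192.168.20.10",
    "intent_model": "10.0.20.11",
    "tts": "127.0.0.1",
}
print(non_local_endpoints(VOICE_ENDPOINTS))
```

Treat this as a lint, not a guarantee: it depends on the configuration being honest, which is exactly the discipline the network rule is there to back up.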

Storage requirements are minimal. Models cached on disk, a small log of intent-classification samples for review and improvement, the home-automation engine's own state. A 256GB SSD is generous. The compute side dominates everything; storage is a footnote.

Where this stack overlaps with the broader AI infrastructure

If you've already built a sovereign-AI infrastructure for business workloads, the voice-assistant stack reuses most of it. The Whisper-class transcription, the local LLM cluster, the TTS layer — these are general-purpose components. The smart-home voice loop is one consumer among many. That's the architectural payoff of treating sovereign AI as infrastructure rather than as a single application: the substrate serves the office, the home, the workshop, the car. Each new use case adds incrementally rather than starting from scratch.

For households that haven't yet built the broader infrastructure, the smart-home voice rebuild is a reasonable place to start. The use case is bounded, the failure modes are recoverable (a missed intent isn't a production incident), and the household members give you direct, immediate feedback on quality. By the time it works well, you'll have learned enough about the substrate to apply it elsewhere.

Replacing the Big Tech voice assistant in a serious smart home is no longer a curiosity project. The component pieces — open-source automation, local Whisper-class transcription, small fast LLMs, neural TTS — are all production-ready, all locally hostable, and all integrate with each other through well-understood interfaces. The result genuinely beats the commercial incumbents on speed and depth, and the privacy story is the byproduct rather than the marketing.

If you're going to do it, do it as a household-grade engineering project rather than a hobby. The thresholds for "my partner will use it" are higher than the thresholds for "I will use it". Hit the higher bar and the system stops being a project.
