JW · Josh Weir

Local RAG vs hosted RAG: a working operator's framework

The retrieval-augmented generation pattern — embed your documents, store the embeddings, retrieve the relevant ones at query time, hand them to a model — has become a standard operator-grade move for any AI workflow that depends on knowledge the model was not trained on. The architectural question is whether you run the retrieval layer yourself or pay a hosted service to run it for you. The honest answer depends on four variables, and the calculation is more often in favour of running it yourself than the marketing of the hosted services suggests.

This piece is the framework I use when a client asks. It is not anti-hosted-RAG; some of the hosted services are genuinely good. It is pro-clarity about what you are buying, what you are giving up, and what the architecture actually requires.

What hosted RAG actually does for you

The honest case for the hosted version: someone else worries about embedding-model selection, the chunking strategy, the index storage, the retrieval API, the reranking, the scaling, the upgrades, and the integration with the most common model providers. You upload documents, you query, you get answers. For a team that is starting from zero and needs a working RAG pipeline by end of next week, the hosted option is the right answer.

The price you pay is on three axes. First, money — typically per-document-stored, per-query, or some combination. The numbers add up with volume. Second, lock-in — the embeddings, the index, and often the chunking are in the vendor's format and migrating out is non-trivial. Third, sovereignty — the documents you embedded are now sitting on someone else's infrastructure.

For exploratory work or for non-sensitive corpora, none of these prices is necessarily a deal-breaker. For a long-lived knowledge layer that holds material the business actually cares about, all three start to bite.

What local RAG requires you to do

Local RAG is four components. None of them is hard individually; the discipline is in keeping the pipeline coherent.

  • An embedding model running on your own hardware. Open-weights embedding models in 2026 are excellent and run comfortably on Apple Silicon, fast enough to embed queries interactively.
  • A vector store running on your own hardware. Open-source options are mature, support hybrid search, and scale to millions of documents on a single workstation.
  • A chunking and ingestion pipeline that takes raw documents in (PDF, markdown, web pages, transcripts, whatever) and produces chunks with metadata. This is the part teams under-invest in.
  • A retrieval and reranking layer that takes a query, retrieves candidates, reranks them, and returns the top-k. The reranker is often a smaller model that is much better at relevance than the embedding model alone.

Built and maintained correctly, the running cost of this stack is the electricity for the workstation, assuming you already own the hardware. The setup time is on the order of a few days for a competent operator. The ongoing maintenance is small but real: you are responsible for index health, for re-embedding when models are upgraded, and for monitoring retrieval quality.
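
As a rough illustration of how the retrieval and reranking stages fit together, here is a minimal sketch in Python. The `embed` and `rerank_score` functions are toy stand-ins (word counts and term overlap), not real models; in an actual stack they would be replaced by calls to your embedding model and reranker, and the list of chunks by a vector store. All names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, chunk: str) -> float:
    """Toy stand-in for a reranker: share of query terms found in the chunk."""
    q_terms = set(embed(query))
    c_terms = set(embed(chunk))
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve(query: str, chunks: list[str], candidates: int = 10, top_k: int = 3) -> list[str]:
    """Two-stage retrieval: broad similarity search, then rerank the short list."""
    q_vec = embed(query)
    # Stage 1: cheap similarity over every chunk, keep the best candidates.
    shortlist = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:candidates]
    # Stage 2: score only the shortlist with the (normally more expensive) reranker.
    return sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)[:top_k]
```

The point of the two stages is cost: the first pass touches every chunk and so has to be cheap, while the reranker only ever sees the shortlist and can afford to be slower and more accurate.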

The four variables that decide

The decision between local and hosted comes down to four variables.

  1. Sensitivity. If the corpus is sensitive — internal client work, contracts, deal flow, health information, anything covered by data-protection law in any jurisdiction — local is the default. The argument for sending the data to a third party has to be unusually strong.
  2. Volume. Hosted services price for the volume they expect. At low volume — under a thousand documents, a few queries a day — the per-query price is invisible. At medium volume — tens of thousands of documents, hundreds of queries a day — the price starts to add up. At high volume the local stack is cheaper by a margin that pays off the setup time inside a quarter.
  3. Update frequency. A static corpus that is embedded once and queried many times is well-suited to either approach. A corpus that updates daily, where every change has to be re-embedded, favours the local stack — the round-trip latency of sending updates to a hosted service becomes operationally annoying.
  4. Customisation. If the workflow needs unusual chunking strategies, custom metadata, or hybrid retrieval that combines vector search with keyword and structured filters, the local stack gives you all of that for free. The hosted services give you what their API exposes, and not much more.
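
The volume variable is the one that reduces most directly to arithmetic. The sketch below assumes a simple pricing model (a storage charge per thousand documents plus a per-query charge), which is an assumption for illustration and not any particular vendor's price sheet. The function and parameter names are hypothetical.

```python
from typing import Optional


def hosted_monthly_cost(docs_stored: int, queries_per_day: float,
                        price_per_1k_docs: float, price_per_query: float,
                        days_per_month: int = 30) -> float:
    """Monthly hosted bill under an assumed storage-plus-query pricing model."""
    storage = docs_stored / 1000 * price_per_1k_docs
    queries = queries_per_day * days_per_month * price_per_query
    return storage + queries


def breakeven_months(setup_cost: float, hosted_monthly: float,
                     local_monthly: float) -> Optional[float]:
    """Months until the one-off local setup cost is recovered.

    Returns None when hosted is no more expensive per month than local,
    in which case the setup cost is never paid back.
    """
    saving = hosted_monthly - local_monthly
    if saving <= 0:
        return None
    return setup_cost / saving
```

Feed it the figures from an actual quote, your own day rate for the setup time, and a realistic monthly figure for electricity and maintenance. If the result is `None` or runs to years, volume alone does not justify the local stack and the decision rests on the other three variables.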

The chunking discipline

The single most under-discussed component of any RAG pipeline is the chunking strategy. Most teams default to fixed-size chunks with a small overlap and never revisit the choice. The right strategy depends on the corpus, and the time spent getting it right pays back many times over in retrieval quality.

For long-form documents (articles, books, transcripts) we chunk by semantic boundaries — paragraphs, sections, speaker turns — with a token budget per chunk and a parent-document metadata tag. For structured documents (contracts, forms, technical reference) we chunk by section heading and preserve the hierarchy in metadata. For conversational data we chunk by message turn with conversation-thread context preserved.
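
A minimal sketch of the long-form case, assuming paragraphs are separated by blank lines and approximating tokens by whitespace-separated words (a real pipeline would count with the embedding model's own tokenizer). The `Chunk` class and `chunk_by_paragraph` function are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_by_paragraph(doc_id: str, text: str, max_tokens: int = 200) -> list[Chunk]:
    """Pack whole paragraphs into chunks, staying within max_tokens where possible.

    Paragraphs are never split, so a single paragraph longer than the budget
    becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    groups: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # word count as a rough token estimate
        if current and current_len + n > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            groups.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        groups.append(current)
    return [
        Chunk(text="\n\n".join(g), metadata={"parent_doc": doc_id, "chunk_index": i})
        for i, g in enumerate(groups)
    ]
```

The same shape works for the other two cases by changing the split: section headings for structured documents, message turns for conversational data, with the heading path or thread identifier added to the metadata.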

The metadata is as important as the chunk content. A chunk with a date, a source, an author, a section heading, and a parent-document reference is dramatically more useful at retrieval time than a chunk with no metadata. Hosted services often constrain what metadata you can attach. The local stack does not.
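
One place the metadata earns its keep is in narrowing the candidate set before any similarity search runs. The sketch below assumes each chunk is a dictionary with a `metadata` entry holding a `source` string and a `date`; those field names, and the two functions, are hypothetical.

```python
from datetime import date
from typing import Optional


def matches(meta: dict, source: Optional[str] = None, since: Optional[date] = None) -> bool:
    """True when the chunk's metadata satisfies every filter that was supplied."""
    if source is not None and meta.get("source") != source:
        return False
    if since is not None:
        chunk_date = meta.get("date")
        # Chunks with no date cannot satisfy a date filter.
        if chunk_date is None or chunk_date < since:
            return False
    return True


def prefilter(chunks: list[dict], source: Optional[str] = None,
              since: Optional[date] = None) -> list[dict]:
    """Keep only chunks whose metadata passes the structured filters."""
    return [c for c in chunks if matches(c["metadata"], source=source, since=since)]
```

A query such as "what did we agree on termination in contracts signed this year" then searches only chunks tagged as contracts and dated within the year, which is something a chunk without metadata cannot support at all.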

Measuring retrieval quality

Retrieval quality is the single biggest determinant of answer quality in any RAG workflow. You should measure it. Most teams do not.

The minimum-viable measurement is a held-out evaluation set: a small number of representative queries, each paired with the documents that should be retrieved for it, scored against actual retrieval results. The metric is recall at k: for each query, the fraction of its ideal documents that appear in the top k results, averaged over the evaluation set.

Run this evaluation after every change to the chunking strategy, the embedding model, or the reranker. The numbers move noticeably. Decisions that look architectural — “should we change embedding models?” — become empirical questions with measurable answers. Without the evaluation set you are guessing, and most guesses about retrieval quality are wrong.
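
The evaluation described above fits in a few lines. This sketch assumes the evaluation set is a list of (query, set of ideal document IDs) pairs and that `retrieve_fn` is whatever function returns a ranked list of document IDs for a query; both shapes are assumptions for illustration.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def evaluate(eval_set: list[tuple[str, set[str]]],
             retrieve_fn: Callable[[str], list[str]], k: int = 5) -> float:
    """Mean recall at k across every query in the evaluation set."""
    if not eval_set:
        raise ValueError("evaluation set is empty")
    scores = [recall_at_k(retrieve_fn(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Keep the evaluation set fixed between runs so the number is comparable: record the score before a change to chunking, embedding model, or reranker, make the change, re-index, and run it again.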

The takeaway

Hosted RAG is a reasonable choice when sensitivity is low, volume is low, the corpus is static, and the customisation needs are minimal. Local RAG is the right choice for a long-lived knowledge layer that holds material the business actually depends on, that updates regularly, and that wants the architectural freedom to evolve.

The pattern is no longer hard to build. The discipline is in the chunking, the metadata, and the measurement. Get those right and the RAG layer becomes one of the most durable assets in the AI surface area of the business — owned, observable, and improving over time.

