Local RAG vs hosted RAG: a working operator's framework

What hosted RAG actually does for you

The honest case for the hosted version: someone else worries about embedding-model selection, the chunking strategy, the index storage, the retrieval API, the reranking, the scaling, the upgrades, and the integration with the most common model providers. You upload documents, you query, you get answers. For a team that is starting from zero and needs a working RAG pipeline by end of next week, the hosted option is the right answer.

The price you pay is on three axes. First, money: typically charged per document stored, per query, or some combination of the two. The numbers add up with volume. Second, lock-in: the embeddings, the index, and often the chunking are in the vendor's format, and migrating out is non-trivial. Third, sovereignty: the documents you embedded are now sitting on someone else's infrastructure.

For exploratory work or for non-sensitive corpora, none of these prices is necessarily a deal-breaker. For a long-lived knowledge layer that holds material the business actually cares about, all three start to bite.

What local RAG requires you to do

Local RAG is four components. None of them is hard individually; the discipline is in keeping the pipeline coherent.

An embedding model running on your own hardware. Open-weights embedding models in 2026 are excellent and run comfortably on Apple Silicon, fast enough for both interactive queries and bulk ingestion.
A vector store running on your own hardware. Open-source options are mature, support hybrid search, and scale to millions of documents on a single workstation.
A chunking and ingestion pipeline that takes raw documents in (PDF, markdown, web pages, transcripts, whatever) and produces chunks with metadata. This is the part teams under-invest in.
A retrieval and reranking layer that takes a query, retrieves candidates, reranks them, and returns the top-k. The reranker is often a smaller model that is much better at relevance than the embedding model alone.
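To make the shape of the four components concrete, here is a minimal sketch of the retrieve-then-rerank flow. It is a toy under stated assumptions, not a production design: `embed` is a bag-of-words stand-in for a real embedding model, `rerank_score` is a term-overlap stand-in for a real reranker model, and the "vector store" is a plain Python list. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for an embedding model: a term-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query, text):
    """Toy stand-in for a reranker: share of query terms found in the chunk."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(text))) / len(query_terms)


def build_store(chunks):
    """Toy stand-in for a vector store: a list of text/vector records."""
    return [{"text": text, "vector": embed(text)} for text in chunks]


def retrieve(query, store, candidates=10, top_k=3):
    """Stage 1: vector similarity picks candidates. Stage 2: rerank, keep top_k."""
    query_vector = embed(query)
    pool = sorted(
        store, key=lambda c: cosine(query_vector, c["vector"]), reverse=True
    )[:candidates]
    pool.sort(key=lambda c: rerank_score(query, c["text"]), reverse=True)
    return pool[:top_k]
```

The point of the two stages is that the first can be cheap and broad while the second is slower and more precise, so the reranker only ever sees a handful of candidates.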

Built and maintained correctly, this stack's running cost is the electricity for the workstation. The setup time is on the order of a few days for a competent operator. The ongoing maintenance is small but real: you are responsible for index health, for re-embedding when models are upgraded, and for monitoring retrieval quality.
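One way to keep the re-embedding chore small is to track, for every document, a hash of its content and the embedding model that produced its vectors. The sketch below assumes a hypothetical manifest shaped as `{doc_id: {"hash": ..., "model": ...}}`; the function and field names are illustrative and do not come from any particular vector store.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(documents, manifest, model_version):
    """Decide which documents need (re-)embedding and which vectors to drop.

    documents: {doc_id: text} as currently found on disk.
    manifest:  {doc_id: {"hash": str, "model": str}} recorded at last ingest
               (hypothetical format, assumed for this sketch).
    Returns (to_embed, to_delete) as lists of doc ids.
    """
    to_embed = []
    for doc_id, text in documents.items():
        expected = {"hash": content_hash(text), "model": model_version}
        # New document, changed content, or a different embedding model.
        if manifest.get(doc_id) != expected:
            to_embed.append(doc_id)
    # Documents that disappeared from the corpus should leave the index too.
    to_delete = [doc_id for doc_id in manifest if doc_id not in documents]
    return to_embed, to_delete
```

With this in place a model upgrade is just a change of `model_version`: every document then shows up in `to_embed`, while a routine daily run only touches what actually changed.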

The four variables that decide

The decision between local and hosted comes down to four variables.

Sensitivity. If the corpus is sensitive — internal client work, contracts, deal flow, health information, anything covered by data-protection law in any jurisdiction — local is the default. The argument for sending the data to a third party has to be unusually strong.
Volume. Hosted services price for the volume they expect. At low volume — under a thousand documents, a few queries a day — the per-query price is invisible. At medium volume — tens of thousands of documents, hundreds of queries a day — the price starts to add up. At high volume the local stack is cheaper by a margin that pays off the setup time inside a quarter.
Update frequency. A static corpus that is embedded once and queried many times is well-suited to either approach. A corpus that updates daily, where every change has to be re-embedded, favours the local stack — the round-trip latency of sending updates to a hosted service becomes operationally annoying.
Customisation. If the workflow needs unusual chunking strategies, custom metadata, or hybrid retrieval that combines vector search with keyword and structured filters, the local stack gives you all of that for free. The hosted services give you what their API exposes, and not much more.
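The volume variable is the easiest of the four to put numbers on. The sketch below is a back-of-envelope break-even calculation, assuming a hosted service that charges per document stored per month plus per query. Every price in the example is made up for illustration; substitute the figures from an actual quote and your own setup estimate.

```python
def hosted_monthly_cost(docs_stored, queries_per_day, price_per_doc,
                        price_per_query, days_per_month=30):
    """Monthly hosted bill under a simple storage-plus-query pricing model."""
    storage = docs_stored * price_per_doc
    querying = queries_per_day * days_per_month * price_per_query
    return storage + querying


def breakeven_months(setup_cost, local_monthly_cost, hosted_monthly):
    """Months until the local stack has paid back its setup cost.

    Returns None when hosted is no more expensive per month than local,
    in which case the setup cost is never recovered.
    """
    monthly_saving = hosted_monthly - local_monthly_cost
    if monthly_saving <= 0:
        return None
    return setup_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical prices and costs, for illustration only.
    scenarios = {"low": (500, 5), "medium": (50_000, 500), "high": (500_000, 5_000)}
    for name, (docs, queries) in scenarios.items():
        hosted = hosted_monthly_cost(docs, queries, price_per_doc=0.002,
                                     price_per_query=0.01)
        months = breakeven_months(setup_cost=1800, local_monthly_cost=20,
                                  hosted_monthly=hosted)
        print(name, round(hosted, 2), months)
```

The model deliberately ignores hardware purchase and maintenance hours; add them to `setup_cost` and `local_monthly_cost` if they matter for your comparison.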

The chunking discipline

The single most under-discussed component of any RAG pipeline is the chunking strategy. Most teams default to fixed-size chunks with a small overlap and never revisit the choice. The honest answer is that the right chunking strategy depends on the corpus, and the time spent getting it right pays back many times over in retrieval quality.

For long-form documents (articles, books, transcripts) we chunk by semantic boundaries — paragraphs, sections, speaker turns — with a token budget per chunk and a parent-document metadata tag. For structured documents (contracts, forms, technical reference) we chunk by section heading and preserve the hierarchy in metadata. For conversational data we chunk by message turn with conversation-thread context preserved.
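As an illustration of the long-form case, the sketch below packs whole paragraphs into chunks under a token budget and tags each chunk with its parent document. It makes two simplifying assumptions: paragraphs are separated by blank lines, and tokens are approximated by whitespace-separated words, where a real pipeline would count with the embedding model's own tokenizer. The function and field names are hypothetical.

```python
import re


def chunk_by_paragraph(text, doc_id, max_tokens=200):
    """Group consecutive paragraphs into chunks of at most max_tokens words.

    A paragraph is never split, so a single paragraph longer than the
    budget becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    groups, current, current_len = [], [], 0
    for para in paragraphs:
        n_tokens = len(para.split())
        # Flush the current group if this paragraph would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(para)
        current_len += n_tokens
    if current:
        groups.append(current)

    return [
        {
            "text": "\n\n".join(group),
            "parent_doc": doc_id,
            "chunk_index": index,
            "token_count": sum(len(p.split()) for p in group),
        }
        for index, group in enumerate(groups)
    ]
```

The same packing loop works for section headings or speaker turns: only the splitting step at the top changes.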

The metadata is as important as the chunk content. A chunk with a date, a source, an author, a section heading, and a parent-document reference is dramatically more useful at retrieval time than a chunk with no metadata. Hosted services often constrain what metadata you can attach. The local stack does not.
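A small sketch of why the metadata earns its keep at retrieval time: once each chunk carries structured fields, the candidate set can be narrowed before any similarity search runs. The `Chunk` fields below mirror the ones listed above; the class and the `filter_chunks` helper are hypothetical names, not the API of any particular vector store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    author: str
    section: str
    parent_doc: str
    published: date


def filter_chunks(chunks, source=None, author=None, published_on_or_after=None):
    """Keep only chunks that match every filter that was supplied."""
    kept = []
    for chunk in chunks:
        if source is not None and chunk.source != source:
            continue
        if author is not None and chunk.author != author:
            continue
        if (published_on_or_after is not None
                and chunk.published < published_on_or_after):
            continue
        kept.append(chunk)
    return kept
```

A query such as "what changed in the supplier contract this year" can then be restricted to the contracts source and to recent dates before ranking, which tends to matter more than any tuning of the similarity score.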

Measuring retrieval quality

Retrieval quality is the single biggest determinant of answer quality in any RAG workflow. You should measure it. Most teams do not.

The minimum-viable measurement is a held-out evaluation set: a small number of representative queries, each paired with the ideal documents to retrieve, scored against actual retrieval results. The metric is recall at k: for each query, the fraction of its ideal documents that appear in the top k retrieved results, averaged over the evaluation set.
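The measurement fits in a few lines. The sketch below computes recall at k as the share of a query's ideal documents that appear in the top k results, then averages it over the evaluation set. It assumes the evaluation set is a list of `(query, relevant_doc_ids)` pairs and that `retrieve_fn` returns a ranked list of document ids; both shapes are assumptions made for this example.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each evaluation query needs at least one relevant document")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(eval_set, retrieve_fn, k):
    """Average recall at k over (query, relevant_ids) pairs.

    retrieve_fn(query) must return document ids ranked best-first.
    """
    if not eval_set:
        raise ValueError("evaluation set is empty")
    scores = [
        recall_at_k(retrieve_fn(query), relevant_ids, k)
        for query, relevant_ids in eval_set
    ]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` against the same evaluation set before and after a change gives two numbers to compare, which is what turns a chunking or model debate into a measurement.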

Run this evaluation after every change to the chunking strategy, the embedding model, or the reranker. The numbers move noticeably. Decisions that look architectural — “should we change embedding models?” — become empirical questions with measurable answers. Without the evaluation set you are guessing, and most guesses about retrieval quality are wrong.

The takeaway

Hosted RAG is a reasonable choice when sensitivity is low, volume is low, the corpus is static, and the customisation needs are minimal. Local RAG is the right choice for a long-lived knowledge layer that holds material the business actually depends on, that updates regularly, and that wants the architectural freedom to evolve.

The pattern is no longer hard to build. The discipline is in the chunking, the metadata, and the measurement. Get those right and the RAG layer becomes one of the most durable assets in the AI surface area of the business — owned, observable, and improving over time.



