JW · Josh Weir

Prompt version control as proper engineering, not vibe coding

The way most teams handle prompts in 2026 would be unthinkable if applied to any other piece of production code. Prompts live in a notebook, or pasted in a Slack message, or hardcoded in a function with no comments and no history. When something breaks, no one is sure when the prompt last changed or what version is currently live. When a new model is released, the team has no systematic way to test whether the existing prompts still work.

This is solvable. Prompts are code. They should be in version control, they should be tested, they should be evaluated, and the changes to them should be reviewable. The discipline is not exotic. It is the basic engineering hygiene that has been applied to every other production artifact in the stack for two decades. The point of this piece is to translate that hygiene into the specific shape it takes for prompts.

Prompts as files, not strings

The first move is mechanical: every production prompt lives in a file in a repository. Not in a notebook. Not in a database column. Not in a string literal in application code. A file, in a directory, with a name that describes its purpose.

The file format we use is plain text with a YAML front-matter header. The header records the prompt's version, the date last changed, the model it was tuned against, the task it serves, and a free-form comment about the change history. The body is the prompt itself, with template variables marked in a consistent syntax.
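As a rough illustration of that layout, here is a minimal sketch of one prompt file and a parser for it. The field names, the values, and the `$variable` template syntax are assumptions made for the example, not a prescribed format, and the parser reads only flat `key: value` lines rather than full YAML.

```python
from string import Template

# Hypothetical prompt file contents; every field value here is illustrative.
PROMPT_FILE = """\
---
name: classify_ticket
version: 3
changed: 2026-01-15
model: example-model-v1
task: support ticket triage
comment: tightened the label list after misroutes on billing tickets
---
Classify the support ticket below as one of: billing, technical, other.
Reply with the label only.

Ticket: $ticket_text
"""


def parse_prompt(text: str) -> tuple[dict[str, str], str]:
    """Split a prompt file into (front-matter dict, body).

    Handles only flat `key: value` lines; nested or quoted YAML values
    would need a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("prompt file must start with a '---' line")
    try:
        end = lines.index("---", 1)  # closing delimiter of the header
    except ValueError:
        raise ValueError("front matter is not closed with '---'") from None
    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


meta, body = parse_prompt(PROMPT_FILE)
# string.Template fills $ticket_text and raises KeyError if it is missing.
rendered = Template(body).substitute(ticket_text="I was charged twice this month.")
```

Keeping the header flat means the metadata stays readable in a diff, which is most of the point of putting it in the file.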

Application code references prompts by name. The runtime loads the prompt from the file system, fills the template variables, and submits it to the model. Changing a prompt is editing the file, committing, and redeploying. The change is reviewable as a diff. The history is searchable in version-control logs. The institutional knowledge of why a prompt looks the way it does is captured in commit messages.
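The load-by-name step might look something like the sketch below. The `.prompt` extension, the directory layout, and the function names `load_prompt` and `render_prompt` are hypothetical, and the call to the model is left out because it depends on the provider.

```python
import tempfile
from pathlib import Path
from string import Template


def load_prompt(prompts_dir: Path, name: str) -> str:
    """Read <prompts_dir>/<name>.prompt and return the body without its header."""
    text = (prompts_dir / f"{name}.prompt").read_text(encoding="utf-8")
    # Expected layout: '---', header lines, '---', body.
    parts = text.split("---\n", 2)
    if len(parts) != 3 or parts[0] != "":
        raise ValueError(f"{name}: missing or malformed front matter")
    return parts[2].strip()


def render_prompt(prompts_dir: Path, name: str, **variables: str) -> str:
    """Load a prompt by name and fill its $variables.

    Template.substitute raises KeyError when a variable is not supplied,
    so a half-filled prompt never reaches the model.
    """
    return Template(load_prompt(prompts_dir, name)).substitute(**variables)


# Demonstration against a throwaway directory standing in for the repository.
with tempfile.TemporaryDirectory() as tmp:
    prompts_dir = Path(tmp)
    (prompts_dir / "summarise_email.prompt").write_text(
        "---\nversion: 1\n---\nSummarise this email in one sentence:\n$email\n",
        encoding="utf-8",
    )
    rendered = render_prompt(
        prompts_dir, "summarise_email", email="Meeting moved to 3pm."
    )
```

Application code then only ever holds the name `summarise_email`; the wording of the prompt changes through commits, not through code edits.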

The evaluation harness

Prompts in version control are necessary but not sufficient. Without an evaluation harness, you cannot tell whether a change to a prompt improved it, regressed it, or did nothing.

The evaluation harness is a small framework that, for each prompt, runs it against a set of test cases and scores the output. The test cases are representative inputs. The scoring is whatever is appropriate to the task — exact-match for classification, schema validity for structured output, an LLM-as-judge for free-form generation, a small regression test for code-like artifacts.
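Two of those scorers are simple enough to sketch directly. The function names and the flat "required keys and types" notion of schema validity are assumptions for illustration; an LLM-as-judge scorer would have the same shape (output in, score out) but needs a model call, so it is not shown.

```python
import json


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected label, ignoring case
    and surrounding whitespace; otherwise 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def schema_valid(output: str, required: dict[str, type]) -> float:
    """Score 1.0 when the output is a JSON object containing every required
    key with a value of the required type; otherwise 0.0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    ok = all(isinstance(data.get(key), kind) for key, kind in required.items())
    return 1.0 if ok else 0.0
```

Returning a float from every scorer, even the binary ones, lets the harness aggregate across tasks without caring which scorer produced the number.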

The harness produces a score per test case, an aggregate score per prompt, and a comparison against a baseline (typically the previous version of the same prompt). The score is logged in version control alongside the prompt. Every prompt change has a measurable before-and-after.
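A minimal sketch of that reporting shape, assuming a hypothetical `run_prompt` callable that wraps the real model call and a scorer that returns a float. The names `EvalCase`, `EvalReport`, and `evaluate` are invented for the example, and the model is replaced by a stub so the sketch runs on its own.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class EvalCase:
    case_id: str
    inputs: dict[str, str]
    expected: str


@dataclass
class EvalReport:
    per_case: dict[str, float]   # score per test case
    aggregate: float             # mean score for the prompt
    baseline: Optional[float]    # aggregate of the previous version, if any

    @property
    def delta(self) -> Optional[float]:
        """Change against the baseline; None when there is no baseline."""
        return None if self.baseline is None else self.aggregate - self.baseline


def evaluate(
    run_prompt: Callable[[dict[str, str]], str],
    cases: list[EvalCase],
    score: Callable[[str, str], float],
    baseline: Optional[float] = None,
) -> EvalReport:
    """Run every case, score it, and compare the mean with the baseline.
    Raises statistics.StatisticsError if `cases` is empty."""
    per_case = {c.case_id: score(run_prompt(c.inputs), c.expected) for c in cases}
    return EvalReport(per_case, mean(list(per_case.values())), baseline)


# Stub standing in for the model; a real run_prompt would call the provider.
def fake_model(inputs: dict[str, str]) -> str:
    return "billing" if "charged" in inputs["ticket_text"] else "other"


cases = [
    EvalCase("c1", {"ticket_text": "I was charged twice"}, "billing"),
    EvalCase("c2", {"ticket_text": "The app crashes on login"}, "technical"),
    EvalCase("c3", {"ticket_text": "Do you have a student discount?"}, "other"),
]
report = evaluate(
    fake_model,
    cases,
    score=lambda out, exp: 1.0 if out == exp else 0.0,
    baseline=0.5,
)
```

The report can be serialised and committed next to the prompt file, so the before-and-after numbers travel with the change.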

Building the harness is the highest-leverage AI engineering move you can make. The amount of time it saves over the lifetime of a prompt — by catching regressions early, by enabling confident iteration, by giving you a quantitative basis for model migrations — is enormous.

Promotion through environments

The same discipline that applies to code applies to prompts: changes are developed in a non-production environment, evaluated, and promoted to production after passing.

Our flow is three-stage. A new prompt version is written and committed to a development branch. The evaluation harness runs automatically on commit. If the score meets a threshold, the prompt is promoted to a staging environment where it runs against a small slice of real traffic. If the live metrics from the staging run are healthy after a defined window, the prompt is promoted to production. If anything goes wrong at any stage, the rollback is a single revert.
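The two gates in that flow reduce to small decision functions. This is a sketch under assumptions: the `Decision` names, the idea of a single numeric threshold, and the two boolean inputs to the staging gate are illustrative, and how "healthy" is measured is left to the observability layer.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    HOLD = "hold"
    ROLLBACK = "rollback"


def dev_gate(eval_score: float, threshold: float) -> Decision:
    """Development to staging: promote only when the harness score
    meets the threshold."""
    return Decision.PROMOTE if eval_score >= threshold else Decision.HOLD


def staging_gate(window_elapsed: bool, metrics_healthy: bool) -> Decision:
    """Staging to production: unhealthy metrics trigger a rollback at once;
    healthy metrics promote only after the observation window has elapsed."""
    if not metrics_healthy:
        return Decision.ROLLBACK
    if not window_elapsed:
        return Decision.HOLD
    return Decision.PROMOTE
```

For example, `dev_gate(0.93, threshold=0.9)` promotes, while `staging_gate(window_elapsed=False, metrics_healthy=True)` holds until the window closes.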

The infrastructure for this is small — a continuous-integration pipeline, a feature flag for which prompt version is live in which environment, and the same observability layer that watches every other workflow. The discipline is what makes prompt changes safe and frequent, rather than rare and terrifying.

The model migration test

The most expensive failure of un-versioned prompts is what happens when the underlying model changes. The provider deprecates a model, you migrate to the replacement, many of your prompts work as well as before, and some of them silently regress. Without an evaluation harness you discover the regressions in production, weeks later, when a customer notices.

With an evaluation harness, the model migration is a defined exercise. You run every prompt against the new model with the same test cases. The harness reports which prompts regressed, by how much, on which test cases. The regressions are addressed in priority order — usually with prompt edits, occasionally with structural redesigns. The migration is complete when every prompt scores at or above its pre-migration baseline.
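At the aggregate level, the migration report can be sketched as a comparison of two score tables. The function names, the prompt names, and the scores below are made up for illustration, and ordering by size of drop is one assumed reading of "priority order". A fuller version would also break each regression down by test case.

```python
def migration_regressions(
    baseline: dict[str, float], new: dict[str, float]
) -> list[tuple[str, float]]:
    """Return (prompt name, score change) for every prompt that scores lower
    on the new model, largest drop first. Raises KeyError if a baseline
    prompt was not evaluated on the new model."""
    regressions = [
        (name, new[name] - baseline[name])
        for name in baseline
        if new[name] < baseline[name]
    ]
    return sorted(regressions, key=lambda item: item[1])


def migration_complete(baseline: dict[str, float], new: dict[str, float]) -> bool:
    """True when every prompt scores at or above its pre-migration baseline."""
    return not migration_regressions(baseline, new)


# Illustrative aggregate scores per prompt, before and after the model change.
baseline_scores = {"classify_ticket": 0.95, "summarise_email": 0.88, "extract_invoice": 0.91}
new_scores = {"classify_ticket": 0.80, "summarise_email": 0.90, "extract_invoice": 0.89}

regressions = migration_regressions(baseline_scores, new_scores)
```

Here `regressions` lists `classify_ticket` ahead of `extract_invoice`, and `migration_complete` stays false until both are back at or above baseline.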

This is the difference between treating model migrations as crises and treating them as routine engineering work. The harness is what makes the difference.

What we deliberately do not formalise

Some parts of the prompt-engineering process resist formalisation, and it is honest to say so.

The actual writing of a prompt is still craft. The discipline tells you when a prompt is wrong; it does not tell you what the right one is. We treat prompts as drafts that go through several iterations before the harness is even consulted. The harness is the final check, not the writing tool.

Test-case authoring is similarly judgement-heavy. A test case that looks representative of an edge case often is not. We err on the side of more test cases, gathered from real production failures, rather than synthetic test cases generated by a model. The harness only catches regressions on the cases it contains, so the test set has to reflect the failures you care about.

And model selection itself is judgement plus measurement. The harness can tell you that prompt A is better than prompt B on model X. It cannot tell you whether model X is the right model for this workflow at all; that is upstream of the evaluation.

The takeaway

Prompts are not magic incantations. They are code. The engineering hygiene that makes any other piece of production code reliable applies to them as well: version control, evaluation, promotion through environments, regression testing on dependency changes. The cost of building this discipline is a few weeks of upfront work. The cost of not building it is the slow erosion of trust in the AI surface area of the business as drift accumulates and no one can tell whether the workflows that worked last month still work today.

If your prompts are still strings in application code, the migration to versioned files is a one-week project. If your prompts have no evaluation harness, building one is a two-week project. Neither is heroic. Both are necessary if AI is going to be a load-bearing part of the operation.
