What you actually need to see
Most observability writing assumes that every team needs the full surface. Most teams do not. The honest list of what an operator-grade business actually needs visibility into:
- Whether the public-facing endpoints are responding, and how fast.
- Whether each workflow is succeeding at its expected cadence, and how long it takes when it does.
- How much each workflow costs in AI tokens and currency, and how much human-time-equivalent it saves.
- Whether the data stores are healthy: disk pressure, query latency, replication lag if any.
- Whether the host machines are healthy: CPU, memory, disk, temperature where relevant.
- What just broke, when, where, and whether anything was auto-recovered.
That is roughly thirty to fifty distinct metrics for a small operation, with a sane dimensional split (per-workflow, per-host, per-service). It is not a million metrics. The high end of the SaaS observability category is sized for organisations that genuinely need a million metrics. Most operators do not.
The three components
The pattern is three components, all open-source, all running on the workhorse.
The metrics store is a time-series database. Every metric is a measurement with a name, a set of tag dimensions, a timestamp, and a numeric value. The store accepts writes from the orchestration layer, from a host-metrics daemon, and from a small set of HTTP endpoints we expose for ad-hoc instrumentation. Retention is two years for everything, with downsampling on the older data so the disk footprint stays manageable.
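As a rough illustration of that shape, here is a minimal Python sketch of a measurement and of downsampling older data. The names `Measurement` and `downsample_mean`, and the one-hour bucket, are assumptions made for the example; they are not the API of any particular time-series database, and the real store does this work internally.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class Measurement:
    """One point: a name, tag dimensions, a timestamp, and a numeric value."""
    name: str
    timestamp: float  # Unix seconds
    value: float
    tags: dict = field(default_factory=dict)


def downsample_mean(points, bucket_seconds=3600):
    """Collapse points into one mean value per (name, tags, time bucket)."""
    buckets = defaultdict(list)
    for p in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = p.timestamp - (p.timestamp % bucket_seconds)
        key = (p.name, tuple(sorted(p.tags.items())), bucket_start)
        buckets[key].append(p.value)
    return [
        Measurement(name=name, timestamp=start, value=fmean(values), tags=dict(tags))
        for (name, tags, start), values in sorted(buckets.items())
    ]
```

Averaging per bucket is only one possible policy; keeping the maximum or a percentile may suit latency metrics better.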
The dashboard layer reads from the metrics store and renders human-readable views. We maintain about a dozen dashboards, including an executive overview, infrastructure health, AI cost analytics, error tracking, social-energy correlation, service health detail, communications volumes, CRM and revenue KPIs, and content and intelligence metrics. Each dashboard is a JSON file in version control. We do not edit them in the UI; we edit them in the file and commit.
The alerting layer evaluates rules against the metrics store on a thirty-second cadence. When a rule fires, it routes a notification to the messaging stack. The rules are deliberately conservative. We have learned that an alerting layer that fires false positives is worse than one that does not exist.
The instrumentation discipline
The architecture is half the story. The other half is the discipline of consistently writing measurements from every workflow.
Our convention is that every workflow execution writes two measurements: one at the start and one at the end. The start measurement records the workflow name, the trigger source, and a unique execution ID. The end measurement records the same plus duration, success or failure, AI tokens consumed, AI cost in currency, and human-time-equivalent saved. The dashboards aggregate from these.
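One way to sketch this convention in Python is a context manager that writes the start measurement on entry and the end measurement on exit, even when the workflow raises. Everything named here (`instrumented`, `RunStats`, the field names, and the `write` callback standing in for the metrics-store client) is hypothetical and chosen for illustration.

```python
import time
import uuid
from contextlib import contextmanager


class RunStats:
    """Mutable counters a workflow fills in while it runs."""

    def __init__(self):
        self.ai_tokens = 0
        self.ai_cost = 0.0
        self.minutes_saved = 0.0


@contextmanager
def instrumented(workflow, trigger, write):
    """Write a start measurement on entry and an end measurement on exit."""
    base = {
        "workflow": workflow,
        "trigger": trigger,
        "execution_id": uuid.uuid4().hex,
    }
    write({"event": "start", **base})
    stats = RunStats()
    started = time.monotonic()
    success = False
    try:
        yield stats
        success = True  # only reached if the body did not raise
    finally:
        write({
            "event": "end",
            **base,
            "duration_s": time.monotonic() - started,
            "success": success,
            "ai_tokens": stats.ai_tokens,
            "ai_cost": stats.ai_cost,
            "minutes_saved": stats.minutes_saved,
        })


# Usage: collect measurements in a list instead of a real metrics store.
records = []
with instrumented("invoice-sync", "cron", records.append) as stats:
    stats.ai_tokens = 1200
```

Putting the end write in a `finally` block means a failed execution still produces its end measurement, with `success` set to false.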
The discipline matters because retroactive instrumentation is much more expensive than upfront. Every workflow we ship now has the measurements built in by default. The orchestration template includes them. New developers do not have to remember; the framework remembers for them.
The other discipline is type safety. The metrics store rejects mixed types per field — if a metric is a float in one write and an integer in the next, the second write fails. We learned this the loud way. The fix is a wrapping function that coerces every numeric field to its declared type before write. Once the discipline is in place, the data is durable.
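A wrapping function of this kind might look like the following Python sketch. The schema `FIELD_TYPES` and the function name `coerce_fields` are assumptions for the example, as is the choice to reject undeclared fields and lossy float-to-int casts instead of passing them through.

```python
# Hypothetical schema: numeric field name -> declared type.
FIELD_TYPES = {
    "duration_s": float,
    "ai_tokens": int,
    "ai_cost": float,
}


def coerce_fields(fields, schema=FIELD_TYPES):
    """Return a copy of `fields` with every value cast to its declared type."""
    coerced = {}
    for name, value in fields.items():
        if name not in schema:
            raise KeyError(f"undeclared field: {name}")
        declared = schema[name]
        # Refuse to silently truncate 1.5 to 1 for an integer field.
        if declared is int and isinstance(value, float) and not value.is_integer():
            raise ValueError(f"{name}={value!r} would lose precision as int")
        coerced[name] = declared(value)
    return coerced
```

With this in front of every write, a duration of `3` is stored as `3.0` and a token count of `1200.0` as `1200`, so the store sees the same type for a field on every write.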
Alerting that does not become noise
The single hardest thing in observability is not the data; it is the discipline of alerting that fires only when something genuinely needs human attention.
Our rule for whether something gets a real alert: could a human do anything useful about it in the next ten minutes? If yes, alert. If no, log it and surface it on a dashboard.
What gets a real-time alert: a public endpoint failing for more than ninety seconds. A workflow that has not run on its expected cadence for more than three intervals. Disk usage above ninety percent. A daily AI cost exceeding twice the trailing thirty-day daily average.
What does not get a real-time alert: any single workflow execution failing. (The orchestration layer retries automatically. If the failure persists, the cadence rule above will catch it.) Slow queries below a threshold. Single-digit-percent error rates that have been steady for weeks.
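The four real-time rules above can be written as small predicates. This is a sketch in Python and not the syntax of any specific alerting tool; the function names, parameters, and the convention of passing timestamps in seconds are all assumptions for illustration.

```python
from statistics import fmean


def endpoint_down(failing_since, now, grace_s=90):
    """True once an endpoint has been failing for more than `grace_s` seconds."""
    return failing_since is not None and now - failing_since > grace_s


def cadence_missed(last_success, now, interval_s, max_missed=3):
    """True when more than `max_missed` expected intervals passed without a run."""
    return now - last_success > max_missed * interval_s


def disk_pressure(used_fraction, limit=0.90):
    """True when disk usage is above the limit (0.90 means ninety percent)."""
    return used_fraction > limit


def cost_spike(today_cost, trailing_daily_costs, factor=2.0):
    """True when today's AI cost exceeds `factor` times the trailing daily average."""
    if not trailing_daily_costs:
        return False  # no history yet, so there is no baseline to compare against
    return today_cost > factor * fmean(trailing_daily_costs)
```

Note that a single failed execution never appears here: the only workflow-level rule is `cadence_missed`, which fires after retries have had several intervals to succeed.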
The alerting layer that does not cry wolf is the alerting layer you actually look at when it fires. Building it conservatively is the highest-leverage thing you can do for the long-term operability of the stack.
What we deliberately do not capture
Sovereignty discipline applies to observability too. We deliberately do not capture: the full content of every workflow input or output (privacy and disk pressure), end-user behavioural analytics on the public site beyond first-party server logs (we do not sell this and it is intrusive to collect), or detailed performance traces at sub-second granularity (we have rarely needed them and the cost of the extra dimensionality is real).
The principle is that observability is for operating the stack, not for accumulating data. The smallest set of metrics that reliably answers the operational questions is the right set. Every additional metric is a maintenance cost, a storage cost, and a privacy decision. The minimal complete answer is the durable answer.
The takeaway
Self-hosted observability is not a heroic engineering effort in 2026. The components are mature, the protocols are stable, and the operating cost is the electricity to keep the workhorse on. The discipline that makes it durable is upstream: consistent instrumentation in every workflow, type safety on every field, alerts that fire only when a human can act, and a rule against capturing data you do not need.
The asset that comes out of this discipline is the dashboards themselves. They become the surface through which we read the operational state of the business at a glance, and they accumulate value every month as the historical comparison set deepens. Renting that capability is a perpetual tax. Building it is a one-time investment that pays back permanently.
Working on this?
For operators evaluating sovereign-infrastructure architecture for a business of meaningful scale, we run a quarterly cohort of stack-design engagements.