Observability Over Debugging
Observability is not a feature. It is a prerequisite.
The standard failure mode
A team builds a service. They ship it. It works. Three months later, it starts throwing errors at 3 AM. Someone spends four hours adding log statements, deploying, reproducing, and reading logs. They find the bug, fix it, and move on.
The four hours of debugging should have been ten minutes of dashboard reading. The difference is whether you designed for observability or bolted it on after the outage.
The three pillars (and the one nobody talks about)
Everyone knows the three pillars: logs, metrics, traces. The one nobody talks about is structured logging with correlation IDs.
A log line that says "Error processing request" is useless. A log line that says "error=timeout service=payment request_id=abc-123 user_id=456 duration_ms=5002" is actionable. The difference is thirty seconds of thought when writing the log statement.
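A minimal sketch of what that thirty seconds can look like in Python, using only the standard library. The names format_fields, log_event, and request_id_var are hypothetical helpers invented for this illustration, and the sketch assumes field values contain no spaces.

```python
import logging
from contextvars import ContextVar

# Hypothetical context variable: set once per incoming request, read by every log call.
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")


def format_fields(**fields) -> str:
    """Render fields as space-separated key=value pairs, in the order given."""
    return " ".join(f"{key}={value}" for key, value in fields.items())


def log_event(logger: logging.Logger, level: int, **fields) -> str:
    """Log one structured line, adding the current request_id if the caller gave none."""
    merged = dict(fields)
    merged.setdefault("request_id", request_id_var.get())
    line = format_fields(**merged)
    logger.log(level, line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    request_id_var.set("abc-123")  # normally done at the edge of the service
    log_event(logging.getLogger("payment"), logging.ERROR,
              error="timeout", service="payment", user_id=456, duration_ms=5002)
```

Because the correlation ID lives in a context variable, code deep in the call stack does not need it passed in, and every line for one request can be found with a single search.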
When to add observability
At design time. Not after the first outage.
When I write a design document, there is always a section on observability:
- What metrics will indicate this system is healthy?
- What alerts should fire when it is not?
- What correlation IDs will connect a request across services?
- What dashboards will the on-call engineer need?
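One way to make the answers to those questions concrete is to write them down as data in the design document itself. The sketch below is an assumption about how that might look, not a real monitoring API: AlertRule, RULES, firing, the metric names, the thresholds, and the dashboard names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str       # name of the metric the rule watches
    threshold: float  # the alert fires when the metric is strictly above this
    dashboard: str    # where the on-call engineer should look first


# Hypothetical rules for an imaginary payment service.
RULES = [
    AlertRule("p99_latency_ms", 2000.0, "payment-api-latency"),
    AlertRule("error_rate", 0.01, "payment-api-errors"),
]


def firing(rules, current):
    """Return the rules whose metric is present in current and above its threshold."""
    return [r for r in rules
            if r.metric in current and current[r.metric] > r.threshold]


if __name__ == "__main__":
    snapshot = {"p99_latency_ms": 5002.0, "error_rate": 0.002}
    for rule in firing(RULES, snapshot):
        print(f"ALERT metric={rule.metric} dashboard={rule.dashboard}")
```

In a real system these rules would live in the monitoring tool. The point of the sketch is only that each rule names its metric, its threshold, and its dashboard before the service ships.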
The Datadog lesson at Clipboard Health
At Clipboard Health, I built Datadog dashboards for API performance. The dashboards caught a slow query pattern weeks before it would have caused a production incident. The query was fine at current data volumes, but it was O(n) in the size of a table that was growing linearly, so its latency rose as the table grew. The dashboard showed the p99 latency trending upward. We fixed it on our schedule instead of at 3 AM.
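The reasoning behind reading a trend like that can be sketched in a few lines. This is not the Datadog setup described above, only an illustration of the idea: compute a p99 per period, fit a straight line, and estimate how long until it crosses a limit. The function names and the weekly numbers are invented for the example.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def periods_until(values, limit):
    """Periods until the fitted trend reaches limit, or None if it is not rising."""
    s = slope(values)
    if s <= 0:
        return None
    return max(0.0, (limit - values[-1]) / s)


if __name__ == "__main__":
    # Invented weekly p99 latencies in milliseconds, for illustration only.
    weekly_p99_ms = [100, 120, 140, 160]
    print(periods_until(weekly_p99_ms, limit=500))
```

A straight-line fit is a deliberate simplification that matches the linear growth in the story. A query whose cost grows faster than the table would need a different model.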
That is what observability buys you: problems on your schedule instead of the system’s schedule.