Zero-Downtime Patterns

At Clipboard Health, I migrated 2M+ healthcare documents from Cloudinary to S3. Nobody noticed, which was the point.

The pattern

Zero-downtime migration follows a sequence:

  1. Dual-write — New data goes to both the old and new system
  2. Backfill — Copy historical data to the new system
  3. Verify — Compare both systems to ensure consistency
  4. Switch — Route reads to the new system
  5. Cleanup — Remove the old system (after a grace period)

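Each step before cleanup can be rolled back on its own, and cleanup waits out a grace period precisely because it is the one step that cannot. At no point is there a big-bang cutover.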
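To make the first, second, and fourth steps concrete, here is a minimal sketch. It assumes an in-memory stand-in for both stores; `InMemoryStore`, `MigratingStore`, and `read_from_new` are hypothetical names for illustration, not the code used in the actual migration and not the Cloudinary or S3 APIs.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a real blob store."""
    blobs: dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def get(self, key: str) -> bytes:
        return self.blobs[key]


@dataclass
class MigratingStore:
    """Wraps both systems so callers never know a migration is running."""
    old: InMemoryStore
    new: InMemoryStore
    read_from_new: bool = False  # step 4 flips this; flipping it back is the rollback

    def put(self, key: str, data: bytes) -> None:
        # Step 1, dual-write: every new write lands in both systems.
        self.old.put(key, data)
        self.new.put(key, data)

    def get(self, key: str) -> bytes:
        # Step 4, switch: reads follow the flag, nothing else changes.
        return (self.new if self.read_from_new else self.old).get(key)

    def backfill(self) -> int:
        # Step 2, backfill: copy only keys the new system lacks, so a
        # fresher dual-written value is never overwritten by stale data.
        copied = 0
        for key, data in list(self.old.blobs.items()):
            if key not in self.new.blobs:
                self.new.put(key, data)
                copied += 1
        return copied


old = InMemoryStore({"license.pdf": b"v1"})   # historical document
store = MigratingStore(old=old, new=InMemoryStore())
store.put("resume.pdf", b"v2")                # dual-written
copied = store.backfill()                     # copies license.pdf only
store.read_from_new = True                    # switch reads
```

Because writes keep going to both systems after the switch, setting `read_from_new` back to `False` is a safe rollback in this sketch.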

Why not just a maintenance window?

For a healthcare staffing platform, downtime means nurses cannot access the documents they need for their shifts. There is no good time for a maintenance window when your service is 24/7.

But even for services where downtime is technically acceptable, zero-downtime migration is still the right default. It forces you to think about backward compatibility, data consistency, and rollback — things you should be thinking about anyway.

The verification step is everything

The backfill is the easy part. The verification is where the real work happens. You need to prove that the new system returns the same results as the old system for every possible query pattern. For 2M documents with different metadata schemas, access patterns, and edge cases, this is not trivial.
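One piece of that proof is a batch comparison of stored content. The sketch below is an illustration under simple assumptions: both systems are modeled as plain dictionaries, `find_mismatches` is a hypothetical helper, and only raw bytes are compared. A real check would also need to cover metadata and access patterns.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Content fingerprint used to compare documents across systems."""
    return hashlib.sha256(data).hexdigest()


def find_mismatches(old: dict[str, bytes], new: dict[str, bytes]) -> dict[str, str]:
    """Return a reason for every key that differs between the two systems."""
    problems: dict[str, str] = {}
    for key in old.keys() | new.keys():
        if key not in new:
            problems[key] = "missing_in_new"
        elif key not in old:
            problems[key] = "missing_in_old"
        elif checksum(old[key]) != checksum(new[key]):
            problems[key] = "content_differs"
    return problems


# Made-up sample data, not real documents.
old_docs = {"a.pdf": b"alpha", "b.pdf": b"beta", "c.pdf": b"gamma"}
new_docs = {"a.pdf": b"alpha", "b.pdf": b"BETA"}
mismatches = find_mismatches(old_docs, new_docs)
```

Recording a reason per key, rather than a single pass/fail, makes it easier to tell a backfill gap from a corrupted copy.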

I wrote a shadow verification pipeline that ran every read against both systems and compared results. When the diff rate hit zero across a week of production traffic, we switched.
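The shape of such a shadow read can be sketched as follows. This is not the original pipeline: `ShadowReader`, `read_old`, and `read_new` are hypothetical names, and the two systems are again modeled as dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ShadowReader:
    """Serves reads from the old system while comparing against the new one."""
    read_old: Callable[[str], Optional[bytes]]
    read_new: Callable[[str], Optional[bytes]]
    reads: int = 0
    diffs: int = 0

    def get(self, key: str) -> Optional[bytes]:
        primary = self.read_old(key)  # callers only ever see this result
        self.reads += 1
        try:
            if self.read_new(key) != primary:
                self.diffs += 1
        except Exception:
            # A failure in the new system counts as a diff and never
            # reaches the caller.
            self.diffs += 1
        return primary

    @property
    def diff_rate(self) -> float:
        return self.diffs / self.reads if self.reads else 0.0


# Made-up sample data: one matching document, one that differs.
old_docs = {"a.pdf": b"alpha", "b.pdf": b"beta"}
new_docs = {"a.pdf": b"alpha", "b.pdf": b"BETA"}
reader = ShadowReader(read_old=old_docs.get, read_new=new_docs.get)
result_a = reader.get("a.pdf")
result_b = reader.get("b.pdf")
```

In production the shadow read would usually run off the request path so it adds no latency; this sketch keeps it inline for clarity.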

The uncomfortable truth

Zero-downtime migration takes 3-4x longer than a maintenance window. It is worth it because the risk profile is fundamentally different. A maintenance window is a high-stakes, irreversible event. A gradual migration is a series of low-stakes, reversible steps.