Migrating Petabytes Across Four Clouds Without a War Room

You do not need a war room to move petabytes across four clouds. The first thing I killed on this migration was the one somebody had already booked.

A wall of dashboards, a bridge call that would run forty hours, sandwiches ordered for a team that would not sleep all cutover weekend. It is the standard picture of a big migration, and everyone treats it as a sign of seriousness. I treated it as a sign that the plan was bad. If you need a room full of people staring at screens and praying that nothing breaks, you have already lost, and the war room is just where you find that out in real time.

We were moving a consumer-internet business off one cloud and across four targets: two hyperscalers, a regional cloud for a market with data rules, and a colo for the things that genuinely could not move yet. Hundreds of microservices. Hundreds of databases, multi-petabyte in total. The kind of estate where nobody alive holds the whole dependency graph in their head, and the parts of it that are written down are wrong.

We finished ahead of schedule. Being boring got us there faster than being fast ever could have.

The waves, and the rule that made them work

The whole thing ran as waves. Each wave was a slice of the estate you could move, verify, and run on the new side while the old side stayed up and warm behind it. Stateless services first, because they are cheap to move and cheap to undo. Then the data, the slow expensive part, copied and kept in sync long before anything pointed at it. Cutover last, and only for the slice that had already proven itself.

Figure: Three waves, four targets, and the rollback arrow back to the source estate that stayed live until a wave earned the right to lose it.

The rule that made it boring was simple to state and annoying to enforce: every wave is reversible until it is proven. Not reversible in theory, but reversible as in we had run the rollback, on that wave, in a rehearsal, and watched traffic land back on the source with our own eyes. A wave that could not be rolled back was not allowed to ship; it went back to planning until someone built the path home.
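
A rule like this is easier to enforce when it lives in tooling rather than in a meeting. The sketch below is one hypothetical way to encode it, not the system described here: the `Wave` class, the `WaveState` names, and every method are assumptions made for illustration. It shows a wave that refuses to ship without a rehearsed rollback and that can be rolled back only while it has not yet been marked proven.

```python
from dataclasses import dataclass
from enum import Enum


class WaveState(Enum):
    PLANNED = "planned"   # not yet live on the target
    SHIPPED = "shipped"   # live on the target, source still up and warm
    PROVEN = "proven"     # rollback path retired


@dataclass
class Wave:
    """Hypothetical wave record; names are illustrative only."""
    name: str
    state: WaveState = WaveState.PLANNED
    rollback_rehearsed: bool = False

    def rehearse_rollback(self) -> None:
        # Assumption: called only after a real rehearsal in which
        # traffic was observed landing back on the source.
        self.rollback_rehearsed = True

    def ship(self) -> None:
        if self.state is not WaveState.PLANNED:
            raise RuntimeError(f"{self.name}: already {self.state.value}")
        if not self.rollback_rehearsed:
            raise RuntimeError(f"{self.name}: no rehearsed rollback, back to planning")
        self.state = WaveState.SHIPPED

    def roll_back(self) -> None:
        if self.state is not WaveState.SHIPPED:
            raise RuntimeError(f"{self.name}: not inside its reversible window")
        self.state = WaveState.PLANNED
        # Design choice of this sketch: a rolled-back wave usually changes
        # before it ships again, so it must rehearse its rollback again.
        self.rollback_rehearsed = False

    def mark_proven(self) -> None:
        if self.state is not WaveState.SHIPPED:
            raise RuntimeError(f"{self.name}: only a shipped wave can be proven")
        self.state = WaveState.PROVEN
```

Calling `ship()` on a fresh `Wave` raises, and so does `roll_back()` once `mark_proven()` has run. That makes the end of the reversible window an explicit, recorded step rather than something that fades away on its own.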

That sounds obvious, but in practice it is the line most migrations cross without noticing. They get three waves deep, the rollback story becomes “we’d never need to,” and on the night they need to, they find out it was never real.

Data is a sync problem, not a copy problem

The naive mental model of moving a petabyte is a copy. You snapshot, you ship, you point the app at the new place. It is wrong in the way that gets you paged.

The real shape is a long period of replication where both sides are live and the new side is catching up and staying caught up, measured by lag, not by percent complete. For each database we stood up replication into the target, let it converge, and then watched it hold under real write load for days. Only when lag stayed flat and small did that database become eligible for cutover. The cutover itself was almost nothing: stop writes for a beat, let the last bytes drain, flip the connection string, let writes resume on the new side. The drama was all in the weeks of watching a lag graph, which is exactly where you want the drama to be.
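
"Flat and small for days" is a statement about a window of samples, not about a single reading. The sketch below shows one way that could be checked, assuming lag is collected as timestamped samples. `LagSample`, `lag_held`, and all of the thresholds are hypothetical, not values or tooling from this migration. It also treats a gap in monitoring as a failure, on the assumption that a window with missing data has not demonstrated anything.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LagSample:
    timestamp: float     # seconds since epoch
    lag_seconds: float   # replication lag observed at that moment


def lag_held(
    samples: Sequence[LagSample],
    *,
    now: float,
    window_seconds: float,
    max_lag_seconds: float,
    max_gap_seconds: float,
) -> bool:
    """True only if every sample in the window ending at `now` is at or
    below `max_lag_seconds` and the window has no monitoring gaps."""
    window_start = now - window_seconds
    recent = sorted(
        (s for s in samples if window_start <= s.timestamp <= now),
        key=lambda s: s.timestamp,
    )
    if not recent:
        return False  # no data is not the same as no lag

    # Samples must cover the whole window, edge to edge.
    edges = [window_start] + [s.timestamp for s in recent] + [now]
    if any(later - earlier > max_gap_seconds
           for earlier, later in zip(edges, edges[1:])):
        return False

    return all(s.lag_seconds <= max_lag_seconds for s in recent)
```

With a multi-day `window_seconds`, a single spike anywhere in the window keeps the database ineligible until the spike ages out. That is the behavior the paragraph above describes: eligibility is earned by holding steady, not by touching a low number once.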

We wrote the gate into code, so a wave earned its cutover by clearing the bar rather than by anyone feeling ready.

def wave_ready(wave):
    # every check has to pass. one false and the wave waits.
    # "felt ready" is not in this function on purpose.
    for db in wave.databases:
        if db.replication_lag_seconds > 5:
            return False, f"{db.name}: lag still {db.replication_lag_seconds}s"
        if not db.checksums_match():      # row-level, sampled, on real tables
            return False, f"{db.name}: source/target checksum drift"
    if not wave.rollback_rehearsed:       # we actually ran it, not "should work"
        return False, "rollback not rehearsed for this wave"
    if wave.depends_on and not all(d.cut_over for d in wave.depends_on):
        return False, "upstream dependency still on the old side"
    return True, "ship it"

None of this is clever, which is the point. Cleverness in a migration is a liability; what you want is the same dull checklist applied to wave after wave until the whole estate has crossed.
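
The line in that gate with the most hiding behind it is `checksums_match()`. The sketch below shows one way a sampled, row-level comparison could work. It is an assumption-heavy illustration, not the check we ran: `row_digest`, `sampled_drift`, and the two fetch callables are hypothetical, rows are modeled as plain dictionaries, and the `repr`-based canonical form would need replacing with something type-aware before comparing two different database engines.

```python
import hashlib
import random
from typing import Callable, List, Mapping, Optional, Sequence

Row = Mapping[str, object]
FetchRow = Callable[[int], Optional[Row]]  # primary key -> row, or None if absent


def row_digest(row: Row) -> str:
    # Sort columns so the digest does not depend on column order.
    canonical = "\x1f".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def sampled_drift(
    keys: Sequence[int],
    fetch_source: FetchRow,
    fetch_target: FetchRow,
    sample_size: int,
    seed: int = 0,
) -> List[int]:
    """Return the sampled primary keys whose rows differ between sides."""
    rng = random.Random(seed)  # seeded so a failing sample can be replayed
    sample = rng.sample(list(keys), min(sample_size, len(keys)))
    drifted = []
    for key in sample:
        src, dst = fetch_source(key), fetch_target(key)
        if src is None or dst is None:
            if src is not dst:       # present on one side only
                drifted.append(key)
            continue                 # absent on both sides: nothing to compare
        if row_digest(src) != row_digest(dst):
            drifted.append(key)
    return sorted(drifted)
```

An empty result means the sample found no drift, not that none exists. Rows written during the comparison can also differ briefly while replication catches up, so a check like this is probably best run against rows older than the current lag, or repeated before it is believed.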

The dependency that never made the diagram

The wave that nearly went sideways depended on something no diagram showed.

We had a wave of services we were sure were independent. They had no shared database, no obvious call between them, nothing in the service catalog tying them together. We moved them on a Friday evening, the calm slot we used precisely because we expected nothing. Within an hour the error rate on a different cluster, in a wave we had finished weeks earlier, started climbing.

The link was a cache. One of the moved services warmed a shared cache that the older, already-migrated services read from on the assumption it would always be populated. Move the writer to a new network, add a few milliseconds and a different egress path, and the cache started going cold in a way it never had. The readers did not error on a missing cache. They fell back to the source of truth, hammered it, and the latency bled sideways into things that had looked done and dusted.

Dependency surprises never live in the dependency you can see. They live in the implicit contract, the assumption one team made about another team’s behavior and never wrote down, the “it’s always just been there” that nobody can point to until it is gone. A service catalog will not catch those. Only traffic does.
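
If only traffic reveals these contracts, the traffic has to be turned into something comparable with the catalog. The sketch below is one hypothetical approach, assuming access records can be gathered from cache, queue, or network telemetry; `Access`, `implicit_dependencies`, and the service names in the usage note are invented for illustration. It pairs every reader of a shared resource with every writer of it and reports the pairs the catalog does not declare.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Access(NamedTuple):
    service: str
    resource: str   # e.g. a cache namespace or a queue name
    op: str         # "read" or "write"


def implicit_dependencies(
    accesses: Iterable[Access],
    declared: Set[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (reader, writer, resource) triples observed in traffic whose
    (reader, writer) pair is missing from the declared catalog edges."""
    writers: Dict[str, Set[str]] = defaultdict(set)
    readers: Dict[str, Set[str]] = defaultdict(set)
    for access in accesses:
        if access.op == "write":
            writers[access.resource].add(access.service)
        elif access.op == "read":
            readers[access.resource].add(access.service)

    found = set()
    for resource, reading_services in readers.items():
        for reader in reading_services:
            for writer in writers.get(resource, ()):
                # A service reading its own writes is not a cross-service contract.
                if reader != writer and (reader, writer) not in declared:
                    found.add((reader, writer, resource))
    return sorted(found)
```

Given one record of a hypothetical `feed-warmer` writing `cache:profiles` and another of `timeline-api` reading it, with an empty catalog, the function returns that single pair. That is the shape of the cache link described above: two services that never call each other, joined by something one fills and the other expects to find full.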

What saved us was the rule, not any heroics. That wave was still inside its reversible window, so we did the unglamorous thing: we rolled the Friday wave back to the source, the cache warmed the old way again, the readers recovered, and we went home. (Nobody ordered sandwiches.) On Monday we made the cache dependency explicit, decided the writer and its hidden readers were actually one wave, and moved them together the following week. It cut over without anyone noticing, which is the only review a cutover should ever get.
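
Deciding that a writer and its hidden readers are "actually one wave" amounts to grouping services by the couplings that cannot survive a split. The sketch below shows one way to compute such groups with a union-find over service pairs. The function name and the service names in the usage note are hypothetical, and this is not a description of how we planned waves.

```python
from typing import Dict, Iterable, List, Tuple


def group_into_waves(
    services: Iterable[str],
    must_move_together: Iterable[Tuple[str, str]],
) -> List[List[str]]:
    """Group services so that every coupled pair lands in the same wave."""
    parent: Dict[str, str] = {s: s for s in services}

    def find(x: str) -> str:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in must_move_together:
        parent.setdefault(a, a)   # tolerate services missing from the list
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for service in parent:
        groups.setdefault(find(service), []).append(service)
    return sorted(sorted(group) for group in groups.values())
```

With hypothetical services `feed-warmer`, `timeline-api`, `profile-api`, and `billing`, and both APIs coupled to `feed-warmer`, this yields one three-service group and `billing` on its own. The judgment lies in choosing which edges to feed in: if every observed dependency counts, a large estate tends to collapse into one giant wave, so only couplings that break when the two sides are split should qualify.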

If that wave had not been reversible, that Friday would have been a war room: forty hours, a bridge call, a postmortem with the word “unprecedented” in it. Instead it was a rollback and an early night.

Why ahead of schedule

People assume the reversible discipline slows you down. All those rehearsals, all that waiting on lag graphs, all those waves you could have batched into one bold weekend. It looks like overhead.

The opposite is true: a migration timeline dies in recovery, not in the move. One bad cutover with no way back eats weeks: the incident, the data reconciliation, the trust you have to rebuild with every team whose service you broke, the new caution that makes everyone slow-walk the next ten waves. We never paid that tax because we never took that risk. A rollback cost us an evening and a calendar slip on one wave; a failed irreversible cutover would have cost us the quarter.

Boring also compounds: each clean wave made the next one more routine, the runbook a little tighter, the team a little more willing to move at a steady clip because they knew the floor was there. We were not rushing toward a single terrifying date; we were turning a crank, and a crank you trust is a crank you can turn faster.

A migration that needs a war room has already failed in planning. That room exists so a team can survive a plan that assumed nothing would break. A plan built the other way, with every step walkable back, never has to reserve it.

We woke up one morning and the old estate was empty, and the strangest part was how little anyone remembered about the day it happened.