The first thing I killed was the war room.
Someone had already booked it. A wall of dashboards, a bridge call that would run forty hours, sandwiches ordered for the team that would not sleep through the cutover weekend. It is the standard picture of a big migration, and everyone treats it as a sign of seriousness. I treated it as a sign that the plan was bad. If you need a room full of people staring at screens praying nothing breaks, you have already lost. You are just finding out in real time.
We were moving a consumer-internet business off one cloud and across four targets: two hyperscalers, a regional cloud for a market with data rules, and a colo for the things that genuinely could not move yet. Hundreds of microservices. Hundreds of databases, multi-petabyte in total. The kind of estate where nobody alive holds the whole dependency graph in their head, and the parts of it that are written down are wrong.
We finished ahead of schedule. Not because we were fast. Because we were boring.
The waves, and the rule that made them work
The whole thing ran as waves. Each wave was a slice of the estate you could move, verify, and run on the new side while the old side stayed up and warm behind it. Stateless services first, because they are cheap to move and cheap to undo. Then the data, the slow expensive part, copied and kept in sync long before anything pointed at it. Cutover last, and only for the slice that had already proven itself.
[Figure: three waves, four targets, and the rollback arrow that stayed live until a wave earned the right to lose it.]
The rule that made it boring was simple to state and annoying to enforce: every wave is reversible until it is proven. Not “reversible in theory.” Reversible as in we had run the rollback, on that wave, in a rehearsal, and watched traffic land back on the source with our own eyes. A wave that could not be rolled back was not allowed to ship. It went back to planning until someone built the path home.
That sounds obvious. In practice it is the line most migrations cross without noticing. They get three waves deep, the rollback story quietly becomes “we’d never need to,” and then on the night they need to, they find out it was never real.
Data is a sync problem, not a copy problem
The naive mental model of moving a petabyte is a copy. You snapshot, you ship, you point the app at the new place. It is wrong in the way that gets you paged.
The real shape is a long period of replication where both sides are live and the new side is catching up and staying caught up, measured by lag, not by percent complete. For each database we stood up replication into the target, let it converge, and then watched it hold under real write load for days. Only when lag stayed flat and small did that database become eligible for cutover. The cutover itself was almost nothing: stop writes for a beat, let the last bytes drain, flip the connection string, let writes resume on the new side. The drama was all in the weeks of watching a lag graph, which is exactly where you want the drama to be.
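One way to pin down "flat and small for days" is as a check over lag samples instead of a glance at a graph. This is a minimal sketch, not a record of any real tooling: the function name `lag_held`, the sample format, the three-day hold window, and the five-minute gap limit are all assumptions made for illustration. Only the 5-second ceiling echoes the gate that follows.

```python
from datetime import datetime, timedelta


def lag_held(samples, max_lag_s=5.0, hold=timedelta(days=3),
             max_gap=timedelta(minutes=5)):
    """Return True if replication lag stayed at or under max_lag_s for the
    most recent `hold` period, with no monitoring gap longer than max_gap.

    samples: list of (timestamp, lag_seconds) tuples, sorted oldest first.
    Every threshold here is an illustrative assumption.
    """
    if not samples:
        return False

    end = samples[-1][0]
    start = end - hold

    # The history has to reach back at least as far as the hold window.
    if samples[0][0] > start:
        return False

    # A hole in the data counts as a failure: no data is not good data.
    # Pairs are checked if they end inside the window, so a gap that
    # straddles the window start is caught too.
    for (t0, _), (t1, _) in zip(samples, samples[1:]):
        if t1 >= start and t1 - t0 > max_gap:
            return False

    # Every sample inside the window must be under the ceiling.
    return all(lag <= max_lag_s for t, lag in samples if t >= start)
```

The point of the shape is that "percent complete" never appears. A database with one spike inside the window, or with too little history, simply is not eligible yet.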
Here is the gate we used, roughly. A wave did not cut over because a human felt ready. It cut over because it cleared the bar.
def wave_ready(wave):
    # every check has to pass. one false and the wave waits.
    # "felt ready" is not in this function on purpose.
    for db in wave.databases:
        if db.replication_lag_seconds > 5:
            return False, f"{db.name}: lag still {db.replication_lag_seconds}s"
        if not db.checksums_match():  # row-level, sampled, on real tables
            return False, f"{db.name}: source/target checksum drift"
    if not wave.rollback_rehearsed:  # we actually ran it, not "should work"
        return False, "rollback not rehearsed for this wave"
    if wave.depends_on and not all(d.cut_over for d in wave.depends_on):
        return False, "upstream dependency still on the old side"
    return True, "ship it"
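
The dependency check in that gate implies an order: a wave can only clear the bar after everything upstream of it has cut over. A sketch of how that order could be computed ahead of time, using the standard library's `graphlib` (Python 3.9+). The `Wave` dataclass here is a hypothetical stand-in that stores upstream waves by name, and the wave names in the usage are invented.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class Wave:
    # Hypothetical stand-in: upstream waves are referenced by name.
    name: str
    depends_on: list = field(default_factory=list)


def cutover_order(waves):
    """Return wave names so that every wave appears after its upstreams.

    Raises graphlib.CycleError if the waves depend on each other in a loop.
    """
    # TopologicalSorter expects a mapping of node -> its predecessors.
    graph = {w.name: set(w.depends_on) for w in waves}
    return list(TopologicalSorter(graph).static_order())


plan = [
    Wave("checkout", depends_on=["payments-db"]),
    Wave("payments-db"),
    Wave("reports", depends_on=["checkout"]),
]
order = cutover_order(plan)
# payments-db comes before checkout, and checkout before reports.
```

A `CycleError` is worth treating as a planning signal, not a bug: waves that depend on each other in a loop are probably one wave drawn as two.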
None of this is clever. That is the point. The cleverness in a migration is a liability. You want the same dull checklist applied to wave after wave until the whole estate has crossed.
The dependency nobody drew
The part that nearly went sideways was, of course, the part nobody had on a diagram.
We had a wave of services we were sure were independent. They had no shared database, no obvious call between them, nothing in the service catalog tying them together. We moved them on a Friday evening, the calm slot we used precisely because we expected nothing. Within an hour the error rate on a different cluster, in a wave we had finished weeks earlier, started climbing.
The link was a cache. One of the moved services warmed a shared cache that the older, already-migrated services read from on the assumption it would always be populated. Move the writer to a new network, add a few milliseconds and a different egress path, and the cache started going cold in a way it never had. The readers did not error on a missing cache. They fell back to the source of truth, hammered it, and the latency bled sideways into things that had looked done and dusted.
Here is the part nobody tells you about dependency surprises. The surprise is never in the dependency you can see. It is in the implicit contract, the thing one team assumed about another team’s behavior and never wrote down, the “it’s always just been there” that nobody can point to until it is gone. No service catalog catches those. Only traffic does.
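To make "only traffic does" concrete: if you can get records of which service reads and writes which shared resource, you can derive the dependencies the catalog never listed. This is a sketch under assumptions. Where the access records come from (traces, flow logs, cache client metrics) is left open, and every service and resource name below is invented.

```python
from collections import defaultdict


def implied_dependencies(accesses):
    """Derive (reader, writer) pairs from shared-resource access records.

    accesses: iterable of (service, resource, mode), with mode "r" or "w".
    A reader depends on a writer when both touch the same resource.
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for service, resource, mode in accesses:
        (writers if mode == "w" else readers)[resource].add(service)

    edges = set()
    for resource, resource_writers in writers.items():
        for writer in resource_writers:
            for reader in readers.get(resource, ()):
                if reader != writer:  # a service reading its own writes is not a surprise
                    edges.add((reader, writer))
    return edges


def undeclared(accesses, catalog_edges):
    """Return dependencies seen in traffic that the catalog does not list."""
    return sorted(implied_dependencies(accesses) - set(catalog_edges))


# Hypothetical observations: one writer warms a cache two other services read.
observed = [
    ("recs-writer", "shared-cache", "w"),
    ("feed", "shared-cache", "r"),
    ("search", "shared-cache", "r"),
    ("feed", "feed-db", "r"),
    ("feed", "feed-db", "w"),
]
catalog = [("search", "recs-writer")]
surprises = undeclared(observed, catalog)
```

This only finds couplings that show up as shared reads and writes. An assumption like "the cache is always warm" still needs a human to recognise it as a contract.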
What saved us was not heroics. It was the rule. That wave was still inside its reversible window, so we did the unglamorous thing: we rolled the Friday wave back to the source, the cache warmed the old way again, the readers recovered, and we went home. (Nobody ordered sandwiches.) On Monday we made the cache dependency explicit, decided the writer and its hidden readers were actually one wave, and moved them together the following week. It cut over without anyone noticing, which is the only review a cutover should ever get.
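Deciding that a writer and its hidden readers are one wave is, mechanically, a connected-components problem: services joined by a hard coupling travel together. A minimal sketch using union-find. The function name and the service names are hypothetical, and it assumes someone has already decided which couplings are hard enough to force a joint move.

```python
def group_into_waves(services, couplings):
    """Group services so that any two joined by a coupling share a wave.

    services: iterable of service names.
    couplings: iterable of (a, b) pairs that must move together.
    Returns a sorted list of sorted name lists, one list per wave.
    """
    names = set(services) | {s for pair in couplings for s in pair}
    parent = {s: s for s in names}

    def find(s):
        # Walk to the root, halving the path as we go.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in couplings:
        parent[find(a)] = find(b)

    groups = {}
    for s in names:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(group) for group in groups.values())


waves = group_into_waves(
    ["recs-writer", "feed", "search", "billing"],
    [("feed", "recs-writer"), ("search", "recs-writer")],
)
# "billing" stays on its own; the writer and both readers form one wave.
```

Not every dependency belongs in `couplings`. Feed in only the ones where moving one side alone changes behaviour for the other, or every wave collapses into one bold weekend again.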
If that wave had not been reversible, that Friday would have been a war room. Forty hours, a bridge call, a postmortem with the word “unprecedented” in it. Instead it was a rollback and an early night.
Why ahead of schedule
People assume the reversible discipline slows you down. All those rehearsals, all that waiting on lag graphs, all those waves you could have batched into one bold weekend. It looks like overhead.
It is the opposite. The thing that blows migration timelines is not the moving. It is the recovery. One bad cutover with no way back eats weeks: the incident, the data reconciliation, the trust you have to rebuild with every team whose service you broke, the new caution that makes everyone slow-walk the next ten waves. We never paid that tax because we never took that risk. A rollback cost us an evening and a calendar slip on one wave. A failed irreversible cutover would have cost us the quarter.
Boring compounds. Each clean wave made the next one more routine, the runbook a little tighter, the team a little more willing to move at a steady clip because they knew the floor was there. We were not rushing toward a single terrifying date. We were turning a crank, and a crank you trust is a crank you can turn faster.
A migration that needs a war room has already failed planning. The war room is where you go to survive a plan that assumed nothing would go wrong. Build the plan that assumes everything will, make every step one you can walk back, and the heroic weekend never comes. You just wake up one morning and the old estate is empty, and the strangest part is how little anyone remembers about the day it happened.