What Happens When a Migration Crashes Halfway?
Picture a 400 GB bulk load at 70%. Then the network blips, the box reboots, or someone fat-fingers Ctrl-C. The data already moved is fine — the scary part is what happens when you press start again.
Two ways a restart goes wrong
Most tooling fails one of two ways on resume:
- Start over. Safe but brutal — you re-copy hundreds of gigabytes you already moved, turning a 20-minute finish into a multi-hour redo, often inside a tight cutover window.
- Resume blindly. Faster, but if the bookkeeping is even slightly off it re-loads a chunk that already committed — silently duplicating rows.
The second one is the dangerous failure: it doesn’t error, it just leaves you with more rows than the source. The right answer is neither — it’s idempotency: a resume should skip exactly what finished and redo exactly what didn’t.
The race that duplicates rows
Bulk loaders split a table into chunks and run them in parallel. Each chunk does two things: it COPYs its slice into the target and commits, and it records itself as “done” in a checkpoint so a resume knows to skip it. The bug hides in the gap between those two steps:
worker copies chunk 7 ........ COMMIT ✅ (rows are now in the target)
← CRASH HERE
checkpoint: mark chunk 7 done ........ ❌ never ranThe rows are committed in the database, but the checkpoint — a separate file — never recorded it. On resume, chunk 7 looks unfinished, so it runs again and appends its rows a second time. Because the checkpoint and the data live in two different places, they can disagree the moment a crash lands between them.
Why a missing primary key makes it invisible
If the table has a primary key or unique index, the duplicate insert fails loudly — a constraint violation you’ll notice. But plenty of real tables have no unique key: append-only event logs, audit trails, staging tables, wide denormalized marts. There, the duplicate rows just… land. Row counts quietly drift above the source, and the first time you find out is when a downstream SUM() doesn’t reconcile weeks later.
The fix: make “committed” and “done” one fact
The cure is to stop keeping the data and the “done” flag in two places that can fall out of sync. Instead, write a small per-chunk done-marker into the target database inside the same transactionas the chunk’s COPY:
BEGIN
COPY chunk 7 rows → target table
INSERT (job, table, chunk 7) → __migrate_chunks -- the done-marker
COMMIT -- data and marker commit together, or not at allNow the two facts are atomic. If the commit succeeded, the marker is there; if the crash hit mid-chunk, the whole transaction rolled back — no rows, no marker. On resume the loader simply asks the target which chunks are marked done and skips exactly those; everything else re-runs from clean. A chunk can never be both “in the data” and “not recorded” again.
Two details make it robust: a fresh (non-resume) run clears its own markers first, so re-running a migration from scratch never wrongly skips work; and a chunk interrupted before commit leaves nothing behind, so it re-copies safely.
What this means for your cutover
Interruptions during a big migration aren’t an edge case — over a long enough load, something will hiccup. Crash-safe resume turns that from a heart-stopping moment into a non-event: hit start again, the finished chunks are skipped, the unfinished ones redo, and the final row counts match the source exactly. That’s the difference between a migration you have to babysit and one you can trust to pick itself back up.
It pairs with the other half of trust — proving the data arrived intact. Idempotent resume guarantees you don’t duplicate or lose rows on a restart; per-partition validation proves, after the fact, that every row matches. Together they’re what let you cut over without crossing your fingers.