Engineering

How DBMigrateAIPro Mathematically Proves Your Data is Correct

Rakesh Mamidala·Founder & Lead Engineer··10 min read

The Trust Problem in Every Migration

Every database migration ends the same way. The tool prints a number — "copied 47,328,914 rows" — and you have to decide whether to believe it. Row counts match? Great. But row counts match even when every numeric column has been silently truncated, every timestamp has shifted by an hour, every NULL has become an empty string.

The traditional answer is full-row checksum validation: re-read every row from both sides, hash them, compare. That works — but it takes hours on enterprise-scale tables, and most teams skip it under deadline pressure. The result: post-cutover discoveries weeks later, when finance reports come back wrong.

DBMigrateAIPro takes a different approach. Per-partition Merkle roots, computed inline during the COPY loop on both source and target, compared in seconds. Mathematical proof that every column of every row landed correctly — at O(log N) cost instead of O(N). On a 1-billion-row table, that's hours collapsing to seconds. It's the headline v1.1 feature, and it's the moment most migration audiences sit up straight.

A 60-Second Refresher on Merkle Trees

A Merkle tree is just a hash tree: every leaf is the hash of a value, every internal node is the hash of its children concatenated, and the root is the hash of everything below it. Two key properties make them useful:

  • Compact equality check. If two trees have the same root hash, every leaf is identical. One 32-byte comparison settles the question for arbitrary data sizes.
  • Fast localisation of differences. When roots differ, walk down the tree — the diverging sub-tree narrows the search to exactly which leaves disagree. O(log N) work to find the needle in the haystack.

Bitcoin uses Merkle trees to prove transactions are in a block. Git uses them to prove file trees haven't changed. We use them to prove your migrated rows match the source.

Our Tree Shape — Flat, Not Deep

Textbook Merkle trees are binary: every node has two children, depth grows logarithmically with N. For a 1B-row partition, that's a 30-deep tree. Walking 30 levels deep on a mismatch is fine, but it's also more complexity than the problem needs.

Our tree is intentionally flat: 1,024 leaves per group, two levels deep. For a partition with 10 million rows, that's ~10,000 group hashes (~320 KB) and a single 32-byte root. The mismatch path is root → group → individual row — exactly the debugging journey a customer wants. They don't want to traverse a 30-level abstract tree; they want to see the divergent row.

text
Partition root  (32 bytes — SHA-256 of all group hashes concatenated)
   │
   ├── group_0     (32 bytes — SHA-256 of first 1024 row hashes concatenated)
   │   ├── row_0      (32 bytes — SHA-256 of canonical row form)
   │   ├── row_1
   │   └── ... (1024 rows per group)
   ├── group_1
   ├── group_2
   └── group_N      (last group may have fewer than 1024 rows)

The Canonical-Form Problem

For two rows to hash identically across Oracle and PostgreSQL, they have to be byte-identicalwhen serialised. That sounds obvious but is the hardest part of the entire system — a single formatting difference and every hash is wrong on one side. Our canonical form rules:

  • Text / VARCHAR / CHAR — raw UTF-8 bytes. No length-padding (Oracle CHAR right-pads; PG doesn't — both sides strip before hashing).
  • NUMERIC — IEEE decimal canonical string. "1.5" not "1.50". "0" not "0.0".
  • TIMESTAMP / TIMESTAMPTZ — ISO-8601 normalised to UTC with explicit Z suffix. Both sides convert at hash time.
  • BYTEA / BLOB / RAW — raw bytes.
  • BOOLEAN — "t" or "f" (PG style), Oracle's NUMBER(1) values mapped at the type-mapping layer.
  • NULL — literal \x00 byte. Distinct from empty string "".

The single biggest footgun in any implementation: NULL vs empty string. (NULL, "foo") and ("", "foo") are different rows that some naive serialisations would collapse into the same byte sequence. Our canonical form distinguishes them, and our test suite (158 tests in the Merkle module alone) has fixtures locking that down.

Computed Inline, Not in a Second Pass

Implementations typically compute Merkle roots in a separate pass after migration — open both databases, re-read everything, hash, compare. That doubles I/O. We do it inline.

During the COPY loop that streams rows from source to target, the same row goes through three cheap operations: hashed for the source root, transformed by the type mapper, hashed for the target root. Hash cost is ~2 µs per row on modern CPUs — negligible against the network I/O the row was already going to incur. Source and target roots are both ready the moment the partition finishes loading. The compare is a 32-byte equality check.

python
# Inside the COPY loop, simplified
source_hasher = MerkleHasher(leaves_per_group=1024)
target_hasher = MerkleHasher(leaves_per_group=1024)

for row in source_extractor.stream():
    source_hasher.add_row(row)          # ~2 µs
    transformed = type_mapper(row)
    target_hasher.add_row(transformed)  # ~2 µs
    target_loader.copy_row(transformed) # the actual I/O — dominates

source_root = source_hasher.finalize()
target_root = target_hasher.finalize()

if source_root.merkle_root == target_root.merkle_root:
    # done. mathematical proof every row landed correctly.
    log("Partition verified — roots match")
else:
    # drill down to find the divergent group(s)
    diverge_groups(source_root, target_root)

Drill-Down: Finding the Needle

When roots differ — and they will, in real migrations: a NULL got dropped, a precision got lost, a timezone shifted — the diagnostic experience matters. Walking from "something is wrong somewhere in 50 million rows" to the exact offending row is what makes the difference between "the tool flagged it" and "we can fix this in 5 minutes".

Our drill-down is two clicks. Open the compare report, see which partitions are red. Open a red partition, see which group hashes diverged. The divergent group has 1,024 rows; we write them to a per-partition CSV (source rows + target rows side by side) so the engineer can diff them directly. Total time from "tool says mismatch" to "here's row 47,328,514, the salary column was rounded": under a minute.

What Every Other Tool Doesn't Do

The honest competitive landscape. The major migration tools all stop at "row count matches" — which is structurally insufficient for proving data correctness:

CapabilityDBMigrateAIProAWS DMSGoldenGateStriim / Fivetran
Per-partition cryptographic verification
O(log N) compare cost on mismatch
Drill-down to the exact divergent rows
Auditable proof JSON saved with the runpartial
Row-count "validation"
Full-row checksum comparison (O(N))

Cryptographic verification is genuinely novel in the migration-tool space. It compounds with the rest of our pitch: we don't just claim 95% automated conversion — we mathematically prove the 95% landed correctly.

The Audit Trail You Can Hand to a Customer

Every Merkle run writes three artefacts to 05_validation/ in the migration project:

  • source_roots.json — per-partition source-side Merkle roots, signed with run metadata
  • target_roots.json — same shape, target side
  • compare_report.html + .json — human-readable + machine-readable summary, partition-by-partition status, divergence drill-down where applicable

For a regulated customer (financial services, healthcare, public sector), these artefacts are the evidence pack. The auditor opens compare_report.html, sees green across every partition, the case closes. For internal teams, they're the receipt — the "here's why we're confident" proof that survives leadership turnover and post-mortems.

How to Turn It On

Single CLI flag, single GUI checkbox. From the command line:

bash
python main.py --no-ui migrate \
  --source oracle --source-host orahost --source-db PROD \
  --source-user demo_src --source-pass <pwd> \
  --target postgres --target-host pghost --target-db PROD \
  --target-user postgres --target-pass <pwd> \
  --table SALES.STORE_SALES \
  --merkle

# Final output line on success:
# ✓ Merkle verification: all partitions match
#   Proof JSON: projects/<JOB>/05_validation/compare_report.json
# Exit code 0.

# On divergence:
# ✗ Merkle verification: divergence detected — see compare_report.html
# Exit code 3. (Distinct from "migration failed" so CI can branch on it.)

Merkle verification is a Pro+ feature. The free tier ships with traditional row-count + checksum validation; Pro unlocks the cryptographic proof and the drill-down report.

The Honest Verdict

Cryptographic validation isn't a clever marketing line — it's the missing piece in every migration project that ever got blamed for data corruption it didn't cause. With Merkle proof in the pipeline, "did the migration land correctly" stops being a judgement call and becomes a boolean.

Every other tool will catch up eventually. They'll have to — the moment a procurement team sees a side-by-side compare matrix with this row in it, the question goes from "is X good enough?" to "why doesn't X have what Y has?" For now, this is where DBMigrateAIPro stands alone.

See the proof for yourself

DBMigrateAIPro v1.1 with Merkle verification is live in the current build. Free for Year 1 — no signup, no license key, full Pro features unlocked. Run it against a real database; check the proof JSON.