How DBMigrateAIPro Mathematically Proves Your Data is Correct
The Trust Problem in Every Migration
Every database migration ends the same way. The tool prints a number — "copied 47,328,914 rows" — and you have to decide whether to believe it. Row counts match? Great. But row counts match even when every numeric column has been silently truncated, every timestamp has shifted by an hour, every NULL has become an empty string.
The traditional answer is full-row checksum validation: re-read every row from both sides, hash them, compare. That works — but it takes hours on enterprise-scale tables, and most teams skip it under deadline pressure. The result: post-cutover discoveries weeks later, when finance reports come back wrong.
DBMigrateAIPro takes a different approach. Per-partition Merkle roots, computed inline during the COPY loop on both source and target, compared in seconds. Mathematical proof that every column of every row landed correctly — at O(log N) cost instead of O(N). On a 1-billion-row table, that's hours collapsing to seconds. It's the headline v1.1 feature, and it's the moment most migration audiences sit up straight.
A 60-Second Refresher on Merkle Trees
A Merkle tree is just a hash tree: every leaf is the hash of a value, every internal node is the hash of its children concatenated, and the root is the hash of everything below it. Two key properties make them useful:
- Compact equality check. If two trees have the same root hash, every leaf is identical. One 32-byte comparison settles the question for arbitrary data sizes.
- Fast localisation of differences. When roots differ, walk down the tree — the diverging sub-tree narrows the search to exactly which leaves disagree. O(log N) work to find the needle in the haystack.
Bitcoin uses Merkle trees to prove transactions are in a block. Git uses them to prove file trees haven't changed. We use them to prove your migrated rows match the source.
Our Tree Shape — Flat, Not Deep
Textbook Merkle trees are binary: every node has two children, depth grows logarithmically with N. For a 1B-row partition, that's a 30-deep tree. Walking 30 levels deep on a mismatch is fine, but it's also more complexity than the problem needs.
Our tree is intentionally flat: 1,024 leaves per group, two levels deep. For a partition with 10 million rows, that's ~10,000 group hashes (~320 KB) and a single 32-byte root. The mismatch path is root → group → individual row — exactly the debugging journey a customer wants. They don't want to traverse a 30-level abstract tree; they want to see the divergent row.
Partition root (32 bytes — SHA-256 of all group hashes concatenated)
│
├── group_0 (32 bytes — SHA-256 of first 1024 row hashes concatenated)
│ ├── row_0 (32 bytes — SHA-256 of canonical row form)
│ ├── row_1
│ └── ... (1024 rows per group)
├── group_1
├── group_2
└── group_N (last group may have fewer than 1024 rows)The Canonical-Form Problem
For two rows to hash identically across Oracle and PostgreSQL, they have to be byte-identicalwhen serialised. That sounds obvious but is the hardest part of the entire system — a single formatting difference and every hash is wrong on one side. Our canonical form rules:
- Text / VARCHAR / CHAR — raw UTF-8 bytes. No length-padding (Oracle CHAR right-pads; PG doesn't — both sides strip before hashing).
- NUMERIC — IEEE decimal canonical string. "1.5" not "1.50". "0" not "0.0".
- TIMESTAMP / TIMESTAMPTZ — ISO-8601 normalised to UTC with explicit Z suffix. Both sides convert at hash time.
- BYTEA / BLOB / RAW — raw bytes.
- BOOLEAN — "t" or "f" (PG style), Oracle's NUMBER(1) values mapped at the type-mapping layer.
- NULL — literal
\x00byte. Distinct from empty string"".
The single biggest footgun in any implementation: NULL vs empty string. (NULL, "foo") and ("", "foo") are different rows that some naive serialisations would collapse into the same byte sequence. Our canonical form distinguishes them, and our test suite (158 tests in the Merkle module alone) has fixtures locking that down.
Computed Inline, Not in a Second Pass
Implementations typically compute Merkle roots in a separate pass after migration — open both databases, re-read everything, hash, compare. That doubles I/O. We do it inline.
During the COPY loop that streams rows from source to target, the same row goes through three cheap operations: hashed for the source root, transformed by the type mapper, hashed for the target root. Hash cost is ~2 µs per row on modern CPUs — negligible against the network I/O the row was already going to incur. Source and target roots are both ready the moment the partition finishes loading. The compare is a 32-byte equality check.
# Inside the COPY loop, simplified
source_hasher = MerkleHasher(leaves_per_group=1024)
target_hasher = MerkleHasher(leaves_per_group=1024)
for row in source_extractor.stream():
source_hasher.add_row(row) # ~2 µs
transformed = type_mapper(row)
target_hasher.add_row(transformed) # ~2 µs
target_loader.copy_row(transformed) # the actual I/O — dominates
source_root = source_hasher.finalize()
target_root = target_hasher.finalize()
if source_root.merkle_root == target_root.merkle_root:
# done. mathematical proof every row landed correctly.
log("Partition verified — roots match")
else:
# drill down to find the divergent group(s)
diverge_groups(source_root, target_root)Drill-Down: Finding the Needle
When roots differ — and they will, in real migrations: a NULL got dropped, a precision got lost, a timezone shifted — the diagnostic experience matters. Walking from "something is wrong somewhere in 50 million rows" to the exact offending row is what makes the difference between "the tool flagged it" and "we can fix this in 5 minutes".
Our drill-down is two clicks. Open the compare report, see which partitions are red. Open a red partition, see which group hashes diverged. The divergent group has 1,024 rows; we write them to a per-partition CSV (source rows + target rows side by side) so the engineer can diff them directly. Total time from "tool says mismatch" to "here's row 47,328,514, the salary column was rounded": under a minute.
What Every Other Tool Doesn't Do
The honest competitive landscape. The major migration tools all stop at "row count matches" — which is structurally insufficient for proving data correctness:
| Capability | DBMigrateAIPro | AWS DMS | GoldenGate | Striim / Fivetran |
|---|---|---|---|---|
| Per-partition cryptographic verification | ✓ | ✗ | ✗ | ✗ |
| O(log N) compare cost on mismatch | ✓ | ✗ | ✗ | ✗ |
| Drill-down to the exact divergent rows | ✓ | ✗ | ✗ | ✗ |
| Auditable proof JSON saved with the run | ✓ | ✗ | partial | ✗ |
| Row-count "validation" | ✓ | ✓ | ✓ | ✓ |
| Full-row checksum comparison (O(N)) | ✓ | ✓ | ✓ | ✓ |
Cryptographic verification is genuinely novel in the migration-tool space. It compounds with the rest of our pitch: we don't just claim 95% automated conversion — we mathematically prove the 95% landed correctly.
The Audit Trail You Can Hand to a Customer
Every Merkle run writes three artefacts to 05_validation/ in the migration project:
source_roots.json— per-partition source-side Merkle roots, signed with run metadatatarget_roots.json— same shape, target sidecompare_report.html+.json— human-readable + machine-readable summary, partition-by-partition status, divergence drill-down where applicable
For a regulated customer (financial services, healthcare, public sector), these artefacts are the evidence pack. The auditor opens compare_report.html, sees green across every partition, the case closes. For internal teams, they're the receipt — the "here's why we're confident" proof that survives leadership turnover and post-mortems.
How to Turn It On
Single CLI flag, single GUI checkbox. From the command line:
python main.py --no-ui migrate \
--source oracle --source-host orahost --source-db PROD \
--source-user demo_src --source-pass <pwd> \
--target postgres --target-host pghost --target-db PROD \
--target-user postgres --target-pass <pwd> \
--table SALES.STORE_SALES \
--merkle
# Final output line on success:
# ✓ Merkle verification: all partitions match
# Proof JSON: projects/<JOB>/05_validation/compare_report.json
# Exit code 0.
# On divergence:
# ✗ Merkle verification: divergence detected — see compare_report.html
# Exit code 3. (Distinct from "migration failed" so CI can branch on it.)Merkle verification is a Pro+ feature. The free tier ships with traditional row-count + checksum validation; Pro unlocks the cryptographic proof and the drill-down report.
The Honest Verdict
Cryptographic validation isn't a clever marketing line — it's the missing piece in every migration project that ever got blamed for data corruption it didn't cause. With Merkle proof in the pipeline, "did the migration land correctly" stops being a judgement call and becomes a boolean.
Every other tool will catch up eventually. They'll have to — the moment a procurement team sees a side-by-side compare matrix with this row in it, the question goes from "is X good enough?" to "why doesn't X have what Y has?" For now, this is where DBMigrateAIPro stands alone.
See the proof for yourself
DBMigrateAIPro v1.1 with Merkle verification is live in the current build. Free for Year 1 — no signup, no license key, full Pro features unlocked. Run it against a real database; check the proof JSON.
- 🔗 Download the desktop tool: medaxai.com
- 🔗 Try the in-browser schema converter: medaxai.com/tools/schema-converter
- 🔗 Related — Why Data Validation Is the Most Underrated Migration Step