Migration

Migrating Oracle LOB and XML Columns to PostgreSQL

Rakesh Mamidala·Founder & Lead Engineer··7 min read

Most of an Oracle→PostgreSQL migration is boring in a good way — numbers, strings, dates map cleanly. Then you hit the LOB columns: CLOB, NCLOB, BLOB, and XMLType. That’s where migrations quietly go wrong, first on correctness and then on speed.

The correct type mappings

Start here — get the target types right before anything else:

  • CLOB → TEXT — PostgreSQL text is unbounded; no need for a length.
  • NCLOB → TEXT — same; the national-charset distinction disappears once you’re in a UTF-8 database.
  • BLOB → BYTEA — binary stays binary.
  • XMLType → XML — PostgreSQL has a native xml type. Use it.
  • BFILE → (no real equivalent) — BFILE is a pointer to an OS file, not stored data; you migrate the bytes separately or store a path in text.

The XMLType → jsonb footgun

It is tempting to map XMLType to jsonb — both are “semi-structured,” and jsonb is the trendy choice. Don’t. XML is not JSON. The moment you try to load an XML value into a json column, every row fails:

text
ERROR: invalid input syntax for type json
DETAIL: Token "<" is invalid.
CONTEXT: COPY big_data_types, line 1, column xml_col: "<ROW><ID>3</ID></ROW>"

The fix is simply the right target type. PostgreSQL’s xmltype accepts XML content directly, validates it, and lets you query it with XPath later. If you genuinely want JSON, that’s a datatransformation (parse the XML, emit JSON) — a deliberate step, not a column type you can quietly swap in. Map the type faithfully first; transform data on purpose, never by accident.

The performance trap: LOBs are heavy per row

Correct types get the data in. Speed is the second, sneakier problem. A normal row is a few hundred bytes; a row with a couple of LOBs can be kilobytes to megabytes. Bulk loaders read in batches for throughput — but with LOB columns, every row in the batch carries its full LOB payload. A batch of 50,000 rows that’s fine for a skinny table becomes gigabytes of buffered memory on a LOB-heavy one. The loader thrashes, throughput collapses, and the table looks frozen even though it’s technically still running.

Two things make LOB reads fast and safe

  • Fetch LOBs as values, not locators.By default many drivers return a LOB locator — a handle you then read with a separate round-trip per LOB. Reading thousands of locators one at a time is brutally slow. Configure the driver to fetch CLOB/NCLOB as strings and BLOB as bytes inline (in python-oracledb, fetch_lobs=False) so the data streams with the row.
  • Use a smaller batch for LOB tables.The right batch size for a LOB-heavy table is much smaller than for the rest of the schema — hundreds of rows, not tens of thousands — to keep the per-fetch memory bounded. The migration should detect LOB columns and cap the batch automatically, not make you guess.

Validate LOBs by value, not by row count

Row counts won’t catch a truncated CLOB or a mis-encoded BLOB. After the load, validate LOB columns at the value level — a hash or checksum of the content, source vs target — so a silently shortened document or a botched binary is caught before cutover, not after a customer notices.

The short version

  • CLOB/NCLOB → text, BLOB → bytea, XMLType → xml (never jsonb).
  • Fetch LOBs inline as str/bytes, not as locators.
  • Give LOB-heavy tables a much smaller batch size than the rest.
  • Validate LOB content by value, not by row count.

None of this is exotic — but every one of these is a place a migration passes its row-count check and still ships corrupted or half-loaded data. Treat LOBs as their own little project inside the migration and they stop being the part that goes wrong.