Migrating Oracle LOB and XML Columns to PostgreSQL
Most of an Oracle→PostgreSQL migration is boring in a good way — numbers, strings, dates map cleanly. Then you hit the LOB columns: CLOB, NCLOB, BLOB, and XMLType. That’s where migrations quietly go wrong, first on correctness and then on speed.
The correct type mappings
Start here — get the target types right before anything else:
- CLOB → TEXT — PostgreSQL
textis unbounded; no need for a length. - NCLOB → TEXT — same; the national-charset distinction disappears once you’re in a UTF-8 database.
- BLOB → BYTEA — binary stays binary.
- XMLType → XML — PostgreSQL has a native
xmltype. Use it. - BFILE → (no real equivalent) — BFILE is a pointer to an OS file, not stored data; you migrate the bytes separately or store a path in
text.
The XMLType → jsonb footgun
It is tempting to map XMLType to jsonb — both are “semi-structured,” and jsonb is the trendy choice. Don’t. XML is not JSON. The moment you try to load an XML value into a json column, every row fails:
ERROR: invalid input syntax for type json
DETAIL: Token "<" is invalid.
CONTEXT: COPY big_data_types, line 1, column xml_col: "<ROW><ID>3</ID></ROW>"The fix is simply the right target type. PostgreSQL’s xmltype accepts XML content directly, validates it, and lets you query it with XPath later. If you genuinely want JSON, that’s a datatransformation (parse the XML, emit JSON) — a deliberate step, not a column type you can quietly swap in. Map the type faithfully first; transform data on purpose, never by accident.
The performance trap: LOBs are heavy per row
Correct types get the data in. Speed is the second, sneakier problem. A normal row is a few hundred bytes; a row with a couple of LOBs can be kilobytes to megabytes. Bulk loaders read in batches for throughput — but with LOB columns, every row in the batch carries its full LOB payload. A batch of 50,000 rows that’s fine for a skinny table becomes gigabytes of buffered memory on a LOB-heavy one. The loader thrashes, throughput collapses, and the table looks frozen even though it’s technically still running.
Two things make LOB reads fast and safe
- Fetch LOBs as values, not locators.By default many drivers return a LOB locator — a handle you then read with a separate round-trip per LOB. Reading thousands of locators one at a time is brutally slow. Configure the driver to fetch CLOB/NCLOB as strings and BLOB as bytes inline (in python-oracledb,
fetch_lobs=False) so the data streams with the row. - Use a smaller batch for LOB tables.The right batch size for a LOB-heavy table is much smaller than for the rest of the schema — hundreds of rows, not tens of thousands — to keep the per-fetch memory bounded. The migration should detect LOB columns and cap the batch automatically, not make you guess.
Validate LOBs by value, not by row count
Row counts won’t catch a truncated CLOB or a mis-encoded BLOB. After the load, validate LOB columns at the value level — a hash or checksum of the content, source vs target — so a silently shortened document or a botched binary is caught before cutover, not after a customer notices.
The short version
- CLOB/NCLOB →
text, BLOB →bytea, XMLType →xml(neverjsonb). - Fetch LOBs inline as str/bytes, not as locators.
- Give LOB-heavy tables a much smaller batch size than the rest.
- Validate LOB content by value, not by row count.
None of this is exotic — but every one of these is a place a migration passes its row-count check and still ships corrupted or half-loaded data. Treat LOBs as their own little project inside the migration and they stop being the part that goes wrong.