Migration

Character Set Migration: Oracle to PostgreSQL UTF-8

Rakesh Mamidala·Founder & Lead Engineer··6 min read

The Migration That Looks Fine Until the Accents Break

Character-set migration is the one that passes every row count and then surfaces weeks later as a customer named José showing up as Jos?. The fix is mostly a single good decision made up front — target PostgreSQL in UTF-8 — plus an honest audit of what your Oracle source actually stores versus what it claims to store.

Always Target UTF-8

PostgreSQL's UTF8 encoding is the safe default and maps cleanly to Oracle's AL32UTF8. Create the database explicitly in UTF-8 and don't look back — it covers every script, every emoji, every language, and it's what the rest of the modern stack expects.

sql
-- Check the Oracle source character set first
SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';
-- e.g. AL32UTF8  (already Unicode)  or  WE8MSWIN1252 (single-byte)

-- Target: PostgreSQL in UTF-8
CREATE DATABASE sales_pg ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;

The BYTE-vs-CHAR Trap

Oracle columns are often declared in bytes: VARCHAR2(10 BYTE) holds 10 bytes. In a single-byte source charset that's also 10 characters — but in UTF-8 a single accented or CJK character can take 2–4 bytes. PostgreSQL VARCHAR(n) always counts characters, not bytes, so a faithful migration must size on characters, and data that fit Oracle's byte limit may need a wider column — or it overflows on load.

sql
-- Oracle: 'café' is 4 chars but 5 bytes in AL32UTF8
SELECT LENGTH('café') AS chars, LENGTHB('café') AS bytes FROM dual;  -- 4, 5

-- Oracle VARCHAR2(10 BYTE) holding multibyte data → size PG on characters
-- Safe rule: map VARCHAR2(n BYTE) → VARCHAR(n) and audit columns near the limit,
-- or use CHAR semantics at the source: VARCHAR2(10 CHAR)

Legacy Single-Byte & the US7ASCII Landmine

Migrating from a single-byte charset (WE8MSWIN1252, WE8ISO8859P1) to UTF-8 is a straightforward transcode — the driver converts on read. The landmine is US7ASCII databases that have 8-bit data stuffed into a 7-bit charset for years: Oracle never validated it, so the bytes are whatever the app wrote. Transcoding that to UTF-8 surfaces every invalid byte as garbage or an error. Audit it before you cut over, don't discover it during.

Validate at the Byte Level, Not the Row Level

A mangled encoding doesn't change row counts — every row still arrives. Only value-level validation — a checksum or per-partition Merkle fingerprint over the actual decoded strings — catches a column where Ω became O or a smart-quote turned into mojibake. Count-only validation passes a botched transcode straight to production.

How DBMigrateAIPro Handles It

The tool targets UTF-8 by default, flags VARCHAR2(n BYTE) columns whose data is close to the byte limit so multibyte expansion can't overflow them, surfaces US7ASCII sources with non-ASCII bytes as a gap to review, and validates the migrated values at the column level with Merkle fingerprints — so a broken character shows up as a DATA MISMATCH, not a support ticket.

Don't let mojibake ship on cutover

DBMigrateAIPro targets UTF-8, sizes columns on characters, flags the byte-limit and US7ASCII risks, and proves the text landed intact with column-level validation. Free for Year 1 — no signup.