PostgreSQL: Convert Windows-1252 ISO-8859-1-superset characters to UTF-8
From FVue
Problem
Characterset Windows-1252 has - compared to ISO-8859-1 - special displayable characters rather than control characters in the 0x80 to 0x9F range. When a Windows-1252 text is converted to Unicode with the source characterset specified as "ISO-8859-1", these special characters will not be converted to Unicode.
How can I convert these special characters to their Unicode equivalent? If I try this code underneath, I get an error:
database# UPDATE table SET text = REPLACE(text, '\x80', '€');
WARNING: nonstandard use of escape in a string literal
LINE 2: text = REPLACE(text, '\x91', '€')
^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR: invalid byte sequence for encoding "UTF8": 0x91
HINT: This error can also happen if the byte sequence does not match the encoding
expected by the server, which is controlled by "client_encoding".
Environment
- PostgreSQL 8.4, 9.3
Solution
Here's an UPDATE statement for the entire 0x80 to 0x9F range, replace mytable and myfield with yours:
UPDATE mytable SET myfield = REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(myfield, U&'\0080', '€'), U&'\0081', '' ), U&'\0082', '‚'), U&'\0083', 'ƒ'), U&'\0084', '„'), U&'\0085', '…'), U&'\0086', '†'), U&'\0087', '‡'), U&'\0088', 'ˆ'), U&'\0089', '‰'), U&'\008a', 'Š'), U&'\008b', '‹'), U&'\008c', 'Œ'), U&'\008d', '' ), U&'\008e', 'Ž'), U&'\008f', '' ), U&'\0090', '' ), U&'\0091', '‘'), U&'\0092', '’'), U&'\0093', '“'), U&'\0094', '”'), U&'\0095', '•'), U&'\0096', '–'), U&'\0097', '—'), U&'\0098', '˜'), U&'\0099', '™'), U&'\009a', 'š'), U&'\009b', '›'), U&'\009c', 'œ'), U&'\009d', '' ), U&'\009e', 'ž'), U&'\009f', 'Ÿ');
If you get this error:
ERROR: unsafe use of string constant with Unicode escapes DETAIL: String constants with Unicode escapes cannot be used when standard_conforming_strings is off.
First do this:
SET standard_conforming_strings=ON;
See also
http://en.wikipedia.org/wiki/Windows-1252
Advertisement