PostgreSQL: Convert Windows-1252 ISO-8859-1-superset characters to UTF-8

From FVue

Jump to: navigation, search

Contents

Problem

Characterset Windows-1252 has - compared to ISO-8859-1 - special displayable characters rather than control characters in the 0x80 to 0x9F range. When a Windows-1252 text is converted to Unicode with the source characterset specified as "ISO-8859-1", these special characters will not be converted to Unicode.

How can I convert these special characters to their Unicode equivalent? If I try this code underneath, I get an error:

database# UPDATE table SET text = REPLACE(text, '\x80', '€');
WARNING:  nonstandard use of escape in a string literal 
LINE 2:     text = REPLACE(text, '\x91', '€')
                                 ^
HINT:  Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR:  invalid byte sequence for encoding "UTF8": 0x91
HINT:  This error can also happen if the byte sequence does not match the encoding 
expected by the server, which is controlled by "client_encoding".

Solution

Do not escape the special characters, but insert them directly into the sql. Here's an UPDATE statement for the entire 0x80 to 0x9F range, replace mytable and myfield with yours:

UPDATE mytable
SET myfield =
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(
    REPLACE( REPLACE( REPLACE( REPLACE(myfield,
    E'\\u0080', '€'), E'\\u0081', '' ), E'\\u0082', '‚'), E'\\u0083', 'ƒ'),
    E'\\u0084', '„'), E'\\u0085', '…'), E'\\u0086', '†'), E'\\u0087', '‡'),
    E'\\u0088', 'ˆ'), E'\\u0089', '‰'), E'\\u008a', 'Š'), E'\\u008b', '‹'),
    E'\\u008c', 'Œ'), E'\\u008d', '' ), E'\\u008e', 'Ž'), E'\\u008f', '' ),
    E'\\u0090', '' ), E'\\u0091', '‘'), E'\\u0092', '’'), E'\\u0093', '“'),
    E'\\u0094', '”'), E'\\u0095', '•'), E'\\u0096', '–'), E'\\u0097', '—'),
    E'\\u0098', '˜'), E'\\u0099', '™'), E'\\u009a', 'š'), E'\\u009b', '›'),
    E'\\u009c', 'œ'), E'\\u009d', '' ), E'\\u009e', 'ž'), E'\\u009f', 'Ÿ');

See also

http://en.wikipedia.org/wiki/Windows-1252

Comments

blog comments powered by Disqus

Personal tools