Encoding hell, grep and iconv salvage!edit
16 Dec 2014Nowadays we inherit a lot of old databases.
The typical problem is to extract data from badly encoded fields.
This happens when the browser encoding is forced to let say UTF8
and MySQL
is accepting the the default LATIN1
encoding. In this case
the problem does not manifests immediately since the byte sequence corresponding to
the single character remains immute during the saving and retrieval, but become a problem
when dumped and migrated.
Lets get workaround this problem. At first find non ASCII
characters in the dump file
grep --color='auto' -P "[\x80-\xFF]" FILENAME
Now let’s work it out with iconv
iconv --verbose -f LATIN1 -t UTF8//TRANSLIT FILENAME_latin1 > FILENAME_utf8
If you get the followinf message
iconv: illegal input sequence at position <NUMBER>
this is a good sign of badly encoded character, you may correct it with vim, just type in command mode
:goto <NUMBER>
Taking into account that you’re working with UTF8
locale session in terminal
user@host:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
After you’re finished, just save the file and import it into UTF8
encoded fields of the database!