Encoding hell, grep and iconv salvage!
16 Dec 2014
Nowadays we inherit a lot of old databases.
A typical problem is extracting data from badly encoded fields.
This happens when the browser encoding is forced to, let's say, UTF8 while MySQL is accepting the default
LATIN1 encoding. In this case
the problem does not manifest immediately, since the byte sequence corresponding to
a single character remains unchanged during saving and retrieval, but it becomes a problem
when the data is dumped and migrated.
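To see why the trouble only shows up later, here is a minimal sketch. It assumes the stored bytes are UTF-8 and that some later tool treats them as LATIN1; the sample word is made up for illustration.

```shell
# "café" as UTF-8 bytes: the é is the two-byte sequence C3 A9 (octal 303 251).
stored=$(printf 'caf\303\251' | od -An -tx1 | tr -d ' \n')

# A tool that believes these bytes are LATIN1 and re-encodes them to UTF-8
# turns each of the two bytes into its own character (double encoding).
migrated=$(printf 'caf\303\251' | iconv -f LATIN1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "stored:   $stored"
echo "migrated: $migrated"
```

The é grows from two bytes to four, which is the kind of garbage that appears after a careless migration.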
Let's work around this problem. First, find the non-ASCII characters in the dump file:
grep --color='auto' -P "[\x80-\xFF]" FILENAME
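In a UTF8 locale the range above may be read as characters rather than raw bytes, so a variant that forces the C locale can be handy. This is only a sketch, assuming GNU grep built with -P support; the sample file is a hypothetical stand-in for FILENAME.

```shell
# Hypothetical sample file standing in for the real dump.
sample=$(mktemp)
printf 'plain ascii line\ncaf\303\251 au lait\nanother ascii line\n' > "$sample"

# LC_ALL=C makes grep match raw bytes, -a treats the file as text,
# -n prints the line number of every line holding a byte >= 0x80.
matches=$(LC_ALL=C grep -a -n -P '[\x80-\xFF]' "$sample")
count=$(LC_ALL=C grep -a -c -P '[\x80-\xFF]' "$sample")

echo "$matches"
echo "lines with non-ASCII bytes: $count"
rm -f "$sample"
```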
Now let’s work it out with
iconv --verbose -f LATIN1 -t UTF8//TRANSLIT FILENAME_latin1 > FILENAME_utf8
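A quick way to check what the conversion does is to run it on one known byte first. A minimal sketch, assuming a made-up word with a single LATIN1 é (byte E9) as input:

```shell
# One LATIN1-encoded word: é is the single byte E9 (octal 351).
# After conversion the same character should be the UTF-8 pair C3 A9.
converted=$(printf 'caf\351\n' | iconv -f LATIN1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "$converted"
```

If the hex dump shows c3 a9 where the e9 used to be, the conversion direction is the right one for your file.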
If you get the following message
iconv: illegal input sequence at position <NUMBER>
this is a good sign of a badly encoded character. You may correct it with vim from command mode,
taking into account that you’re working in a
UTF8 locale session in the terminal:
user@host:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
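Before opening the editor, it can help to look at the raw bytes around the reported position. The sketch below assumes the position is a zero-based byte offset, and uses a hypothetical sample file and offset in place of the real dump and <NUMBER>.

```shell
# Hypothetical sample file with one stray LATIN1 byte (E9) at offset 3.
sample=$(mktemp)
printf 'abc\351def\n' > "$sample"
pos=3   # hypothetical value standing in for <NUMBER>

# tail -c +N starts at the N-th byte counting from one, so add 1 to the offset;
# head -c 4 keeps four bytes and od prints them as hex.
bytes=$(tail -c +"$((pos + 1))" "$sample" | head -c 4 | od -An -tx1 | tr -d ' \n')

echo "$bytes"
rm -f "$sample"
```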
After you’re finished, just save the file and import it into the
UTF8 encoded fields of the database!
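One last sanity check before the import could be to ask iconv whether the file is valid UTF-8 at all. This is a sketch with two hypothetical sample files; is_utf8 is a helper name invented here, not a standard command.

```shell
# Two hypothetical files: one valid UTF-8, one with a stray LATIN1 byte.
good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"
printf 'caf\351\n' > "$bad"

# iconv exits with a non-zero status when the input is not valid UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 "$good"; then good_status=valid; else good_status=invalid; fi
if is_utf8 "$bad"; then bad_status=valid; else bad_status=invalid; fi

echo "good file: $good_status"
echo "bad file:  $bad_status"
rm -f "$good" "$bad"
```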