Encoding hell, grep and iconv salvage!

16 Dec 2014

Nowadays we inherit a lot of old databases. A typical problem is extracting data from badly encoded fields. This happens when the browser encoding is forced to, let's say, UTF8 while MySQL accepts the data with its default LATIN1 encoding. In this case the problem does not manifest immediately, since the byte sequence corresponding to each character remains unchanged during saving and retrieval, but it becomes a problem when the data is dumped and migrated.
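To make the mechanism concrete, here is a minimal sketch of what presumably happens to a single character during such a migration. It assumes the character "é" was stored as its two UTF8 bytes and is later read as LATIN1 and converted to UTF8 again; the variable names are only illustrative.

```shell
# "é" encoded as UTF8 is the two bytes C3 A9 (octal 303 251).
original=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')

# Reading those two bytes as LATIN1 and converting to UTF8 encodes each
# byte as a separate character, which doubles the length of the sequence.
mangled=$(printf '\303\251' | iconv -f LATIN1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "original bytes: $original"   # c3a9
echo "mangled bytes:  $mangled"    # c383c2a9
```

The four mangled bytes display as "Ã©" in a UTF8 terminal, which is the usual symptom of this kind of double encoding.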

Let's work around this problem. First, find the non-ASCII characters in the dump file:

grep --color='auto' -P "[\x80-\xFF]" FILENAME
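A variant of the search above that may be handy on large dumps is sketched here: it prints line numbers and a count of affected lines. It assumes GNU grep built with -P support; the sample file is a hypothetical stand-in for the dump. Setting LC_ALL=C asks grep to compare raw bytes, and -a keeps it from treating the file as binary.

```shell
# Hypothetical sample standing in for the dump file:
# one pure ASCII line and one line containing "é" as UTF8 bytes.
sample=$(mktemp)
printf 'plain ascii\ncaf\303\251\n' > "$sample"

# -n prefixes each matching line with its line number.
hits=$(LC_ALL=C grep -a -n -P '[\x80-\xFF]' "$sample")

# -c prints only the number of lines that contain a non-ASCII byte.
count=$(LC_ALL=C grep -a -c -P '[\x80-\xFF]' "$sample")

echo "$hits"
echo "lines with non-ASCII bytes: $count"
rm -f "$sample"
```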

Now convert the file with iconv:

iconv --verbose -f LATIN1 -t UTF8//TRANSLIT FILENAME_latin1 > FILENAME_utf8

If you get the following message

iconv: illegal input sequence at position <NUMBER>

this is a good sign of a badly encoded character. You can correct it with vim; just type in command mode

:goto <NUMBER>

This assumes that your terminal session uses a UTF8 locale:

user@host:~$ locale 
LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
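To look at the offending bytes without opening an editor, something like the following sketch may help. The helper name show_bytes_at and the sample file are hypothetical, and the sketch assumes that the position reported by iconv is a zero-based byte offset.

```shell
# show_bytes_at FILE POS [COUNT]
# Hypothetical helper: dump COUNT bytes (default 16) of FILE, starting
# at byte offset POS, as hexadecimal.
show_bytes_at() {
    # -A d : print offsets in decimal
    # -t x1: print one hexadecimal byte per column
    # -j   : skip POS bytes before dumping
    # -N   : dump at most COUNT bytes
    od -A d -t x1 -j "$2" -N "${3:-16}" "$1"
}

# Hypothetical sample: "caf", then the lone LATIN1 byte E9, then " au lait".
sample=$(mktemp)
printf 'caf\351 au lait\n' > "$sample"

# Dump 4 bytes starting at offset 3, where the E9 byte sits.
show_bytes_at "$sample" 3 4
rm -f "$sample"
```

A byte such as e9 that is not followed by continuation bytes is a typical sign of LATIN1 text sitting inside an otherwise UTF8 file.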

Once you’re finished, save the file and import it into the UTF8-encoded fields of the database!
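Before importing, it may be worth confirming that the saved file really decodes as UTF8. A minimal sketch, assuming a hypothetical helper named is_valid_utf8 and two throwaway sample files: converting from UTF8 to UTF8 changes nothing, but iconv exits with an error if it meets a byte sequence it cannot decode.

```shell
# is_valid_utf8 FILE
# Hypothetical helper: succeeds only if iconv can decode all of FILE as UTF8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"   # "é" as the UTF8 bytes C3 A9
printf 'caf\351\n' > "$bad"        # "é" as the single LATIN1 byte E9

if is_valid_utf8 "$good"; then echo "good file: valid UTF8"; fi
if ! is_valid_utf8 "$bad"; then echo "bad file: not valid UTF8"; fi
rm -f "$good" "$bad"
```

Note that this check only catches stray bytes; text that was double encoded is still valid UTF8 and has to be spotted by eye.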

Igor Moiseev Applied mathematician, AI Enthusiast
