Note: This post was originally written in August 2009
Over the past week or so, I’ve been working on converting some legacy MySQL databases from their original character set to UTF-8. Although the process sounds easy enough, at least a few issues tend to crop up with every conversion. Whether it’s fields getting randomly truncated, tables containing data with multiple character sets, or data displaying as pure gibberish, something unexpected is bound to happen.
In this case, I was converting the data from a semi-large forum hosting site. This site had multiple language packs, each with its own character set. All of the data in the database tables was stored using the latin1 character set. In essence, although each forum had its own character set, the database treated the data as if it was using a single character set.
As expected, when the conversion was done and I asked for feedback, I was informed that one of the forums was displaying gibberish. Upon researching the problem, I discovered a few facts:
- The forum was using the Polish language pack, which used the ISO-8859-2 character set.
- All of the data appeared to be present on the site; it was just improperly encoded and thus looked like gibberish.
My first instinct was to try skipping the actual data conversion (for certain languages, including Polish, I was using MySQL’s charset support to convert the data from latin1 to UTF-8: it was easier than using an outside tool). Unfortunately, that resulted in fields being truncated and data going missing on import.
Thinking that the truncated data was due to an incorrect character set, I tried using iconv, an extremely useful tool for converting between character sets. I took the backup and told iconv to convert it from ISO-8859-2 to UTF-8:
iconv -f ISO-8859-2 -t UTF-8 < input-file.sql > converted-file.sql
Unfortunately, the fields remained truncated.
Upon further investigation, I realized the source of the problem: the original backup I had from the forum host was in UTF-8, not latin1! mysqldump, the tool I had used to generate the backups, uses UTF-8 by default: you need to specify if you want to use a different character set. In this case, since the tables were latin1 but the backup was supposed to be in UTF-8, MySQL was converting the data itself. Since not all of the data was latin1 to begin with, certain information was being lost in translation.
Once I realized my mistake, it was easy enough to correct. I made a new backup with the correct character set indicated:
mysqldump --default-character-set=latin1 -u username -p -h example.com database_name > backup.sql
and then I converted the file using iconv:
iconv -f ISO-8859-2 -t UTF-8 < backup.sql > converted-file.sql
Finally, with the data itself properly converted to UTF-8, the only remaining step was to update the database schema; I manually edited the database backup and changed mentions of latin1 to UTF-8. The import was a breeze and the forum was up and running in no time.