We had a problem at TheClymb: our database and tables were created with the default Latin1 encoding. Now all of the data in those tables is actually UTF8 because it was all imported via the web browser (which defaults to UTF8) and MySQL doesn’t actually validate or convert data encoding when inserting.
A suggestion was to just set this in an initializer:
This will solve your problems in Ruby but will not solve your problems in the database: MySQL will still sort and compare strings thinking they are latin1 and thus do so incorrectly. Here’s an example:
I’m not a linguist but to the best of my knowledge the German ß is essentially “ss”. When we ask MySQL to sort our names, you can see that the UTF-8 results put the ß character between “r” and “ss” but the Latin1 results don’t. If a German were to see this, they would be enraged due to your culturally insensitive code!
> select * from names_latin1 order by name; +-----------------+ | name | +-----------------+ | Martin Straure | | Martin Strausse | | Martin Straute | | Martin Strauße | +-----------------+
> select * from names_utf8 order by name; +-----------------+ | name | +-----------------+ | Martin Straure | | Martin Strauße | | Martin Strausse | | Martin Straute | +-----------------+
We need to update the CHARACTER SET without doing any conversion of the data. This is simple to do: you convert the columns to a blob format and then convert them back to a string format with the proper encoding declared; MySQL will not do any conversion of raw binary data. For example:
With this in mind, I wrote a rake task to convert our application’s database. Here’s the full script in a Github Gist. You’ll need to run it with a
DOIT parameter to actually make the changes otherwise it will just print the SQL it will run to the terminal. The script will take a long time for large databases since it has to ALTER TABLE, which means MySQL will write out the table to disk in full; the script does minimize the number of ALTER TABLEs it runs to two per table.
rake convert_to_utf8 DOIT=1
So please think of the Germans and the rest of our international friends: converting the character set of your database to the proper value is important to get correct sorting of results.