Friday, January 20, 2012

Encoding and mysql

Knowing how encoding works is crucial when dealing with non-ASCII characters in your source code or database. If your data is numeric, or ASCII based, you could never know that there is such thing, but most of the time this is a issue. If you're dealing with multilingual data, it is definitely a important thing.

MySQL has four encoding settings: the server encoding, the table encoding, the column encoding, and the client encoding. Server encoding is the default encoding tables and columns will follow if not specified. Table encoding supersedes the Server if the column's is not definied. Finally, client encoding defines to which one should retrieved data be displayed.

I once had a problem with some mysql database texts being corrupted, only the non-ASCII characters on portuguese text. The problem, basically, was that data was generated on an old version of Excel, which by default saved CSV files with Latin-1 encoding, and the same data was loaded on some table which were defined to be encoded on UTF8.

The command

SELECT * FROM t1

in a Latin1 client retrieved garbage (unreadable in both Latin1 or UTF8 terminals), but the same command in a UTF8 client retrieved data in the original Latin1 encoding!

Below are some tips to diagnose and solve those problems.

To sort out if your data is stored on a different encoding from what you defined on the table or column the problem is, on *NIX, I open a terminal emulator, ensure that its encoding is UTF-8, connect to mysql using the --character-set utf8 option and then select data known to have non-standard characters. If everything is right,

I always use the *NIX terminal to perform these checks. First, ensure which encoding your terminal emulator is using; then, connect to the database using the same encoding as the terminal. Then check which encoding is the table and the column.

If when you SELECT data, it appears garbled, you can be sure that data isn't encoded as the table tells it is. Then you can connect to mysql using the same encoding as the table is defined (even if it is not the same as the terminal); if the same SELECT retrieves data correctly displayed, then your data is in the same encoding as your terminal and you should ALTER TABLE accordingly.

If your data is still garbled, try switching the terminal's encoding and doing the SELECT over and over, until it is displayed correctly. If so, change the table/column encoding to match the terminal's and everything must be fine.

If things still aren't worked out, your data has been corrupted, probably because it has a certain encoding X and was converted from another encoding Y to Z (or even back to X). Making the other way round may work to get thing back, but sometimes data is lost permanently.

The problem using GUIs to sort things out is that sometimes you don't really know which encoding the GUI itself is using. I don't imagine how to deal with it on Windows. *NIX Terminal Emulators I know usually have this setting in a place easy to be found, and mysql terminal client is also very easy to define this.

Also, I found out that the encoding defined on tables and columns doesn't affect how data is stored, but only how it is retrieved. When the table/column encoding matches the client's, data is retrieved as it was stored initially. When they don't match, results will be converted FROM table/column encoding TO client encoding.

So if your data is latin1 encoded, your table is defined as utf8, and you set the client to latin1, results will be converted to latin1, AS IF IT WERE utf8! It will completely mess your data!

You could set whatever encoding you want, as long as the client encoding matches the data the data will be retrieved unchanged.

It is also good practice to check (using the command "file") the encoding of any file before loading it in the database. "iconv" is helpful when translating files between encodings.

Finally, the "encoding collation" property interferes on how LIKE and other operators perform on those kind of character. For instance, if you set "utf8-general-ci", "ç" LIKE "c" = 1, but for "utf8-bin" it yields 0.

This page is powered by Blogger. Isn't yours?