Forum: Rails I18n UTF8 vs Latin1 String comparison in MySQL?

Posted by Wolfram Arnold (wolframarnold)
on 2009-05-20 10:02
I'm running into a surprising twist in MySql string compare.

Try this:

select "América" = "America";
+------------------------+
| "América" = "America"  |
+------------------------+
|                      1 | 
+------------------------+

My db and client are in UTF8:
show variables like "%character%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | utf8                       | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
+--------------------------+----------------------------+

MySql lets you force an encoding on a per string basis, if you try this:

_utf8"xxx" the string is in UTF8, of
_latin1"xxx" then "xxx" is in Latin 1.

Here is what's puzzling me:

 select _latin1"América" = _latin1"America";
+--------------------------------------+
| _latin1"América" = _latin1"America"  |
+--------------------------------------+
|                                    0 | 
+--------------------------------------+

vs

select _utf8"América" = _utf8"America";
+----------------------------------+
| _utf8"América" = _utf8"America"  |
+----------------------------------+
|                                1 | 
+----------------------------------+

Why does it think the UTF8 strings are the same when the Latin1 ones are
not??? What am I not getting?

Thanks!

Wolf
Posted by Sarah Allen (ultrasaurus)
on 2009-05-20 14:55
Wolfram Arnold wrote:
> I'm running into a surprising twist in MySql string compare.
> 
> Try this:
> 
> select "América" = "America";
> +------------------------+
> | "América" = "America"  |
> +------------------------+
> |                      1 | 
> +------------------------+

I looked into this a bit this morning.  It seems that string comparison 
is affected by "collation" not charset.  The default utf8 collation 
yields the unexpected result:

mysql> SHOW VARIABLES LIKE '%collat%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci | 
| collation_database   | utf8_general_ci | 
| collation_server     | utf8_general_ci | 
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> select "América" = "America";
+------------------------+
| "América" = "America"  |
+------------------------+
|                      1 | 
+------------------------+
1 row in set (0.00 sec)

But if I change the collation to utf8_bin, the strings compare as 
unequal:

mysql> set collation='utf8_bin';
Query OK, 0 rows affected (0.04 sec)

mysql> set collation_server='utf8_bin';
Query OK, 0 rows affected (0.00 sec)

mysql> set collation_database='utf8_bin';
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE '%collat%';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| collation_connection | utf8_bin | 
| collation_database   | utf8_bin | 
| collation_server     | utf8_bin | 
+----------------------+----------+
3 rows in set (0.00 sec)

mysql> select "América" = "America";;
+------------------------+
| "América" = "America"  |
+------------------------+
|                      0 | 
+------------------------+
1 row in set (0.00 sec)

I'm a bit confused myself about how utf8_general_ci could sort 
effectively if 'é' equals 'e' since users would expect all the like 
characters to group together and why this collation would be useful; nor 
do I understand if there are other implications to using utf8_bin 
collation.

There is a nicely written report on Unicode collation here: 
http://www.unicode.org/unicode/reports/tr10/

Sarah
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.