Forum: Rails I18n UTF8 vs Latin1 String comparison in MySQL?

5a60d6c66b7737ba71cecc4c46fc6fd5?d=identicon&s=25 Wolfram Arnold (wolframarnold)
on 2009-05-20 10:02
I'm running into a surprising twist in MySql string compare.

Try this:

select "América" = "America";
+------------------------+
| "América" = "America"  |
+------------------------+
|                      1 |
+------------------------+

My db and client are in UTF8:
show variables like "%character%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

MySql lets you force an encoding on a per string basis, if you try this:

_utf8"xxx" the string is in UTF8, of
_latin1"xxx" then "xxx" is in Latin 1.

Here is what's puzzling me:

 select _latin1"América" = _latin1"America";
+--------------------------------------+
| _latin1"América" = _latin1"America"  |
+--------------------------------------+
|                                    0 |
+--------------------------------------+

vs

select _utf8"América" = _utf8"America";
+----------------------------------+
| _utf8"América" = _utf8"America"  |
+----------------------------------+
|                                1 |
+----------------------------------+

Why does it think the UTF8 strings are the same when the Latin1 ones are
not??? What am I not getting?

Thanks!

Wolf
20ee0d23c969c7740d3c936a4675bb23?d=identicon&s=25 Sarah Allen (ultrasaurus)
on 2009-05-20 14:55
Wolfram Arnold wrote:
> I'm running into a surprising twist in MySql string compare.
>
> Try this:
>
> select "América" = "America";
> +------------------------+
> | "América" = "America"  |
> +------------------------+
> |                      1 |
> +------------------------+

I looked into this a bit this morning.  It seems that string comparison
is affected by "collation" not charset.  The default utf8 collation
yields the unexpected result:

mysql> SHOW VARIABLES LIKE '%collat%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> select "América" = "America";
+------------------------+
| "América" = "America"  |
+------------------------+
|                      1 |
+------------------------+
1 row in set (0.00 sec)

But if I change the collation to utf8_bin, the strings compare as
unequal:

mysql> set collation='utf8_bin';
Query OK, 0 rows affected (0.04 sec)

mysql> set collation_server='utf8_bin';
Query OK, 0 rows affected (0.00 sec)

mysql> set collation_database='utf8_bin';
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE '%collat%';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| collation_connection | utf8_bin |
| collation_database   | utf8_bin |
| collation_server     | utf8_bin |
+----------------------+----------+
3 rows in set (0.00 sec)

mysql> select "América" = "America";;
+------------------------+
| "América" = "America"  |
+------------------------+
|                      0 |
+------------------------+
1 row in set (0.00 sec)

I'm a bit confused myself about how utf8_general_ci could sort
effectively if 'é' equals 'e' since users would expect all the like
characters to group together and why this collation would be useful; nor
do I understand if there are other implications to using utf8_bin
collation.

There is a nicely written report on Unicode collation here:
http://www.unicode.org/unicode/reports/tr10/

Sarah
This topic is locked and can not be replied to.