UTF8 vs Latin1 String comparison in MySQL?


#1

I’m running into a surprising twist in MySql string compare.

Try this:

select “América” = “America”;
±-----------------------+
| “América” = “America” |
±-----------------------+
| 1 |
±-----------------------+

My db and client are in UTF8:
show variables like “%character%”;
±-------------------------±---------------------------+
| Variable_name | Value |
±-------------------------±---------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
±-------------------------±---------------------------+

MySql lets you force an encoding on a per string basis, if you try this:

_utf8"xxx" the string is in UTF8, of
_latin1"xxx" then “xxx” is in Latin 1.

Here is what’s puzzling me:

select _latin1"América" = _latin1"America";
±-------------------------------------+
| _latin1"América" = _latin1"America" |
±-------------------------------------+
| 0 |
±-------------------------------------+

vs

select _utf8"América" = _utf8"America";
±---------------------------------+
| _utf8"América" = _utf8"America" |
±---------------------------------+
| 1 |
±---------------------------------+

Why does it think the UTF8 strings are the same when the Latin1 ones are
not??? What am I not getting?

Thanks!

Wolf


#2

Wolfram Arnold wrote:

I’m running into a surprising twist in MySql string compare.

Try this:

select “América” = “America”;
±-----------------------+
| “América” = “America” |
±-----------------------+
| 1 |
±-----------------------+

I looked into this a bit this morning. It seems that string comparison
is affected by “collation” not charset. The default utf8 collation
yields the unexpected result:

mysql> SHOW VARIABLES LIKE ‘%collat%’;
±---------------------±----------------+
| Variable_name | Value |
±---------------------±----------------+
| collation_connection | utf8_general_ci |
| collation_database | utf8_general_ci |
| collation_server | utf8_general_ci |
±---------------------±----------------+
3 rows in set (0.00 sec)

mysql> select “América” = “America”;
±-----------------------+
| “América” = “America” |
±-----------------------+
| 1 |
±-----------------------+
1 row in set (0.00 sec)

But if I change the collation to utf8_bin, the strings compare as
unequal:

mysql> set collation=‘utf8_bin’;
Query OK, 0 rows affected (0.04 sec)

mysql> set collation_server=‘utf8_bin’;
Query OK, 0 rows affected (0.00 sec)

mysql> set collation_database=‘utf8_bin’;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE ‘%collat%’;
±---------------------±---------+
| Variable_name | Value |
±---------------------±---------+
| collation_connection | utf8_bin |
| collation_database | utf8_bin |
| collation_server | utf8_bin |
±---------------------±---------+
3 rows in set (0.00 sec)

mysql> select “América” = “America”;;
±-----------------------+
| “América” = “America” |
±-----------------------+
| 0 |
±-----------------------+
1 row in set (0.00 sec)

I’m a bit confused myself about how utf8_general_ci could sort
effectively if ‘é’ equals ‘e’ since users would expect all the like
characters to group together and why this collation would be useful; nor
do I understand if there are other implications to using utf8_bin
collation.

There is a nicely written report on Unicode collation here:
http://www.unicode.org/unicode/reports/tr10/

Sarah