UTF8 vs Latin1 String comparison in MySQL?

wolframarnold · May 20, 2009, 10:02am

I’m running into a surprising twist in MySql string compare.

Try this:

select “AmÃ©rica” = “America”;
±-----------------------+
| “AmÃ©rica” = “America” |
±-----------------------+
| 1 |
±-----------------------+

MySql lets you force an encoding on a per string basis, if you try this:

_utf8"xxx" the string is in UTF8, of
_latin1"xxx" then “xxx” is in Latin 1.

Here is what’s puzzling me:

select _latin1"AmÃ©rica" = _latin1"America";
±-------------------------------------+
| _latin1"AmÃ©rica" = _latin1"America" |
±-------------------------------------+
| 0 |
±-------------------------------------+

vs

select _utf8"AmÃ©rica" = _utf8"America";
±---------------------------------+
| _utf8"AmÃ©rica" = _utf8"America" |
±---------------------------------+
| 1 |
±---------------------------------+

Why does it think the UTF8 strings are the same when the Latin1 ones are
not??? What am I not getting?

Thanks!

Wolf

wolframarnold · May 20, 2009, 2:55pm

Wolfram Arnold wrote:

I’m running into a surprising twist in MySql string compare.

Try this:

select “AmÃ©rica” = “America”;
±-----------------------+
| “AmÃ©rica” = “America” |
±-----------------------+
| 1 |
±-----------------------+

I looked into this a bit this morning. It seems that string comparison
is affected by “collation” not charset. The default utf8 collation
yields the unexpected result:

mysql> select “AmÃ©rica” = “America”;
±-----------------------+
| “AmÃ©rica” = “America” |
±-----------------------+
| 1 |
±-----------------------+
1 row in set (0.00 sec)

But if I change the collation to utf8_bin, the strings compare as
unequal:

mysql> set collation=‘utf8_bin’;
Query OK, 0 rows affected (0.04 sec)

mysql> set collation_server=‘utf8_bin’;
Query OK, 0 rows affected (0.00 sec)

mysql> set collation_database=‘utf8_bin’;
Query OK, 0 rows affected (0.00 sec)

mysql> select “AmÃ©rica” = “America”;;
±-----------------------+
| “AmÃ©rica” = “America” |
±-----------------------+
| 0 |
±-----------------------+
1 row in set (0.00 sec)

I’m a bit confused myself about how utf8_general_ci could sort
effectively if ‘Ã©’ equals ‘e’ since users would expect all the like
characters to group together and why this collation would be useful; nor
do I understand if there are other implications to using utf8_bin
collation.

There is a nicely written report on Unicode collation here:
http://www.unicode.org/unicode/reports/tr10/

Sarah