I'm running into a surprising twist in MySql string compare. Try this: select "América" = "America"; +------------------------+ | "América" = "America" | +------------------------+ | 1 | +------------------------+ My db and client are in UTF8: show variables like "%character%"; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ MySql lets you force an encoding on a per string basis, if you try this: _utf8"xxx" the string is in UTF8, of _latin1"xxx" then "xxx" is in Latin 1. Here is what's puzzling me: select _latin1"América" = _latin1"America"; +--------------------------------------+ | _latin1"América" = _latin1"America" | +--------------------------------------+ | 0 | +--------------------------------------+ vs select _utf8"América" = _utf8"America"; +----------------------------------+ | _utf8"América" = _utf8"America" | +----------------------------------+ | 1 | +----------------------------------+ Why does it think the UTF8 strings are the same when the Latin1 ones are not??? What am I not getting? Thanks! Wolf
on 2009-05-20 10:02
on 2009-05-20 14:55
Wolfram Arnold wrote: > I'm running into a surprising twist in MySql string compare. > > Try this: > > select "América" = "America"; > +------------------------+ > | "América" = "America" | > +------------------------+ > | 1 | > +------------------------+ I looked into this a bit this morning. It seems that string comparison is affected by "collation" not charset. The default utf8 collation yields the unexpected result: mysql> SHOW VARIABLES LIKE '%collat%'; +----------------------+-----------------+ | Variable_name | Value | +----------------------+-----------------+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +----------------------+-----------------+ 3 rows in set (0.00 sec) mysql> select "América" = "America"; +------------------------+ | "América" = "America" | +------------------------+ | 1 | +------------------------+ 1 row in set (0.00 sec) But if I change the collation to utf8_bin, the strings compare as unequal: mysql> set collation='utf8_bin'; Query OK, 0 rows affected (0.04 sec) mysql> set collation_server='utf8_bin'; Query OK, 0 rows affected (0.00 sec) mysql> set collation_database='utf8_bin'; Query OK, 0 rows affected (0.00 sec) mysql> SHOW VARIABLES LIKE '%collat%'; +----------------------+----------+ | Variable_name | Value | +----------------------+----------+ | collation_connection | utf8_bin | | collation_database | utf8_bin | | collation_server | utf8_bin | +----------------------+----------+ 3 rows in set (0.00 sec) mysql> select "América" = "America";; +------------------------+ | "América" = "America" | +------------------------+ | 0 | +------------------------+ 1 row in set (0.00 sec) I'm a bit confused myself about how utf8_general_ci could sort effectively if 'é' equals 'e' since users would expect all the like characters to group together and why this collation would be useful; nor do I understand if there are other implications to using utf8_bin collation. There is a nicely written report on Unicode collation here: http://www.unicode.org/unicode/reports/tr10/ Sarah
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.