Help with str.sub(pattern, replacement) and French characters

I am trying to search my database, which has product names with French
accents. They are encoded using HTML entity codes such as &eacute;
etc.

If a user enters a word with a French accent in the search box, I must
convert it to the HTML entity code so it can be found in the database.

So I thought to use str.sub(pattern, replacement) => new_str

However, when I try this with product.sub('é', '&eacute;'), for example
in the following query:

find(:all, :select => 'product_id, name', :order => "name", :conditions
=> ["name like ? and locale = ?", "%#{product.sub('é', '&eacute;')}%",
I18n.locale])

When I enter 'é' in the search box I get the following:

SELECT product_id, name FROM product_descriptions WHERE (name like
'%é%' and locale = 'en')

so it does not replace the 'é' with '&eacute;'.

But if I change the letter in the sub call from 'é' to 'e' and do a
search for 'e', I get the following:

SELECT product_id, name FROM product_descriptions WHERE (name like
'%&eacute;%' and locale = 'en')

so the replacement works.

Can anyone explain why it won’t work for the character with the French
accent?

Thank you in advance.
Mitch

Ideally, you shouldn’t have HTML entities in the database. If you need
them in your HTML (and you don’t, if you set an explicit encoding,
except for things like &<>) then you should add them outside the
database.

If you do have “é” stored in the database, not as an entity, I believe
MySQL’s “LIKE” will be accent-insensitive by default, unless you use
“COLLATE utf8_bin” (google for details).

Note that if you use “sub”, you will only replace the first occurrence
in the string. You probably want “gsub”.
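
To make the sub/gsub difference concrete, here is a minimal sketch. The
sample strings and the ENTITIES mapping are made up for illustration and
cover only three characters; a real mapping would need many more.

```ruby
# Hypothetical mapping from accented characters to HTML entities.
ENTITIES = { "é" => "&eacute;", "è" => "&egrave;", "à" => "&agrave;" }.freeze

name = "café crémé"

# sub replaces only the first match.
first_only = name.sub("é", "&eacute;")

# gsub replaces every match.
all_replaced = name.gsub("é", "&eacute;")

# gsub also accepts a hash: each matched character is looked up as a key.
mapped = "élève à Paris".gsub(/[éèà]/, ENTITIES)

puts first_only    # caf&eacute; crémé
puts all_replaced  # caf&eacute; cr&eacute;m&eacute;
puts mapped        # &eacute;l&egrave;ve &agrave; Paris
```

The hash form keeps all the replacements in one place instead of chaining
one gsub call per character.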

And if you do something like “bléh”.sub(“é”, “&eacute;”) and it doesn’t
replace the “é”, the issue could be how the “é” is represented. In
UTF-8, accented characters can be represented either composed as a
single code point (“latin small letter e with acute”) or decomposed as
two code points: “latin small letter e” + “combining acute accent”. So
if your string contains the first type of é and your sub/gsub tries to
replace the other type, it won’t work. You can normalize the string to
ensure everything is composed or decomposed, but it would be better not
to have entities in the database.
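
A small sketch of the composed/decomposed mismatch and of normalizing
before replacing. It assumes Ruby 2.2 or later, where
String#unicode_normalize is available; the variable names are only for
illustration.

```ruby
composed   = "\u00E9"   # "é" as one code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# The two strings look the same on screen but are not equal.
same = (composed == decomposed)

# Searching for the composed form in a decomposed string finds nothing.
untouched = decomposed.sub(composed, "&eacute;")

# Normalizing to NFC (composed form) first makes the replacement match.
normalized = decomposed.unicode_normalize(:nfc)
replaced   = normalized.sub(composed, "&eacute;")

puts same                     # false
puts untouched == decomposed  # true
puts replaced                 # &eacute;
```

If user input can arrive in either form, normalizing it once at the
point of entry avoids having to handle both forms everywhere else.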

In reply to Henrik’s post #976928:

OK, I will take your advice and remove the HTML entities from the
database.
The reason I put them in was that even with explicit encoding the
characters were not showing properly. I was getting a black triangle
with a question mark.

Could you help me with how to encode the web page so that it shows the
accents? I thought you just use UTF-8?

Thanks for your help. I really appreciate it.

Yes, encode the file in UTF-8 and add a meta tag that declares the UTF-8
charset to your head section.

In reply to Mitchell G.’s post of Mon, Jan 24, 2011 at 00:14:

Check what your browser thinks the encoding is. Check that UTF-8 is
declared in the HTTP headers or a meta element (if they disagree, the
HTTP header should take precedence, as far as I know, but research
that).
The HTML Purifier article “UTF-8: The Secret of Character Encoding” has
some info.

Also ensure the font you’re using can handle that glyph. I would guess
most fonts can display é. But if everything else looks right, try some
standard font like Times and see what happens.

I removed some HTML entities from my database to test the effect. I made
sure my web page is UTF-8 encoded.

Now instead of “électronique” I get name: “\xC9lectronic”, where the
“\xC9” displays like a black triangle with a “?” in it.

I also changed the font to Times.

I read up and learned that MySQL might be delivering the characters in a
format other than UTF-8.
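
For what it’s worth, the “\xC9” symptom can be reproduced in plain Ruby.
This sketch assumes the bytes arrive as ISO-8859-1 (Latin-1), where 0xC9
is “É”; that is a guess from the byte value, not something confirmed
about this particular setup.

```ruby
# Raw bytes as they might come off the connection: 0xC9 followed by ASCII.
raw = "\xC9lectronique".b

# Treated as UTF-8, the byte sequence is invalid, hence the
# replacement symbol in the browser.
as_utf8 = raw.dup.force_encoding("UTF-8")
valid   = as_utf8.valid_encoding?

# Labelled as ISO-8859-1 and transcoded, the text comes out correctly.
fixed = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts valid  # false
puts fixed  # Électronique
```

Checking valid_encoding? on a string read from the database is a quick
way to tell whether the bytes really are UTF-8.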

I changed my database, table, and field to be UTF-8.

I still get the same problem as stated above.

What gives?

Thanks in advance

Mitch

Hi,
I figured it all out. I need to explicitly tell Rails that the database
is using utf8 encoding by putting the following in the database.yml
file:

encoding: utf8

Now it displays perfectly.
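
For anyone finding this later, here is a sketch of where that key sits.
The YAML below is a hypothetical database.yml: the adapter, database
name, and username are placeholders, and only the encoding line comes
from this thread. The Ruby around it just parses the text to show that
the key belongs under an environment entry.

```ruby
require "yaml"

# Hypothetical database.yml content; every value except "encoding"
# is a placeholder.
config_text = <<~YAML
  development:
    adapter: mysql2
    database: shop_development
    username: root
    encoding: utf8
YAML

config   = YAML.safe_load(config_text)
encoding = config.fetch("development").fetch("encoding")

puts encoding  # utf8
```

Each environment (development, test, production) has its own entry, so
the encoding line would need to be present in each one that is used.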

I hope this is still in line with best practices as I don’t want to mess
this up again.

Thanks

Mitch