Strange matching: maybe a multilanguage collation problem?

Francis_H · September 22, 2006, 1:47am

Hi,

We’re using Ferret in a slightly unorthodox way: We’re indexing a
large (>100,000) list of names of places all around the world. Mostly
we’re quite happy with it, and have been able to graft on our own
particular required functionality with just a little tweaking.

There’s one strange problem, though: We’ve got a place in Cyprus
called “Gazima\304\237usa” (that \304\237 is a multibyte character in
UTF-8), and it matches a search for “usa”. We’d rather it not match.
I don’t know that much about Ferret or about this sort of indexing in
general, but is this because Ferret views \304\237 as a word break,
and splits the name into two words? If so, is there a way you’d
recommend to get around this – keeping in mind that we’ve got names
in romanized forms of many different languages?

Thanks in advance,

Francis
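
For readers wondering how a multibyte character could end up acting as a word break, here is a minimal sketch in plain Ruby. It is not Ferret's tokenizer; the two regular expressions are only stand-ins (assumptions for illustration) for a byte-oriented, ASCII-only notion of "letter" versus a Unicode-aware one.

```ruby
# Illustration only: these regexes are stand-ins, not Ferret's analyzers.
name = "Gazima\u011Fusa" # same UTF-8 bytes as "Gazima\304\237usa"

# Byte-oriented view: only ASCII letters count as word characters, so the
# two bytes that encode U+011F fall outside every token and split the name.
ascii_tokens = name.b.scan(/[A-Za-z]+/).map(&:downcase)
p ascii_tokens   # => ["gazima", "usa"]

# Unicode-aware view: U+011F is a letter, so the name stays one token.
unicode_tokens = name.scan(/\p{L}+/).map(&:downcase)
p unicode_tokens # => ["gazimağusa"]
```

With the first kind of splitting, a search for "usa" would match the second token, which is the behaviour described above.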

Francis_H · September 22, 2006, 8:22am

Hi Francis,

Yes, it is because Ferret sees that character as a word break. Either
you are using an ASCII analyzer (which I doubt) or your locale isn’t
set to handle UTF-8. You can set your locale like this:

ENV['LANG'] = 'en_US.utf8'

Or use whatever locale your data is stored as. Let me know if that
helps.

Cheers,
Dave

PS: if not all your data is UTF-8 you may need to convert it. In that
case you should check out Ruby’s iconv standard library.
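
A note for readers on current Ruby versions: iconv was removed from the standard library in Ruby 2.0, and String#encode covers the same conversion. The sketch below assumes, purely for illustration, that a legacy name arrived as ISO-8859-9 (Latin-5, Turkish) bytes; the actual source encoding of any given data set has to be known or detected separately.

```ruby
# Assumption for illustration: the legacy data is ISO-8859-9,
# where byte 0xF0 is the letter U+011F.
legacy = "Gazima\xF0usa"

# Interpreted as UTF-8, these bytes are not a valid sequence.
p legacy.valid_encoding? # => false

# Transcode from the assumed source encoding to UTF-8.
utf8 = legacy.encode("UTF-8", "ISO-8859-9")
p utf8.valid_encoding?   # => true
p utf8.bytes[6, 2]       # => [196, 159], i.e. \304\237 in octal
```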

Francis_H · September 23, 2006, 3:54am

I tried that and it made no difference. The data is in UTF-8 already.
As for the analyzer, we’re just using the StandardAnalyzer. (I don’t
actually know much about what the different analyzers do.) Any other
ideas?

Francis

Francis_H · September 23, 2006, 10:51am

Hi Francis,

I don’t really have any other ideas. Did you re-index the data after
you set ENV['LANG']? Could you try this code and tell me what you get:

require 'rubygems'
require 'ferret'
p Ferret::VERSION # 0.10.6
p Ferret::locale # "en_US.UTF-8"

index = Ferret::I.new()

index << {:place => "Gazima\304\237usa"}
index << {:place => "U.S.A."}
puts "Search: USA"
index.search_each("USA") {|id, score| puts index[id][:place]}
# Search: USA
# U.S.A.

puts "Search: Gazima\304\237usa"
index.search_each("Gazima\304\237usa") {|id, score| puts index[id][:place]}
# Search: Gazimağusa
# Gazimağusa

Cheers,
Dave

Francis_H · September 28, 2006, 6:38pm

In the end, setting ENV['LANG'] didn’t seem to have an effect, but
setting Ferret::locale directly seems to work:

Ferret::locale = 'en_US.UTF-8'

Thanks!

Francis