Ferret and non-Latin character support


#1

I’ve successfully installed ferret and acts_as_ferret and have no
problem with UTF-8 for accented characters. It returns correct results
for e.g. français. My problem is with non-Latin characters (Persian,
specifically). I have tested different locales with no success, both on
Debian and Mac. Any ideas?
(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)


#2

David B. wrote:

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Hi David,
This is not specific to Persian, as I tested with more languages (Hebrew,
Japanese…). By the way, this is a Persian sample:
شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله می‌رود.

Thanks,
Reza


#3

On 4/8/07, Reza Y. removed_email_address@domain.invalid wrote:

I’ve successfully installed ferret and acts_as_ferret and have no
problem with UTF-8 for accented characters. It returns correct results
for e.g. français. My problem is with non-Latin characters (Persian,
specifically). I have tested different locales with no success, both on
Debian and Mac. Any ideas?
(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

Hi Reza,

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Cheers,
Dave


#4

On 4/9/07, Reza Y. removed_email_address@domain.invalid wrote:

David B. wrote:

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Hi David,
This is not specific to Persian, as I tested with more languages (Hebrew,
Japanese…). By the way, this is a Persian sample:
شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود.

Hi Reza,

Here is my test code:

require 'rubygems'
require 'ferret'

text = "شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود."
include Ferret::Analysis
# Run the sample through StandardAnalyzer and print every token it emits.
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

And this is what I got as the output:

token["شکرشکن":0:12:1]
token["شوند":13:21:1]
token["همه":22:28:1]
token["طوطیان":29:41:1]
token["هند":42:48:1]
token["زین":50:56:1]
token["قند":57:63:1]
token["پارسی":64:74:1]
token["که":75:79:1]
token["به":80:84:1]
token["بنگاله":85:97:1]
token["میرود":98:108:1]

I guess this is probably the same as what you got but I’m not exactly
sure what is wrong with it. If you could explain what it should be
doing then I may be able to work out what is wrong.

Cheers,
Dave
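
A note on reading that output: the two numbers after each token appear to be byte offsets rather than character offsets, since every Persian letter in the sample takes two bytes in UTF-8 (six letters give 0:12). The sketch below is plain Ruby and does not call Ferret at all. It assumes Ruby 1.9 or later with a UTF-8 source file, and letter_tokens is a hypothetical helper name. Splitting on runs of Unicode letters and counting bytes should line up with the start:end pairs listed above.

```ruby
# Plain-Ruby sketch (no Ferret involved); assumes Ruby 1.9+ and UTF-8 source.
text = "شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود."

# Returns [word, start_byte, end_byte] for every run of Unicode letters.
# letter_tokens is a hypothetical helper, not part of any library.
def letter_tokens(text)
  tokens = []
  text.scan(/\p{L}+/) do |word|
    # Bytes before the current match give the token's starting byte offset.
    start_byte = Regexp.last_match.pre_match.bytesize
    tokens << [word, start_byte, start_byte + word.bytesize]
  end
  tokens
end

letter_tokens(text).each do |word, start_byte, end_byte|
  puts "#{word} #{start_byte}:#{end_byte}"
end
```

This only mimics the word-splitting part. It has no stop words and none of the URL or email handling that StandardAnalyzer has.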


#5
tokenizer = StandardAnalyzer.new.token_stream(:field, text)

Thanks Dave,
but StandardAnalyzer doesn’t work for me with non-Latin text
(tokenizer.next returns nil). I tested with edge Ferret and tried
different Ferret.locale settings. Can you guess what’s wrong?

ruby 1.8.4 (2005-12-24) [powerpc-darwin8.6.0],
powerpc-apple-darwin8-gcc-4.0.1

Best,
Reza
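
Since the same code behaves differently on different machines, one thing worth comparing is the locale each process inherits from its environment. Whether that is the actual cause here is not confirmed anywhere in this thread. The sketch below is plain Ruby, and effective_ctype_locale and utf8_locale? are hypothetical helper names, not Ferret APIs. It only applies the POSIX precedence rule for character classification (LC_ALL, then LC_CTYPE, then LANG) to the environment variables.

```ruby
# Plain-Ruby diagnostic sketch; the helper names are hypothetical.
# POSIX precedence for character classification: LC_ALL, LC_CTYPE, LANG.
def effective_ctype_locale(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return [name, value] if value && !value.empty?
  end
  # With nothing set, a process falls back to the C/POSIX locale.
  ["(none)", "C"]
end

# Rough check on the locale name only; it does not ask the C library.
def utf8_locale?(locale)
  !!(locale =~ /utf-?8/i)
end

name, value = effective_ctype_locale
puts "character locale comes from #{name}: #{value}"
puts "looks like UTF-8: #{utf8_locale?(value)}"
```

Running this on a machine where tokenizing works and on one where it does not would at least show whether the environments differ.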


#6

i am seeing the same problem as reza - tokenizer.next returns nil.

another sample

text = "^德国科隆大学,北京大学,清华大学,同济大学, University of Cologne"

returns only:
token["university":66:76:1]
token["cologne":80:87:2]

ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.8.2]
ferret 0.11.4

kind regards,
phillip


#7

same problem on our Debian servers :(

  • ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-linux]
  • Linux s15215947 2.6.16-rc6-060319a #1 SMP Sun Mar 19 16:28:15 CET 2006
    i686 GNU/Linux

kind regards,
phillip


#8

Hey Phillip,

I’ve been through a similar situation recently, and I think the
simplest way to make it work is to use a RegExpAnalyzer that takes
every character as a token. Mind that this will have a negative impact
on the quality of your search results. Try this:

BEGIN
#!/usr/bin/ruby

require 'rubygems'
require 'ferret'

include Ferret

# Treat every single character as its own token; the second argument
# (false) turns off lowercasing.
analyzer = Analysis::RegExpAnalyzer.new(/./, false)

i = Index::Index.new(:analyzer => analyzer)

i << { :content => "^德国科隆大学,北京大学,清华大学,同济大学, University of Cologne" }

puts i.search('科隆')
puts i.search('University')
puts i.search('of')
END
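
To make the trade-off concrete, here is a plain-Ruby sketch of what one-character-per-token indexing amounts to. It does not call Ferret, and build_index and phrase_match? are hypothetical helpers written only for illustration. A query matches when its characters occur at consecutive positions, which is why unsegmented Chinese text becomes searchable and also why matches can straddle word boundaries.

```ruby
# Plain-Ruby sketch of one-character-per-token indexing (no Ferret calls).
# build_index and phrase_match? are hypothetical helpers for illustration.

# Maps every character to the list of positions where it occurs.
def build_index(text)
  index = Hash.new { |hash, char| hash[char] = [] }
  text.scan(/./).each_with_index { |char, pos| index[char] << pos }
  index
end

# True when the query's characters occur at consecutive positions.
def phrase_match?(index, query)
  chars = query.scan(/./)
  return false if chars.empty?
  index.fetch(chars.first, []).any? do |start|
    chars.each_with_index.all? do |char, offset|
      index.fetch(char, []).include?(start + offset)
    end
  end
end

index = build_index("^德国科隆大学,北京大学,清华大学,同济大学, University of Cologne")

puts phrase_match?(index, "科隆")       # a real word (Cologne)
puts phrase_match?(index, "University")
puts phrase_match?(index, "国科")       # straddles two words, yet still matches
```

The last query is the kind of imprecise hit mentioned above: the two characters belong to different words but happen to sit next to each other.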


#9

… I think the simplest way to make it work is to use a RegExpAnalyzer that takes
every character as a token.

David’s code uses StandardAnalyzer. It’s implemented in C and is fast
and advanced. I don’t want to reinvent the wheel (e.g. www.example.com,
emails, punctuation etc.). PerFieldAnalyzer is not a good solution for
me either (I have mixed text). Persian is very similar to English in
punctuation (it has some extra marks), word formation, and even stems.


#10

That’s why it was mentioned as the simplest way, not the best way
performance-wise. It’s worth mentioning that I’m using RegExpAnalyzer
to index some information in an index of hundreds of thousands of
documents, and I’m not hitting any ceilings in terms of memory usage
or performance.

StandardAnalyzer relies on spaces to find tokens, while also taking
stop words and hyphens into consideration, right? Do correct me if I’m
wrong. I don’t know how Persian “works”, but if you have any expression
that’s not space-separated, your users won’t get any results back
unless they happen to query for the entire expression.

The best solution for a mixed-text scenario, as far as I can tell, is
to have an analyzer that’s complex enough to find out the language of
every character/word and apply some sort of sub-analyzer for each
language it finds. This might require you to perform many passes
through the same string.
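
A rough illustration of that idea in plain Ruby, not a Ferret analyzer: it splits the text into runs by Unicode script (a stand-in for real language detection) and applies a different rule to each run. It assumes Ruby 1.9 or later, whose regexps support script properties such as \p{Han}. The names SCRIPT_RUNS, script_runs and tokenize_mixed are hypothetical, and the per-script rules are placeholders.

```ruby
# Plain-Ruby sketch of per-script dispatch; all names here are hypothetical.
# Script is used as a crude stand-in for language detection.
SCRIPT_RUNS = /(?<han>\p{Han}+)|(?<arabic>\p{Arabic}+)|(?<latin>\p{Latin}+)/

# Returns [[script, run], ...] for each maximal run of a single script.
def script_runs(text)
  runs = []
  text.scan(SCRIPT_RUNS) do
    match = Regexp.last_match
    script = match.names.find { |name| match[name] }
    runs << [script.to_sym, match[0]]
  end
  runs
end

# Applies a placeholder "sub-analyzer" to each run.
def tokenize_mixed(text)
  script_runs(text).flat_map do |script, run|
    case script
    when :han   then run.chars      # unsegmented script: one token per character
    when :latin then [run.downcase] # space-separated script: lowercase the word
    else             [run]          # Arabic-script words are kept as they are
    end
  end
end

puts tokenize_mixed("北京大学 University of Cologne قند پارسی").inspect
```

This version gets by with a single pass because the script of each character is known from the character itself. Telling apart languages that share a script, such as Persian and Arabic, would need more than this.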

So to sum it up, it’s not a matter of reinventing the wheel. It’s a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn’t assume any
“westernisms” to be there when tokenizing text.


#11

So to sum it up, it’s not a matter of reinventing the wheel. It’s a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn’t assume any
“westernisms” to be there when tokenizing text.

I think we’re missing the point here. The problem is that David’s code
uses StandardAnalyzer and it works for him, but not for me and Phillip.
I have to write my own Analyzer, StemFilter and StopFilter for Persian.
If StandardAnalyzer works (even if only partially for Persian), I won’t
have the extra overhead of using RegExpAnalyzer for common tokenizing of
Persian and Latin text.