Forum: Ferret
Topic: Ferret and non-Latin character support

Reza Esmily (reza)
on 2007-04-08 03:07
I've successfully installed ferret and acts_as_ferret and have no
problem with UTF-8 for accented characters. It returns correct results
for e.g. français. My problem is with non-Latin characters (Persian,
specifically). I have tested different locales with no success, both on
Debian and Mac. Any ideas?
(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)
David Balmain (Guest)
on 2007-04-09 08:12
(Received via mailing list)
On 4/8/07, Reza Yeganeh <yeganeh.reza@gmail.com> wrote:
> I've successfully installed ferret and acts_as_ferret and have no
> problem with UTF-8 for accented characters. It returns correct results
> for e.g. français. My problem is with non-Latin characters (Persian,
> specifically). I have tested different locales with no success, both on
> Debian and Mac. Any ideas?
> (ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

Hi Reza,

I'm afraid I have no experience with Persian text. If you send me an
example of some text I'll have a look and see what I can do.

Cheers,
Dave
Reza Esmily (reza)
on 2007-04-09 11:11
David Balmain wrote:
> I'm afraid I have no experience with Persian text. If you send me an
> example of some text I'll have a look and see what I can do.

Hi David,
This is not specific to Persian, as I tested with more languages (Hebrew,
Japanese...). By the way, this is a Persian sample:
شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله می‌رود.

Thanks,
Reza
David Balmain (Guest)
on 2007-04-16 15:25
(Received via mailing list)
On 4/9/07, Reza Yeganeh <yeganeh.reza@gmail.com> wrote:
> David Balmain wrote:
> > I'm afraid I have no experience with Persian text. If you send me an
> > example of some text I'll have a look and see what I can do.
>
> Hi David,
> This is not specific to Persian, as I tested with more languages (Hebrew,
> Japanese...). By the way, this is a Persian sample:
> شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود.

Hi Reza,

Here is my test code:

    require 'rubygems'
    require 'ferret'

    text = "شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود."
    include Ferret::Analysis
    tokenizer = StandardAnalyzer.new.token_stream(:field, text)
    while token = tokenizer.next
      puts token
    end

And this is what I got as the output:

    token["شکرشکن":0:12:1]
    token["شوند":13:21:1]
    token["همه":22:28:1]
    token["طوطیان":29:41:1]
    token["هند":42:48:1]
    token["زین":50:56:1]
    token["قند":57:63:1]
    token["پارسی":64:74:1]
    token["Ú©Ù‡":75:79:1]
    token["به":80:84:1]
    token["بنگاله":85:97:1]
    token["میرود":98:108:1]

I guess this is probably the same as what you got but I'm not exactly
sure what is wrong with it. If you could explain what it should be
doing then I may be able to work out what is wrong.

Cheers,
Dave
Reza Esmily (reza)
on 2007-04-18 04:22
>     tokenizer = StandardAnalyzer.new.token_stream(:field, text)

Thanks Dave,
but StandardAnalyzer doesn't work for me for non-Latin text (the tokenizer
returns nil). I tested with edge Ferret and tried different Ferret.locale
settings. Can you guess what's wrong?

ruby 1.8.4 (2005-12-24) [powerpc-darwin8.6.0],
powerpc-apple-darwin8-gcc-4.0.1

Best,
Reza
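
Since Ferret.locale came up above, here is the usual shape of that experiment as a hedged sketch: Ferret's C tokenizer appears to be locale-sensitive, so explicitly selecting a UTF-8 locale before tokenizing is worth trying. The locale name below is an assumption; it must actually be installed on the machine.

    require 'rubygems'
    require 'ferret'

    # Assumption: an en_US.UTF-8 locale exists on this host; passing ""
    # instead picks up whatever LC_ALL/LANG is set in the environment.
    Ferret.locale = "en_US.UTF-8"

    include Ferret::Analysis
    text = "شکرشکن شوند همه طوطیان هند"
    tokenizer = StandardAnalyzer.new.token_stream(:field, text)
    while token = tokenizer.next
      puts token
    end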
Phillip Oertel (phillipoertel)
on 2007-04-22 00:10
I am seeing the same problem as Reza: tokenizer.next returns nil.

Another sample:

text = "^德国科隆大学,北京大学,清华大学,同济大学, University of Cologne"

returns only:
token["university":66:76:1]
token["cologne":80:87:2]


ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.8.2]
ferret 0.11.4

kind regards,
phillip
Phillip Oertel (phillipoertel)
on 2007-04-22 00:17
Same problem on our Debian servers :-(

* ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-linux]
* Linux s15215947 2.6.16-rc6-060319a #1 SMP Sun Mar 19 16:28:15 CET 2006
i686 GNU/Linux

kind regards,
phillip
Julio Cesar Ody (Guest)
on 2007-04-23 01:29
(Received via mailing list)
Hey Phillip,

I've been through a similar situation recently, and I think the
simplest way to make it work is to use a RegExpAnalyzer that treats
every character as a token. Mind you, this will have a negative impact
on the quality of your search results. Try this:

__BEGIN__
#!/usr/bin/ruby

require 'rubygems'
require 'ferret'

include Ferret

analyzer = Analysis::RegExpAnalyzer.new(/./, false)

i = Index::Index.new(:analyzer => analyzer)

i << { :content => "^德国科隆大学，北京大学，清华大学，同济大学, University of Cologne" }

puts i.search('科隆')
puts i.search('University')
puts i.search('of')
__END__
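
To make "every character as a token" concrete, the analyzer above can be pushed through the same token_stream loop David used earlier. A small sketch only; on ruby 1.8 the /./ pattern may split on bytes rather than characters depending on $KCODE, so treat the exact tokens as illustrative:

    require 'rubygems'
    require 'ferret'
    include Ferret

    $KCODE = 'u'  # assumption: ask ruby 1.8 to treat strings/regexps as UTF-8
    ts = Analysis::RegExpAnalyzer.new(/./, false).token_stream(:content, "科隆大学")
    while token = ts.next
      puts token   # one token per matched character
    end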
Reza Esmily (reza)
on 2007-04-23 01:42
> ... I think the simplest way to make it work is to use a RegExpAnalyzer that treats
> every character as a token.

David's code uses StandardAnalyzer. It's implemented in C and is fast
and advanced. I don't want to reinvent the wheel (e.g. handling things like
www.example.com, email addresses, punctuation, etc.). PerFieldAnalyzer is not
a good solution for me either (I have mixed text). Persian is very similar to
English in punctuation (it has some extra marks), word formation, and even stems.
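
For context on why PerFieldAnalyzer doesn't fit: it only switches analyzers per field, not per language inside a single field's text. A minimal sketch of its shape (the field names and content are illustrative):

    require 'rubygems'
    require 'ferret'
    include Ferret::Analysis

    # :content falls back to the default StandardAnalyzer; :tags gets a
    # whitespace-only analyzer. Mixed-script text inside one field still
    # goes through exactly one of these.
    pfa = PerFieldAnalyzer.new(StandardAnalyzer.new)
    pfa[:tags] = WhiteSpaceAnalyzer.new

    index = Ferret::Index::Index.new(:analyzer => pfa)
    index << { :content => "français mixed with فارسی", :tags => "persian utf-8" }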
Julio Cesar Ody (Guest)
on 2007-04-23 01:59
(Received via mailing list)
That's why it was mentioned as the simplest way, not the best way
performance-wise. It's worth mentioning that I'm using RegExpAnalyzer to
index information in an index of hundreds of thousands of documents, and
I'm not hitting any ceilings in terms of memory usage or performance.

StandardAnalyzer relies on spaces to find tokens, also taking stop
words and hyphens into consideration, right? Do correct me if I'm wrong.
I don't know how Persian "works", but if you have any expression that's
not space-separated, then unless your users are fortunate enough to query
for it in its entirety, they won't get any results back.

The best solution for the mixed-text scenario, as far as I can tell, is to
have an analyzer that's complex enough to work out the language of every
character/word and apply some sort of sub-analyzer for each language it
finds (a rough sketch of that idea follows below). This might require you
to perform many passes through the same string.

So to sum it up, it's not a matter of reinventing the wheel. It's a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn't assume any
"westernisms" to be there when tokenizing text.
Reza Esmily (reza)
on 2007-04-23 05:25
> So to sum it up, it's not a matter of reinventing the wheel. It's a
> quick hack that will get you imprecise results sometimes, but will
> work with mixed text for sure, since your analyzer doesn't assume any
> "westernisms" to be there when tokenizing text.

I think we're missing the point here. The problem is that David's code
uses StandardAnalyzer and it works for him, but not for me and Phillip.
I have to write my own Analyzer, StemFilter and StopFilter for Persian anyway.
If StandardAnalyzer works (even if only partially for Persian), I won't have
the extra overhead of using a RegExpAnalyzer for the common tokenizing of
Persian and Latin content.
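
For completeness, the kind of custom chain described above usually amounts to a tokenizer plus filters. A minimal sketch, assuming the underlying StandardTokenizer copes with the text once the locale question is sorted out; the class name and the (tiny) Persian stop-word list are invented for illustration, and stemming is left out because Ferret's bundled StemFilter wraps Snowball stemmers, which to my knowledge did not cover Persian at the time.

    require 'rubygems'
    require 'ferret'
    include Ferret::Analysis

    # Illustrative only: a handful of very common Persian function words.
    PERSIAN_STOP_WORDS = %w(و در به از که را با این آن)

    class PersianAnalyzer
      # Ferret treats any object with token_stream(field, text) as an analyzer.
      def token_stream(field, text)
        ts = StandardTokenizer.new(text)
        ts = LowerCaseFilter.new(ts)   # only affects Latin tokens in mixed text
        StopFilter.new(ts, PERSIAN_STOP_WORDS + FULL_ENGLISH_STOP_WORDS)
      end
    end

    index = Ferret::Index::Index.new(:analyzer => PersianAnalyzer.new)
    index << { :content => "زین قند پارسی که به بنگاله می‌رود" }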