Ferret and non-Latin character support

reza · April 8, 2007, 3:07am

I’ve successfully installed ferret and acts_as_ferret and have no
problem with utf-8 for accented characters. It returns correct results
for e.g. français. My problem is with non-Latin characters (Persian
indeed). I have tested different locales with no success both on Debian
and Mac. Any idea?
(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

reza · April 9, 2007, 11:11am

David B. wrote:

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Hi David,
This is not specific to Persian as I tested with more languages (Hebrew,
Japanese…). By the way this is a Persian sample:
شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله می‌رود.

Thanks,
Reza

reza · April 9, 2007, 8:12am

On 4/8/07, Reza Y. wrote:

I’ve successfully installed ferret and acts_as_ferret and have no
problem with utf-8 for accented characters. It returns correct results
for e.g. français. My problem is with non-Latin characters (Persian
indeed). I have tested different locales with no success both on Debian
and Mac. Any idea?
(ferret 0.11.4, acts_as_ferret 0.4.0, rails 1.1.6)

Hi Reza,

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Cheers,
Dave

reza · April 16, 2007, 3:25pm

On 4/9/07, Reza Y. wrote:

David B. wrote:

I’m afraid I have no experience with Persian text. If you send me an
example of some text I’ll have a look and see what I can do.

Hi David,
This is not specific to Persian as I tested with more languages (Hebrew,
Japanese…). By the way this is a Persian sample:
شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود.

Hi Reza,

Here is my test code:

require 'rubygems'
require 'ferret'

text = "شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود."
include Ferret::Analysis

# Tokenize the sample with StandardAnalyzer and print every token.
tokenizer = StandardAnalyzer.new.token_stream(:field, text)
while token = tokenizer.next
  puts token
end

And this is what I got as the output:

token["شکرشکن":0:12:1]
token["شوند":13:21:1]
token["همه":22:28:1]
token["طوطیان":29:41:1]
token["هند":42:48:1]
token["زین":50:56:1]
token["قند":57:63:1]
token["پارسی":64:74:1]
token["که":75:79:1]
token["به":80:84:1]
token["بنگاله":85:97:1]
token["میرود":98:108:1]

I guess this is probably the same as what you got but I’m not exactly
sure what is wrong with it. If you could explain what it should be
doing then I may be able to work out what is wrong.

Cheers,
Dave
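
For anyone comparing their own results with Dave's output: the start and end numbers appear to be byte offsets into the UTF-8 string rather than character positions, since each Persian letter takes two bytes. A minimal sketch in plain Ruby (1.9 or later, no Ferret required) that recomputes them. `letter_tokens` is a hypothetical helper, and the `\p{L}+` pattern is only an approximation of what StandardAnalyzer does, not its actual rules:

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ferret): split text into runs of
# Unicode letters and report each run with its UTF-8 byte offsets.
def letter_tokens(text)
  tokens = []
  text.scan(/\p{L}+/) do |word|
    # Bytes that precede the match give the starting byte offset.
    start_byte = Regexp.last_match.pre_match.bytesize
    tokens << [word, start_byte, start_byte + word.bytesize]
  end
  tokens
end

text = "شکرشکن شوند همه طوطیان هند. زین قند پارسی که به بنگاله میرود."

letter_tokens(text).each do |word, from, to|
  puts "token[#{word}:#{from}:#{to}]"
end
```

For the sample sentence the first token spans bytes 0 to 12 and the last one bytes 98 to 108, which agrees with the offsets Dave posted. If Ferret returns nil for the same input, the difference is more likely in how the analyzer classifies the characters than in the text itself.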

reza · April 18, 2007, 4:22am

tokenizer = StandardAnalyzer.new.token_stream(:field, text)

Thanks Dave,
but StandardAnalyzer doesn’t work for me for non-Latin text (tokenizer.next
returns nil). I tested with edge Ferret and tried different
Ferret.locale settings. Can you guess what’s wrong?

ruby 1.8.4 (2005-12-24) [powerpc-darwin8.6.0],
powerpc-apple-darwin8-gcc-4.0.1

Best,
Reza

reza · April 22, 2007, 12:10am

i am seeing the same problem as reza - tokenizer.next returns nil.

another sample

text = "^德国科隆大学，北京大学，清华大学，同济大学, University of Cologne"

returns only:
token["university":66:76:1]
token["cologne":80:87:2]

ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-darwin8.8.2]
ferret 0.11.4

kind regards,
phillip

reza · April 22, 2007, 12:17am

same problem on our debian servers

ruby 1.8.5 (2006-12-25 patchlevel 12) [i686-linux]
Linux s15215947 2.6.16-rc6-060319a #1 SMP Sun Mar 19 16:28:15 CET 2006
i686 GNU/Linux

kind regards,
phillip

reza · April 23, 2007, 1:29am

Hey Phillip,

I’ve been through a similar situation recently, and I think the
simplest way to make it work is to use a RegexpAnalyzer that takes
every character for a token. Mind this will have a negative impact on
the quality of your search results. Try this:

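BEGIN
#!/usr/bin/ruby

require 'rubygems'
require 'ferret'

include Ferret

# Treat every single character as its own token; do not lowercase.
analyzer = Analysis::RegExpAnalyzer.new(/./, false)

i = Index::Index.new(:analyzer => analyzer)
# Note: documents have to be added to the index (e.g. with i << text)
# before these searches can return any hits.

puts i.search('科隆')
puts i.search('University')
puts i.search('of')
END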
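
To make the trade-off concrete, here is a rough sketch in plain Ruby (1.9 or later, without Ferret) of what a one-character-per-token pattern does to mixed text. `char_tokens` is a hypothetical helper, and it only approximates how RegExpAnalyzer would apply `/./`:

```ruby
# encoding: utf-8

# Hypothetical helper (not Ferret API): every character matched by /./
# becomes its own token, including spaces and punctuation.
def char_tokens(text)
  text.scan(/./)
end

p char_tokens("科隆 of")
```

Because spaces and punctuation also become tokens and every letter is indexed on its own, a query matches any document that contains the same characters, which is where the loss of precision comes from.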

reza · April 23, 2007, 1:42am

… I think the simplest way to make it work is to use a RegexpAnalyzer that takes
every character for a token.

David’s code uses StandardAnalyzer, which is implemented in C and is fast
and advanced. I don’t want to reinvent the wheel (handling of URLs such as
www.example.com, emails, punctuation, etc.). PerFieldAnalyzer is not a good
solution for me either, since I have mixed text. Persian is very similar to
English in punctuation (it has some extra marks), word formation, and even
stemming.

reza · April 23, 2007, 1:59am

That’s why it was described as the simplest way, not the best way
performance-wise. It’s worth mentioning that I’m using RegExpAnalyzer to
index information in an index of hundreds of thousands of documents, and
I’m not hitting any limits in terms of memory usage or performance.

StandardAnalyzer relies on spaces to find tokens, while also taking stop
words and hyphens into consideration, right? Do correct me if I’m wrong.
I don’t know how Persian “works”, but if you have any expression
that’s not space-separated, your users won’t get any results back unless
they happen to query for it in its entirety.

The best solution for a mixed-text scenario, as far as I can tell, is to
have an analyzer that’s complex enough to find out the language for
every character/word, and apply some sort of sub-analyzer for each
language it finds. This might require you to perform many passes
through the same string.
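
A very rough sketch of that idea in plain Ruby (1.9 or later), done in a single pass instead of several: Han characters become one token each, while runs of letters and digits in any other script (Latin, Persian, Hebrew) are kept as whole words. `mixed_tokens` and `MIXED_TOKEN` are hypothetical names, nothing here is Ferret API, and real language detection, stemming and stop words are left out:

```ruby
# encoding: utf-8

# Either a single Han character, or a run of letters/digits that
# contains no Han characters.
MIXED_TOKEN = /\p{Han}|(?:(?!\p{Han})[\p{L}\p{N}])+/

# Hypothetical helper: script-aware tokenization of mixed text.
def mixed_tokens(text)
  text.scan(MIXED_TOKEN).map(&:downcase)
end

p mixed_tokens("科隆大学, University of Cologne")
p mixed_tokens("قند پارسی sugar")
```

The first call yields four single-character tokens followed by the three lowercased English words; the second keeps the two Persian words whole. A real analyzer would still need per-language stop lists and stemming on top of this.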

So to sum it up, it’s not a matter of reinventing the wheel. It’s a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn’t assume any
“westernisms” to be there when tokenizing text.

reza · April 23, 2007, 5:25am

So to sum it up, it’s not a matter of reinventing the wheel. It’s a
quick hack that will get you imprecise results sometimes, but will
work with mixed text for sure, since your analyzer doesn’t assume any
“westernisms” to be there when tokenizing text.

I think we’re missing the point here. The problem is that David’s code
uses StandardAnalyzer and it works for him, but not for me or Phillip.
I have to write my own Analyzer, StemFilter and StopFilter for Persian.
If StandardAnalyzer worked (even if only partially for Persian), I wouldn’t
have the extra overhead of using RegExpAnalyzer for the common tokenizing
of Persian and Latin text.