Weird analyzer issue with the word 'fly'

Hi all

I’m using a_a_f in rails with a StemmingAnalyzer, in the index and in my
search. I got the idea from this topic:
http://www.ruby-forum.com/topic/80178

I’m having a problem with some search terms - I narrowed one of them
down to the inclusion of the word ‘fly’. Can anyone give me any clues
as to what might be happening, or even how I can investigate?

My index is set up like this:

acts_as_ferret({ :store_class_name => true,
                 :analyzer => Ferret::Analysis::StemmingAnalyzer.new,
                 :fields => { :name => { :boost => 2.0 },

                            } })

And this analyzer is defined in a module thus:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      StemFilter.new(StandardTokenizer.new(text))
    end
  end
end

Now, here’s a search without using the analyzer:

TeachingObject.find_with_ferret("flea fly", :per_page => 2000).size
=> 14

And with the analyzer:

TeachingObject.find_with_ferret("flea fly", :per_page => 2000, :analyzer => Ferret::Analysis::StemmingAnalyzer.new).size
=> 0

Now, for other searches, the analyzer seems to be doing its job nicely.
E.g. I have lots of resources with the word ‘brass’. With the analyzer, a
search for ‘brasses’ brings all these resources back, while without the
analyzer I don’t get any of them: that’s all fine, it’s working out
that ‘brasses’ and ‘brass’ are equivalent searches.

So what’s going on with the word ‘fly’? It’s definitely this word,
because if I rename one of the “flea fly” resources to “flea
walk” then a search for ‘flea walk’ brings it back, as does a search for
‘flea walks’.

I’m guessing that the analyzer takes a word and converts it into other
terms, or some symbols or something, and searches with that combined
set, and during this process the original word ‘fly’ gets lost somewhere.
But I don’t know where to look to monitor this process.

Any help/advice/clues very welcome…

thanks
max

Just a bit more info - I started to look at what’s going on in the
analyzer by putting a bit of logging in:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field}, text = #{text}"
      StemFilter.new(StandardTokenizer.new(text))
    end
  end
end

And, i see these results for a single search on “flea fly”:

SEARCHING, field = property_ancestor_names, text = flea
SEARCHING, field = description, text = flea
SEARCHING, field = name, text = flea
SEARCHING, field = keyword_string, text = flea
SEARCHING, field = property_ids_string, text = flea
SEARCHING, field = property_names, text = flea
SEARCHING, field = unaccented_name, text = flea
SEARCHING, field = property_titles, text = flea
SEARCHING, field = resource_id, text = flea

That’s one call to token_stream for each of my indexed fields, but each
call only gets the first word of the search! Now I’m even more confused…

Hi Max!

On 09.04.2009, at 13:45, Max W. wrote:

I’m having a problem with some search terms - I narrowed one of them
down to the inclusion of the word ‘fly’. Can anyone give me any clues
as to what might be happening, or even how I can investigate?

First of all I’d have a look at what the analyzer does to your query
terms:

ts = Ferret::Analysis::StemmingAnalyzer.new.token_stream nil, 'flea fly'
while token = ts.next
  puts token
end

For some reason the word ‘fly’ is turned into ‘fli’ by the analyzer.
But that’s ok, as long as it works the same way at indexing time. Next
use the ferret_browser tool to inspect your index and check whether
the term ‘fli’ really appears in your index. I doubt that, because if
this was the case everything would work as expected. So I guess we
have a problem with the analysis at indexing time.
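To illustrate why analysis has to match at indexing time and at query time, here is a minimal pure-Ruby sketch. It does not use Ferret at all: `toy_stem`, `tokens`, `build_index` and `search` are hypothetical helpers, the two stemming rules are a toy stand-in chosen only to mimic the 'fly' -> 'fli' and 'brasses' -> 'brass' rewrites mentioned in this thread, and the all-terms-must-match search is an assumption.

```ruby
# Toy stand-in for a stemming filter; NOT Ferret's StemFilter.
# It knows just two rules, enough to mimic the rewrites seen in this thread.
def toy_stem(word)
  w = word.downcase
  w = w.sub(/es\z/, '') if w.length > 4               # brasses -> brass
  w = w.sub(/([^aeiou])y\z/, '\1i') if w.length > 2   # fly -> fli
  w
end

# Split text into lowercase words, optionally stemming each one.
def tokens(text, stem:)
  words = text.downcase.scan(/[a-z]+/)
  stem ? words.map { |w| toy_stem(w) } : words
end

# Build an inverted index: term => list of document ids.
def build_index(docs, stem:)
  index = Hash.new { |h, k| h[k] = [] }
  docs.each do |id, text|
    tokens(text, stem: stem).uniq.each { |t| index[t] << id }
  end
  index
end

# AND search (an assumption): every query term must occur in the document.
def search(index, query, stem:)
  tokens(query, stem: stem).map { |t| index.fetch(t, []) }.reduce(:&) || []
end

docs = { 1 => 'Flea Fly', 2 => 'Brass band' }

plain_index   = build_index(docs, stem: false)  # indexed without stemming
stemmed_index = build_index(docs, stem: true)   # indexed with stemming

p search(plain_index,   'flea fly', stem: false)  # => [1]
p search(plain_index,   'flea fly', stem: true)   # => []  ('fli' is not in the index)
p search(stemmed_index, 'flea fly', stem: true)   # => [1]
p search(plain_index,   'brasses',  stem: true)   # => [2] ('brass' is its own stem)
```

The last line suggests why 'brasses' can still match an unstemmed index: the stem of the query happens to equal the unstemmed indexed word, which is not the case for 'fly'.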

My index is set up like this:

acts_as_ferret({ :store_class_name => true,
                 :analyzer => Ferret::Analysis::StemmingAnalyzer.new,
                 :fields => { :name => { :boost => 2.0 },

                            } })

Now that I look at this a second time, the problem seems quite
obvious :) The analyzer option needs to be given as part of a
separate ferret options hash like this:

acts_as_ferret :store_class_name => true,
               :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
               :fields => { … }

rebuild your index and everything should be working as expected.
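As a rough picture of how a misplaced option can go unnoticed, here is a small sketch of nested option handling. This is not acts_as_ferret's actual code: `index_settings`, `DEFAULT_ANALYZER` and the `:ignored` key are hypothetical names, and the only thing taken from this thread is that the analyzer has to live under the `:ferret` key.

```ruby
# Hypothetical option handling; NOT acts_as_ferret's actual implementation.
DEFAULT_ANALYZER = :standard

# Only options[:ferret][:analyzer] is consulted; a top-level :analyzer key
# is never read, so the default analyzer is used without any error.
def index_settings(options)
  ferret_options = options.fetch(:ferret, {})
  {
    analyzer: ferret_options.fetch(:analyzer, DEFAULT_ANALYZER),
    ignored:  options.keys - [:ferret, :fields, :store_class_name]
  }
end

wrong = index_settings(:store_class_name => true, :analyzer => :stemming)
right = index_settings(:store_class_name => true,
                       :ferret => { :analyzer => :stemming })

p wrong  # => {:analyzer=>:standard, :ignored=>[:analyzer]} on Ruby < 3.4
p right  # => {:analyzer=>:stemming, :ignored=>[]} on Ruby < 3.4
```

In this sketch the index would be built with the default analyzer while searches pass the stemming analyzer explicitly, which would produce the mismatch described above.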

Cheers,
Jens


Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/ - The new free film database

2009/4/9 Jens K.:

Hi Max!

Hi Jens, thanks for responding so quickly.

For some reason the word ‘fly’ is turned into ‘fli’ by the analyzer.

Indeed it is:

ts = Ferret::Analysis::StemmingAnalyzer.new.token_stream nil, 'flea fly'
=> #<Ferret::Analysis::StemFilter:0xb48b3b48>
while token = ts.next
  puts token
end
token["flea":0:4:1]
token["fli":5:8:1]

But that’s ok, as long as it works the same way at indexing time. Next use
the ferret_browser tool to inspect your index and check whether the term
‘fli’ really appears in your index

I’ve not seen this tool before, it sounds useful - would you mind
pointing me at some docs for it? I can find the class in the ferret
rdoc but there’s no explanation for it as far as I can see.

acts_as_ferret :store_class_name => true,
               :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
               :fields => { … }

rebuild your index and everything should be working as expected.

It is indeed! Thanks very much Jens, I really appreciate the support.

Hope you have a great easter weekend!
cheers
max

Hi!

On 09.04.2009, at 16:29, Max W. wrote:
[…]

I’ve not seen this tool before, it sounds useful - would you mind
pointing me at some docs for it? I can find the class in the
ferret rdoc but there’s no explanation for it as far as i can see.

ferret_browser is a standalone web application that gets installed
along with ferret. Just run it with

  ferret_browser path/to/index

and point your browser to the URL shown in the output. It should be
pretty self-explanatory from there.

Hope you have a great easter weekend!

Thank you, and the same to you!

Cheers,
Jens

