Forum: Ferret Weird analyzer issue with the word 'fly'

Max Williams (max-williams)
on 2009-04-09 13:45
Hi all

I'm using acts_as_ferret (a_a_f) in Rails with a StemmingAnalyzer, both
for the index and for my searches.  I got the idea from this topic:
http://www.ruby-forum.com/topic/80178

I'm having a problem with some search terms - I narrowed one of them
down to the inclusion of the word 'fly'.  Can anyone give me any clues
as to what might be happening, or even how I can investigate?

My index is set up like this:

  acts_as_ferret({ :store_class_name => true,
                   :analyzer => Ferret::Analysis::StemmingAnalyzer.new,
                   :fields => {:name =>            { :boost => 2.0 },
                               ...
                }})

And this analyzer is defined in a module thus:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      StemFilter.new(StandardTokenizer.new(text))
    end
  end
end


Now, here's a search without using the analyzer:

>> TeachingObject.find_with_ferret("flea fly", :per_page => 2000).size
=> 14

And with the analyzer:

>> TeachingObject.find_with_ferret("flea fly", :per_page => 2000, :analyzer => Ferret::Analysis::StemmingAnalyzer.new).size
=> 0

Now, for other searches, the analyzer seems to be doing its job nicely.
E.g. I have lots of resources with the word 'brass'.  With the analyzer, a
search for 'brasses' brings all these resources back, while without the
analyzer i don't get any of them:  that's all fine, it's working out
that 'brasses' and 'brass' are equivalent searches.

So what's going on with the word 'fly'?  It's definitely this word
because if i change one of the "flea fly" resources to be called "flea
walk" then a search for 'flea walk' brings it back, as does a search for
'flea walks'.

I'm guessing that the analyzer takes a word and converts it into other
terms, or some symbols or something, and searches with that combined
set, and during this process the original word 'fly' gets lost somewhere.
But, i don't know where to look to monitor this process.

Any help/advice/clues very welcome...

thanks
max
Max Williams (max-williams)
on 2009-04-09 14:13
Just a bit more info - i started to look at what's going on in the
analyzer by putting a bit of logging in:

module Ferret::Analysis
  class StemmingAnalyzer
    def token_stream(field, text)
      RAILS_DEFAULT_LOGGER.debug "SEARCHING, field = #{field}, text = #{text}"
      StemFilter.new(StandardTokenizer.new(text))
    end
  end
end

And, i see these results for a single search on "flea fly":

SEARCHING, field = property_ancestor_names, text = flea
SEARCHING, field = description, text = flea
SEARCHING, field = name, text = flea
SEARCHING, field = keyword_string, text = flea
SEARCHING, field = property_ids_string, text = flea
SEARCHING, field = property_names, text = flea
SEARCHING, field = unaccented_name, text = flea
SEARCHING, field = property_titles, text = flea
SEARCHING, field = resource_id, text = flea

One call to token_stream for each of my indexed fields, but each one
only gets the first word of the search!  Now I'm even more confused...
Jens Krämer (jkraemer)
on 2009-04-09 15:59
(Received via mailing list)
Hi Max!

On 09.04.2009, at 13:45, Max Williams wrote:
>
> I'm having a problem with some search terms - i narrowed one of them
> down to the inclusion of the word 'fly'.  Can anyone give me any clues
> at to what might be happening, or even how i can investigate?

First of all I'd have a look at what the analyzer does to your query
terms:

ts = Ferret::Analysis::StemmingAnalyzer.new.token_stream nil, 'flea fly'
while token = ts.next
  puts token
end

For some reason the word 'fly' is turned into 'fli' by the analyzer.
But that's ok, as long as it works the same way at indexing time. Next
use the ferret_browser tool to inspect your index and check whether
the term 'fli' really appears in your index. I doubt that, because if
this was the case everything would work as expected. So I guess we
have a problem with the analysis at indexing time.
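
To illustrate why a mismatch between indexing-time and query-time analysis
gives zero hits, here is a minimal sketch in plain Ruby. It does not use
Ferret at all: toy_stem, build_index and search are hypothetical helpers,
the stemming rules are a toy stand-in (not the real stemmer), and the search
assumes every query term must match.

```ruby
require 'set'

# Toy stand-in for a stemmer. NOT the algorithm Ferret uses; it only
# covers the cases discussed in this thread.
def toy_stem(word)
  w = word.downcase
  w = w.sub(/sses\z/, 'ss')                    # "brasses" -> "brass"
  w = w.sub(/([^s])s\z/, '\1')                 # "walks"   -> "walk"
  w.sub(/\A(.+[^aeiouy])y\z/, '\1i')           # "fly"     -> "fli"
end

def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Hypothetical inverted index: term => set of document ids.
def build_index(docs, stem:)
  index = Hash.new { |h, k| h[k] = Set.new }
  docs.each do |id, text|
    tokenize(text).each { |tok| index[stem ? toy_stem(tok) : tok] << id }
  end
  index
end

# Assumption: all query terms are required (AND semantics).
def search(index, query, stem:)
  terms = tokenize(query).map { |t| stem ? toy_stem(t) : t }
  terms.map { |t| index.fetch(t, Set.new) }.reduce(:&).to_a
end

docs = { 1 => "Flea Fly", 2 => "Brass band" }
plain_index   = build_index(docs, stem: false)  # indexed without stemming
stemmed_index = build_index(docs, stem: true)   # indexed with stemming

p search(plain_index,   "flea fly", stem: false)  # [1]
p search(plain_index,   "flea fly", stem: true)   # []  ('fli' is not in the index)
p search(plain_index,   "brasses",  stem: true)   # [2] ('brass' is stored unchanged)
p search(stemmed_index, "flea fly", stem: true)   # [1]
```

In this toy model, a stemmed query against an unstemmed index still finds
'brasses', because 'brass' is stored unchanged, but it misses 'fly', because
only the query side turns it into 'fli'. That matches the symptoms described
above.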

> My index is set up like this:
>
> acts_as_ferret({ :store_class_name => true,
>                  :analyzer => Ferret::Analysis::StemmingAnalyzer.new,
>                  :fields => {:name =>            { :boost => 2.0 },
>                              ...
>               }})

Now that I look at this a second time, the problem seems quite
obvious :-) The analyzer option needs to be given as part of a
separate ferret options hash like this:

acts_as_ferret :store_class_name => true,
               :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
               :fields => { ... }

Rebuild your index and everything should be working as expected.


Cheers,
Jens


--
Jens Krämer
Finkenlust 14, 06449 Aschersleben, Germany
VAT Id DE251962952
http://www.jkraemer.net/ - Blog
http://www.omdb.org/     - The new free film database
Max Williams (max-williams)
on 2009-04-09 16:56
(Received via mailing list)
2009/4/9 Jens Kraemer <jk@jkraemer.net>

> Hi Max!

Hi Jens, thanks for responding so quickly.

> For some reason the word 'fly' is turned into 'fli' by the analyzer.


Indeed it is:
>> ts = Ferret::Analysis::StemmingAnalyzer.new.token_stream nil, 'flea fly'
=> #<Ferret::Analysis::StemFilter:0xb48b3b48>
>> while token = ts.next
>>  puts token
>> end
token["flea":0:4:1]
token["fli":5:8:1]



> But that's ok, as long as it works the same way at indexing time. Next use
> the ferret_browser tool to inspect your index and check whether the term
> 'fli' really appears in your index


I've not seen this tool before, it sounds useful - would you mind
pointing me at some docs for it?   I can find the class in the ferret
rdoc but there's no explanation for it as far as I can see.


> acts_as_ferret :store_class_name => true,
>                :ferret => { :analyzer => Ferret::Analysis::StemmingAnalyzer.new },
>                :fields => { ... }
>
> rebuild your index and everything should be working as expected.


It is indeed!   Thanks very much Jens, I really appreciate the support.

Hope you have a great easter weekend!
cheers
max
Jens Krämer (jkraemer)
on 2009-04-09 22:04
(Received via mailing list)
Hi!

On 09.04.2009, at 16:29, Max Williams wrote:
[..]
>
> I've not seen this tool before, it sounds useful - would you mind
> pointing me at some docs for it?   I can find the class in the
> ferret rdoc but there's no explanation for it as far as i can see.

ferret_browser is a standalone web application that gets installed
along with ferret. Just run it with
ferret_browser path/to/index
and point your browser to the URL shown in the output. It should be
pretty self-explanatory from there.

>
> Hope you have a great easter weekend!

Thank you, and the same to you!


Cheers,
Jens

