Stop words, fields, StandardAnalyzer quagmire


#1

Hello,

I’m using: Ruby 1.8.6, Rails 1.2.3, ferret 0.11.4, acts_as_ferret from
svn stable.

I’ve had quite a day wrestling with removing stop words. The problem
was that searching for words like “no” or “the” returned no results.
Along the way I ran into some confusing behavior that took me a while
to figure out, and I hope sharing it saves someone else some time.

From searching around online and in the source code I came up with the
following config in my ActiveRecord model:

acts_as_ferret({ :fields => { :name => {:boost => 10},
                              :type => {:boost => 2},
                              :email => {:boost => 10},
                              :bio => {:store => :no},
                              :status_id => {:boost => 1} },
                 :store_class_name => true,
                 :remote => true,
                 :ferret => { :analyzer => Ferret::Analysis::StandardAnalyzer.new([]) }
               })
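
As a rough illustration of why the empty array passed to the analyzer matters, here is a minimal plain-Ruby sketch of stop-word filtering. It does not use Ferret at all; the `analyze` helper and the short stop list are hypothetical stand-ins, not Ferret's actual tokenizer or its real stop-word list.

```ruby
# Hypothetical stop list for illustration only (not Ferret's real list).
stop_words = %w[a an and no not of or the to]

# Lowercase the text, split it into word tokens, and drop every token
# that appears in the given stop list.
def analyze(text, stop_list)
  text.downcase.scan(/\w+/).reject { |token| stop_list.include?(token) }
end

with_stops    = analyze("No results for the query", stop_words)
without_stops = analyze("No results for the query", [])

p with_stops     # ["results", "for", "query"]
p without_stops  # ["no", "results", "for", "the", "query"]
```

With a non-empty stop list, “no” and “the” never reach the index or the query, so nothing can match them. An empty list keeps every token.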

With the StandardAnalyzer added, I do find results with “no” or “the”.
The complicating factor is that as you can see, I have a field
“status_id”. This field lets me filter for profiles that are
published or draft in my CMS.

Before I added the StandardAnalyzer, the status_id field worked fine
in queries like this:

a = Profile.find_by_contents("smith status_id:100")
a.total_hits
=> 2 # this is correct, only 2 are published

a = Profile.find_by_contents("smith")
a.total_hits
=> 4 # this is correct, there are 4 total

So, you can see that the status_id was automatically “AND”-ed to the
query word.

However, after adding the above StandardAnalyzer config, the status_id
was now “OR”-ed, like so:

a = Profile.find_by_contents("no")
a.total_hits
=> 5 # this is good

a = Profile.find_by_contents("no status_id:100")
a.total_hits
=> 208 # this is bad: it’s the same as if I only searched for status_id:100.

a = Profile.find_by_contents("smith status_id:100")
a.total_hits
=> 208 # this is just as bad: it’s the same as if I only searched for status_id:100.

The fix here is to add the AND keyword explicitly to the query:

a = Profile.find_by_contents("smith AND status_id:100")
a.total_hits
=> 2 # works just like before.

In fact, OR becomes the default search regardless of whether I use a
field in the query:

a = Profile.find_by_contents("smith jones")
a.total_hits
=> 5 # OR’ed results

a = Profile.find_by_contents("smith AND jones")
a.total_hits
=> 0

Again, before StandardAnalyzer, “AND” was the default so the first
“smith jones” query would have returned 0 as it should.
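
To make the difference between the two defaults concrete, here is a small plain-Ruby sketch of a toy inverted index. It is not how Ferret is implemented; the documents, `build_index`, and `search` are all made up for illustration. An OR default takes the union of the per-term hits, an AND default takes the intersection.

```ruby
require "set"

# Hypothetical documents: id => text.
docs = {
  1 => "john smith published",
  2 => "mary smith draft",
  3 => "tom jones published"
}

# Build a toy inverted index: term => Set of ids of documents containing it.
def build_index(documents)
  index = Hash.new { |hash, term| hash[term] = Set.new }
  documents.each do |id, text|
    text.split.each { |term| index[term] << id }
  end
  index
end

# Combine the per-term hit sets with union (:or) or intersection (:and).
def search(index, terms, operator)
  postings = terms.map { |term| index.fetch(term, Set.new) }
  return Set.new if postings.empty?
  operator == :or ? postings.reduce(:|) : postings.reduce(:&)
end

index = build_index(docs)

puts search(index, %w[smith jones], :or).size       # 3
puts search(index, %w[smith jones], :and).size      # 0
puts search(index, %w[smith published], :and).size  # 1
```

This mirrors the counts above: with OR, adding a second term can only widen the result set, which is why the status_id filter stopped narrowing anything.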

Any insight as to why this might be? I would prefer AND to be the
default.

Thanks,

Doug


#2

Hi!

On Fri, May 04, 2007 at 05:50:39PM -0700, Doug S. wrote:

Hello,

I’m using: Ruby 1.8.6, Rails 1.2.3, ferret 0.11.4, acts_as_ferret from
svn stable.
[…]

With the StandardAnalyzer added, I do find results with “no” or “the”.
The complicating factor is that as you can see, I have a field
“status_id”. This field lets me filter for profiles that are
published or draft in my CMS.

[…]

In fact, OR becomes the default search regardless of whether I use a
field in the query:

[…]

Again, before StandardAnalyzer, “AND” was the default so the first
“smith jones” query would have returned 0 as it should.

Any insight as to why this might be? I would prefer AND to be the default.

Then you shouldn’t override acts_as_ferret’s default behaviour by
using the completely unsupported and only internally used :ferret
option :)

I admit that this is a bug in how aaf handles its parameters, and I’ll
fix this. For the time being, however, you can use this statement,
which should work as intended:

acts_as_ferret({ :fields => { :name => {:boost => 10},
                              :type => {:boost => 2},
                              :email => {:boost => 10},
                              :bio => {:store => :no},
                              :status_id => {:boost => 1} },
                 :store_class_name => true,
                 :remote => true
               }, {
                 :analyzer => Ferret::Analysis::StandardAnalyzer.new([])
               })

Please note the difference: the analyzer option is part of a second
options hash.

The reason for this separation is that AAF more or less passes the last
hash directly to Ferret, while the first option hash is used for aaf
options Ferret itself doesn’t know about.
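
One way a bug like this can arise, sketched in plain Ruby: if a user-supplied options hash replaces a wrapper's internal defaults instead of being merged into them, any default the user did not repeat is silently lost. This is only a guess at the general mechanism, not aaf's actual code; every name below (`default_engine_options`, `:engine`, `:default_operator`) is hypothetical.

```ruby
# Hypothetical defaults a wrapper might hand to the underlying search library.
def default_engine_options
  { :default_operator => :and, :handle_parse_errors => true }
end

# Buggy variant: a user-supplied :engine hash replaces the defaults wholesale.
def configure_by_replacing(options)
  options.fetch(:engine, default_engine_options)
end

# Safer variant: user settings are merged on top of the defaults.
def configure_by_merging(options)
  default_engine_options.merge(options.fetch(:engine, {}))
end

user_options = { :engine => { :analyzer => :custom } }

p configure_by_replacing(user_options).key?(:default_operator)  # false
p configure_by_merging(user_options)[:default_operator]         # :and
```

In the replacing variant, setting only an analyzer would drop the AND default and fall back to whatever the library itself defaults to.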

However, I plan to rework this in the future so that your original
statement works correctly. Btw, where did you find that solution? I’ve
never seen the :ferret option being used outside aaf before.

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
removed_email_address@domain.invalid | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa


#3

On Tue, May 08, 2007 at 11:35:48AM +0200, Jens K. wrote:
[…]

I just committed a fix so that the above call should be working
correctly now. I’d go so far as to say that this should be the
preferred way of passing ferret options to aaf now. The two-hash
calling style I suggested below will still work of course, so nothing
should break.

Thoughts anyone?

Old calling style:

           })

Please note the difference: the analyzer option is part of a second
options hash.



#4

On 5/8/07, Jens K. wrote:

Hi!

However, I plan to rework this in the future so that your original
statement works correctly. Btw, where did you find that solution? I’ve
never seen the :ferret option being used outside aaf before.

Hi Jens,

Thank you for your fast response. I found this option by searching
through the aaf source code. There was a commented-out version of it
in act_methods.rb, in the acts_as_ferret() method.

I’ll try your latest change and let you know how it works.

Thanks again,

Doug


#5

On 5/8/07, Jens K. wrote:

Ferret::Analysis::StandardAnalyzer.new([]) }
} )

I just committed a fix so that the above call should be working
correctly now. I’d go so far as to say that this should be the
preferred way of passing ferret options to aaf now. The two-hash
calling style I suggested below will still work of course, so nothing
should break.

Hi Jens,

This is excellent. It works well in my initial testing. I think it’s
a great way to go.

Thanks for your great support,

Doug