Missing terms in index causing search errors

rami · August 20, 2006, 6:53am

I am unable to find results for models when one or more of the terms are
not being indexed.

Let’s suppose I index a User on the phrase “Ruby on Rails.” If I then
search using User.find_by_contents("Ruby on Rails") I get no results,
since “on” is a common term and does not get indexed. Of course,
User.find_by_contents("Ruby R.") works just fine.

I would like to find a way to search for terms such as “Ruby on Rails”
and have the query analyzer automatically ignore tokens (i.e., “on”) that
the indexer would normally skip. Any thoughts on how to go about
solving this?

Rami

rami · August 22, 2006, 12:24am

On Sun, Aug 20, 2006 at 06:53:30AM +0200, Rami wrote:

> I am unable to find results for models when one or more of the terms are
> not being indexed.
>
> Let’s suppose I index a User on the phrase “Ruby on Rails.” If I then
> search using User.find_by_contents("Ruby on Rails") I get no results,
> since “on” is a common term and does not get indexed. Of course,
> User.find_by_contents("Ruby R.") works just fine.

This shouldn’t happen. Do you build your index through acts_as_ferret?

The cause of your problem seems to be that the analyzer used for query
parsing differs from the one that was used for building the index.
Queries should usually be analyzed the same way as the indexed content
to avoid this kind of problem.
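
To make the mismatch concrete, here is a minimal pure-Ruby sketch. It is not Ferret code: `ToyIndex`, `TOY_STOP_WORDS`, `STOP_ANALYZER` and `PLAIN_ANALYZER` are names made up for illustration, and the stop-word list is an assumption, not Ferret's actual list. It only shows why a query analyzed differently from the indexed content can miss a document when every term is required.

```ruby
require 'set'

# Assumed stop-word list for illustration only (not Ferret's real list).
TOY_STOP_WORDS = Set.new(%w[a an and or on the of to in]).freeze

# Lowercases, splits on non-letters and drops stop words.
STOP_ANALYZER = lambda do |text|
  text.downcase.scan(/[a-z]+/).reject { |t| TOY_STOP_WORDS.include?(t) }
end

# Lowercases and splits, but keeps every token, stop words included.
PLAIN_ANALYZER = lambda do |text|
  text.downcase.scan(/[a-z]+/)
end

# Toy inverted index: maps each term to the set of document ids containing it.
class ToyIndex
  def initialize(analyzer)
    @analyzer = analyzer
    @postings = Hash.new { |h, k| h[k] = Set.new }
  end

  def add(doc_id, text)
    @analyzer.call(text).each { |term| @postings[term] << doc_id }
  end

  # Every query term is required, similar in spirit to Occur::MUST.
  def search(query, analyzer = @analyzer)
    terms = analyzer.call(query)
    return [] if terms.empty?
    terms.map { |t| @postings.fetch(t, Set.new) }.reduce(:&).to_a
  end
end

index = ToyIndex.new(STOP_ANALYZER)
index.add(1, 'Ruby on Rails')

p index.search('Ruby on Rails')                 # => [1]
p index.search('Ruby on Rails', PLAIN_ANALYZER) # => []
```

With the same analyzer on both sides, "on" is dropped from the query just as it was dropped at index time. With the mismatched analyzer, "on" becomes a required term that was never indexed, so nothing matches.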

> I would like to find a way to search for terms such as “Ruby on Rails”
> and have the query analyzer automatically ignore tokens (i.e., “on”) that
> the indexer would normally skip. Any thoughts on how to go about
> solving this?

Try to specify an analyzer in your call to acts_as_ferret:

  acts_as_ferret( { :fields => [ … field list, may be a hash, too ] },
                  { :analyzer => Ferret::Analysis::StopAnalyzer.new } )

Please let me know if this helps.

Jens

--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

rami · August 22, 2006, 9:35am

> Let’s suppose I index a User on the phrase “Ruby on Rails.” If I then
> search using User.find_by_contents("Ruby on Rails") I get no results,
> since “on” is a common term and does not get indexed. Of course,
> User.find_by_contents("Ruby R.") works just fine.
>
> This shouldn’t happen. Do you build your index through acts_as_ferret?

Hey,

I had the same problem, using Ferret directly rather than
acts_as_ferret. The stop words are described here:
http://ferret.davebalmain.com/api/classes/Ferret/Analysis/StopAnalyzer.html

What I do is remove all stop words from the query before searching:

  def self.filter_stop_words( q )
    query = q.split(" ")
    query.delete_if { |w| Indexer::STOP_WORDS.include?( w.downcase ) }.join(" ")
  end
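
For anyone who wants to try this filter outside my code, here is a self-contained sketch. `QUERY_STOP_WORDS` is a hypothetical stand-in for `Indexer::STOP_WORDS` with an assumed word list (not Ferret's actual one), and the method is written as a plain top-level method rather than a class method.

```ruby
# Hypothetical stand-in for Indexer::STOP_WORDS; the word list is an assumption.
QUERY_STOP_WORDS = %w[a an and or on the of to in].freeze

# Removes stop words from a whitespace-separated query string.
# Comparison is case-insensitive; the remaining words keep their original case.
def filter_stop_words(query)
  query.split(' ')
       .reject { |word| QUERY_STOP_WORDS.include?(word.downcase) }
       .join(' ')
end

p filter_stop_words('Ruby on Rails') # => "Ruby Rails"
p filter_stop_words('to the of')     # => ""
```

Note that a query made up only of stop words collapses to an empty string, so the caller has to decide what to do in that case. Words with attached punctuation (for example "on,") are not recognised by this simple split.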

Ben

rami · August 22, 2006, 8:31pm

On 8/22/06, Benjamin K. wrote:

> i had the same problem… using ferret, not acts_as_ferret… the stopwords
> […]
>
> Ben

Hi Ben,

This shouldn’t be necessary. What Jens said is correct. If you use the
same analyzer in your indexer as you use in your query parser then a
search for “Ruby on Rails” should work. If you use the Index::Index
class this will be handled for you.

Cheers,
Dave

rami · August 23, 2006, 12:06am

On Wed, Aug 23, 2006 at 03:30:46AM +0900, David B. wrote:

> On 8/22/06, Benjamin K. wrote:
>
> Let’s suppose I index a User on the phrase “Ruby on Rails.” If I then
> search using User.find_by_contents("Ruby on Rails") I get no results,
> since “on” is a common term and does not get indexed. Of course,
> User.find_by_contents("Ruby R.") works just fine.
>
> […]
>
> This shouldn’t be necessary. What Jens said is correct. If you use the
> same analyzer in your indexer as you use in your query parser then a
> search for “Ruby on Rails” should work. If you use the Index::Index
> class this will be handled for you.

As this problem seems to have come up fairly often recently, I ran some
tests and I think I found a common pattern that leads to incorrect query
analysis when using the Index::Index class:

def test_stopwords
  i = Ferret::Index::Index.new(
    :occur_default => Ferret::Search::BooleanClause::Occur::MUST,
    :default_search_field => '*')
  d = Ferret::Document::Document.new

  # adding this additional field to the document leads to failure below
  # comment out this statement and all tests pass:
  d << Ferret::Document::Field.new('id', '1',
                                   Ferret::Document::Field::Store::YES,
                                   Ferret::Document::Field::Index::UNTOKENIZED)

  d << Ferret::Document::Field.new('content', 'Move or shake',
                                   Ferret::Document::Field::Store::NO,
                                   Ferret::Document::Field::Index::TOKENIZED,
                                   Ferret::Document::Field::TermVector::NO,
                                   false, 1.0)
  i << d
  hits = i.search 'move nothere shake'
  assert_equal 0, hits.size
  hits = i.search 'move shake'
  assert_equal 1, hits.size
  hits = i.search 'move or shake'
  assert_equal 1, hits.size # fails when id field is present
end

The id field is constructed just like we do it in aaf. I tried some
variations of the way the field is constructed (another name, other
flags), but as soon as there is more than one field, the test no longer
passes.

Setting the default_search_field to 'content' makes the tests pass, btw.

Dave, any suggestions?

Jens


rami · August 23, 2006, 7:32am

On 8/23/06, Jens K. wrote:

> The id field is constructed just like we do it in aaf. I tried some
> variations of the way the field is constructed (another name, other
> flags), but as soon as there is more than one field, the test no longer
> passes.
>
> Setting the default_search_field to 'content' makes the tests pass, btw.
>
> Dave, any suggestions?

Thanks Jens,

This was a bug after all, and it was very easy to find and fix with your
example/bug report. Thanks. I’ve just put out a gem, version 0.9.6.
This will be compatible with acts_as_ferret. I’ll try to find time to
write a patch for acts_as_ferret to work with 0.10.0, but hopefully
you’ll beat me to it. The documentation is a little more thorough than
in previous versions of Ferret, but it still requires a bit of work,
especially considering there is no longer any Ruby source to work
from. Let me know if you have any questions.

Cheers,
Dave

rami · August 22, 2006, 9:17pm

Hey…

> This shouldn’t be necessary. What Jens said is correct. If you use the
> same analyzer in your indexer as you use in your query parser then a
> search for “Ruby on Rails” should work. If you use the Index::Index
> class this will be handled for you.

I do not use any ‘non-default’ analyzers yet, but I still got the
problem. I even had the case where I wanted to search for a phrase that
was built completely from stop words, and it did not find anything.

However, I will give it a second look with 0.10; maybe I missed
something.

Ben

rami · August 23, 2006, 11:31am

On Wed, Aug 23, 2006 at 02:30:56PM +0900, David B. wrote:

Works great, thanks for the quick fix. I’ll start working on a
0.10.0-compatible version of aaf now and will keep you up to date on my
progress.

The latest (and last) aaf version to work with Ferret 0.9.x series is
0.2.3, located at

svn://projects.jkraemer.net/acts_as_ferret/tags/0.2.3

Please note the changed base URL, I decided to leave out the ‘plugin’
directory below ‘tags’ from now on.

Jens
