Index returns all results for specific queries


#1

Hey all,

I’m getting some really weird results when searching documents. It
seems to be somehow related to the document format I’m using.

I wrote a small script to replicate it:

################
#!/usr/bin/ruby

require 'rubygems'
require 'ferret'
include Ferret
index = Index::Index.new(:path => '/tmp/fooindex', :key => :id)

# dummy data

index << {:visibility=>"private", :type=>"media", :title=>"example title",
          :owner=>"user/3003", :author=>"user/3003",
          :description=>"description example", :id=>"user/3003/media/1"}
index << {:visibility=>"private", :type=>"media", :title=>"a new title",
          :owner=>"user/3003", :author=>"user/3003",
          :description=>"more foo desc", :id=>"user/3003/media/2"}
index << {:visibility=>"private", :type=>"media", :title=>"random title",
          :owner=>"user/3003", :author=>"user/3003",
          :description=>"random description", :id=>"user/3003/media/4"}
index << {:visibility=>"private", :type=>"media", :title=>"random title",
          :owner=>"user/3003", :author=>"user/3003",
          :description=>"random description", :id=>"user/3003/media/5"}

# print every document that matches the query given on the command line
index.search_each(ARGV.shift) { |doc, score|
  puts index[doc].load.inspect
}
################

The following queries are returning all the results currently in the
index:

$ ruby script.rb "title:me"
{:author=>"user/3003", :description=>"description example",
:visibility=>"private", :id=>"user/3003/media/1", :title=>"example title",
:type=>"media", :owner=>"user/3003"}
… (remaining results)

$ ruby script.rb "title:my"
(same as above)

And weirdly enough, the following

$ ruby script.rb "title:mo"

won’t return anything. There are more variants of that, but I think you
get my meaning.

The following works properly:

$ ruby script.rb "title:random"
(returns the two results that contain “random” in the title, which is
the expected result)

Is there something I’m missing? It doesn’t make sense to me
that the queries above should return all the results in the index,
especially considering they don’t actually match anything.

Any help is appreciated. Thanks.


#2

On 3/13/07, Julio Cesar O. removed_email_address@domain.invalid wrote:

Thanks for including the script. It makes my job much easier. :)

And weirdly enough, the following

$ ruby script.rb "title:mo"

won’t return anything. There are more variants of that, but I think you
get my meaning.

The problem is that ‘me’ and ‘my’ are stop words. When they get
removed the query becomes ‘title:’ which is invalid. By default Ferret
catches query parse exceptions and attempts to parse the query as a
simple boolean term query, removing all special characters, so this
query then becomes ‘title’. Since title can be found in the title
field for all documents, all documents are returned. So I don’t think
this is a bug but it is definitely undesired behaviour. I’ll try and
think of a better way to parse this.
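
To make the chain of events concrete, here is a toy model in plain Ruby. It is not Ferret's parser: the stop-word list is a made-up subset, the helper names (`analyze`, `parse`, `search`) are hypothetical, and only the title field is searched. It just mimics the steps described above, assuming the fallback strips special characters and then re-analyzes what is left.

```ruby
# Toy model only. This is NOT Ferret's query parser.
STOP_WORDS = %w[me my a the of]  # hypothetical subset of a stop-word list

# Titles from the script above, keyed by document id.
DOCS = {
  'user/3003/media/1' => 'example title',
  'user/3003/media/2' => 'a new title',
  'user/3003/media/4' => 'random title',
  'user/3003/media/5' => 'random title'
}

# Lowercase the text, split it into words, and drop stop words.
def analyze(text, stop_words)
  text.downcase.scan(/[a-z0-9]+/).reject { |t| stop_words.include?(t) }
end

# Turn a "field:text" query into the list of terms to look up.
def parse(query, stop_words)
  _field, text = query.split(':', 2)
  terms = analyze(text.to_s, stop_words)
  return terms unless terms.empty?
  # Nothing is left after the colon ("title:"), so the strict parse fails.
  # Fallback: strip special characters and treat the rest as plain terms.
  analyze(query.gsub(/[^A-Za-z0-9\s]/, ' '), stop_words)
end

# Return the ids of documents whose title contains any of the query terms.
def search(query, stop_words = STOP_WORDS)
  terms = parse(query, stop_words)
  DOCS.select { |_id, title| (analyze(title, stop_words) & terms).any? }.keys
end

puts search('title:me').size       # "me" is dropped, fallback searches for "title"
puts search('title:mo').inspect    # "mo" is a normal term that matches nothing
puts search('title:random').size   # ordinary term query
puts search('title:me', []).inspect  # with no stop words, "me" is a normal term
```

In this model the first query matches every document because the fallback term "title" appears in every title, while the last call shows why an empty stop-word list avoids the problem.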

In the meantime, you may want to think about changing the stopword
list or removing stopwords altogether to prevent this problem from
occurring.


#3

Thanks David,

I instantiated a StandardAnalyzer and passed an empty array for stop
words, and it did the trick.

If anyone wants to comment on what I’m losing by doing this, it would
be much appreciated.