Not understanding search results

sithadmin · March 31, 2007, 6:42pm

I’m getting some results that I don’t understand from a search.

The code, based on the tutorial, and the results are below.

Everything makes sense to me, except the results for
the ‘title:“Some”’ query. I would think that it should
match the first two documents, but not the third.

What am I missing here?

Thanks for any help!

--- code ---------------------------------------------------

require 'ferret'

# Run a query and print each hit with its score, uid and title.
def query(index, query_str)
  puts("Query '#{query_str}'...")
  index.search_each(query_str) do |id, score|
    puts(" id=#{id} score=#{score} uid=#{index[id][:uid]} title='#{index[id][:title]}'")
  end
end

index = Ferret::Index::Index.new

index << {:uid => 'one', :title => 'Some Title', :content => 'my first text'}
index << {:uid => 'two', :title => 'Some Title', :content => 'some second content'}
index << {:uid => 'three', :title => 'Other Title', :content => 'my third text'}

query(index, 'content:"text"')
query(index, 'content:"some"')
query(index, 'title:"Some"')
query(index, 'title:"Title"')
query(index, 'uid:"two"')

--- results ------------------------------------------------

Query 'content:"text"'...
 id=0 score=0.625 uid=one title='Some Title'
 id=2 score=0.625 uid=three title='Other Title'
Query 'content:"some"'...
 id=1 score=0.125318586826324 uid=two title='Some Title'
Query 'title:"Some"'...
 id=0 score=0.0554137788712978 uid=one title='Some Title'
 id=1 score=0.0554137788712978 uid=two title='Some Title'
 id=2 score=0.0554137788712978 uid=three title='Other Title'
Query 'title:"Title"'...
 id=0 score=0.712317943572998 uid=one title='Some Title'
 id=1 score=0.712317943572998 uid=two title='Some Title'
 id=2 score=0.712317943572998 uid=three title='Other Title'
Query 'uid:"two"'...
 id=1 score=1.0 uid=two title='Some Title'

sithadmin · March 31, 2007, 8:48pm

On Mar 31, 2007, at 10:41 AM, Andreas K. wrote:

> @David: You should probably consider changing StandardAnalyzer not to
> use stop words by default. It confuses people because no one would
> suspect such a feature to be enabled by default. It just doesn’t
> follow the principle of least astonishment.
>
> Even if people want to use stop words, they might not be happy with
> the ones built into Ferret. It very much depends on the nature of the
> content that is indexed, and instead of using a one-size-fits-all stop
> word list, one is usually better off compiling a custom one for
> any particular application.

I concur. Ferret’s StandardAnalyzer is based upon Lucene’s class of
the same name, so some parallelism would be lost, but I think
omitting stop lists is better nonetheless.

There are performance and disk-space implications for avoiding stop
lists by default. However, disk space is cheap, Ferret is fast, and
search results are slightly better when you avoid stop lists (e.g.
searching for ‘“the who”’ actually returns something). Users with
large deployments will be able to trade away some amount of IR
precision for increased performance by enabling stop lists if they so
choose.
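
To make the trade-off concrete, here is a rough sketch in plain Ruby of an analyzer whose stop list is optional. It does not use Ferret or KinoSearch at all; `tokens_for` and `SAMPLE_STOP_WORDS` are made-up names, and the sample list is an assumption chosen for illustration, not the list that either library ships.

```ruby
require 'set'

# Hypothetical stop list for illustration only (not Ferret's built-in list).
SAMPLE_STOP_WORDS = %w[the who a of].freeze

# Toy analyzer: lowercase, split into words, drop any configured stop words.
# With the default (empty) stop list every token is kept.
def tokens_for(text, stop_words: [])
  stops = Set.new(stop_words)
  text.downcase.scan(/\w+/).reject { |token| stops.include?(token) }
end

p tokens_for('The Who')                                 # => ["the", "who"]
p tokens_for('The Who', stop_words: SAMPLE_STOP_WORDS)  # => []
```

With the stop list enabled, nothing is left of the phrase to search for; without it, both terms survive at the cost of indexing very common words.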

KinoSearch doesn’t have a StandardAnalyzer; a class called
PolyAnalyzer fills that role. By default, it performs lowercasing,
tokenizing and stemming – but no stopalizing.
<http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Analysis/PolyAnalyzer.html>

Marvin H.
Rectangular Research
http://www.rectangular.com/

sithadmin · April 1, 2007, 12:13pm

On Sat, Mar 31, 2007 at 07:41:06PM +0200, Andreas K. wrote:

> third text'}
> Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of
> (english) stop words.
>
> In the case of “title:Some”, “Some” is removed by the analyzer giving
> only “title:”, i.e. an empty query which (surprisingly) matches all
> documents.
>
> However, the same should happen with “content:some” but this one
> returns only one document which leaves me completely puzzled. This
> just isn’t consistent.

Adding the output of index.process_query to the script, I get:

Query 'content:"some"'...
processed to <title:content uid:content content:content>
Query 'title:"Some"'...
processed to <title:title uid:title content:title>

So it seems the stop word is stripped first, then the query is
recognized as invalid, and the parser does its best to run it anyway:
it takes the remaining word that once was the field name and interprets
it as the query string, searched across all fields.

Setting handle_parse_errors to false turns this behaviour off and leads
to no results for the empty queries.
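
For anyone who wants to see that fallback in isolation, here is a toy model in plain Ruby. It is only a sketch of the behaviour described above, not Ferret's actual parser: `analyze`, `process_query` and `search` are hand-written stand-ins, the `handle_parse_errors:` keyword just mimics the option of the same name, and `STOP_WORDS` is a short made-up list (the real one is Ferret::Analysis::FULL_ENGLISH_STOP_WORDS).

```ruby
require 'set'

# Hypothetical, abbreviated stop list for illustration only.
STOP_WORDS = Set.new(%w[some my the a])

FIELDS = [:title, :uid, :content].freeze

DOCS = [
  { uid: 'one',   title: 'Some Title',  content: 'my first text' },
  { uid: 'two',   title: 'Some Title',  content: 'some second content' },
  { uid: 'three', title: 'Other Title', content: 'my third text' }
].freeze

# Toy analyzer: lowercase, split into words, drop stop words.
def analyze(text)
  text.downcase.scan(/\w+/).reject { |token| STOP_WORDS.include?(token) }
end

# Toy parser for queries of the form field:"text".
# Returns a list of [field, term] clauses.
def process_query(query_str, handle_parse_errors: true)
  field, text = query_str.split(':', 2)
  terms = analyze(text.to_s)
  return terms.map { |term| [field.to_sym, term] } unless terms.empty?
  return [] unless handle_parse_errors

  # Nothing survived analysis: fall back to treating the field name
  # itself as a search term in every field.
  FIELDS.map { |f| [f, field.downcase] }
end

# Return the ids of all documents matching at least one clause.
def search(query_str, handle_parse_errors: true)
  clauses = process_query(query_str, handle_parse_errors: handle_parse_errors)
  DOCS.each_index.select do |id|
    clauses.any? { |field, term| analyze(DOCS[id][field]).include?(term) }
  end
end

p search('title:"Some"')                              # => [0, 1, 2]
p search('content:"some"')                            # => [1]
p search('title:"Some"', handle_parse_errors: false)  # => []
```

In this model 'title:"Some"' ends up searching for the word "title", which all three titles contain, while 'content:"some"' ends up searching for "content", which only appears in the second document. That matches the result sets in the original post.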

Jens

–
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

sithadmin · April 1, 2007, 4:58pm

At 2007-04-01 06:12, you wrote:

> processed to <title:title uid:title content:title>
>
> so it seems the stop word is stripped first, then the query is
> recognized as invalid, and the parser does its best to run it anyway -
> it takes the remaining word that once was the field name, and interprets
> it as the query string.
>
> Setting handle_parse_errors to false turns this behaviour off and leads
> to no results for the empty queries.

That explains it all.

Thanks much!

sithadmin · March 31, 2007, 7:46pm

On Mar 31, 2007, at 5:36 PM, Jeff M. wrote:

> query(index, 'title:"Title"')
> query(index, 'uid:"two"')

Nice one.

When people don’t understand search results, it usually has to do with
stop words. The StandardAnalyzer, which parses documents and (!)
queries, uses a list of stop words that are ignored. See
Ferret::Analysis::FULL_ENGLISH_STOP_WORDS for a complete list of
(English) stop words.

In the case of “title:Some”, “Some” is removed by the analyzer giving
only “title:”, i.e. an empty query which (surprisingly) matches all
documents.

However, the same should happen with “content:some” but this one
returns only one document which leaves me completely puzzled. This
just isn’t consistent.

So I’m afraid I can’t be of much help here, but I’m sure somebody
else will enlighten us. This might well be a bug, but even if it’s
not, it’s definitely not what anyone would reasonably expect.

–

@David: You should probably consider changing StandardAnalyzer not to
use stop words by default. It confuses people because no one would
suspect such a feature to be enabled by default. It just doesn’t
follow the principle of least astonishment.

Even if people want to use stop words, they might not be happy with
the ones built into Ferret. It very much depends on the nature of the
content that is indexed, and instead of using a one-size-fits-all stop
word list, one is usually better off compiling a custom one for
any particular application.

Cheers,
Andy