Scoring/similarity, biased towards small fields?

colin · September 27, 2006, 12:13am

Lucene, and perhaps most search engines, is biased towards small fields
with little content (where the relative term frequency is therefore
higher). Lucene has the option to define a custom Similarity class to
calculate the similarity for a field across different documents (custom
calculation of lengthNorm and tf). But how do I do this in Ferret? (I
know how to boost a field, but I don't think that is what I need; I need
to be able to influence the relative importance of the same field
between documents.)

colin · September 27, 2006, 12:16am

Forgot to say, Ferret seems to be really amazing, especially considering
how much it has been improved in the last couple of months!

colin · September 27, 2006, 11:30am

Thanks for answering! I couldn’t find anything relevant in the
docs/API; now I know not to look for that functionality in the Ruby API
again.

Actually, boosting doesn’t really help in my case. I use Lucene to index
some articles with bodies of variable length. Whether a word occurs in a
short or a long article, the article is supposed to be equally relevant.
(Of course, words occurring in title fields should make the result more
important; for that there is boosting, and this bias towards short
fields.)

But it’s only a small issue; maybe I’ll start digging through the
source code sometime to see if I can add it.

colin · September 27, 2006, 7:30am

On 9/27/06, Colin Cc [email protected] wrote:

Lucene, and perhaps most search engines, is biased towards small fields
with little content (where the relative term frequency is therefore
higher). Lucene has the option to define a custom Similarity class to
calculate the similarity for a field across different documents (custom
calculation of lengthNorm and tf). But how do I do this in Ferret? (I
know how to boost a field, but I don't think that is what I need; I need
to be able to influence the relative importance of the same field
between documents.)

Hi Colin,

Ferret uses the same similarity scoring as Lucene. Scoring is based
more on the ratio of number of matches to the length of the field,
rather than just the length of the field. So a small field with a
single match will score higher than a large field with a single match.
But a large field with many matches may still score more highly than a
small field with a single match.
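
To make the trade-off above concrete, here is a rough sketch in plain Ruby. It assumes the classic Lucene default formulas, tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms), and ignores idf, coord, the query norm and the lossy one-byte encoding of norms. The method names are made up for illustration and are not part of the Ferret API.

```ruby
# Simplified per-field score factors, assuming Lucene's classic defaults.
# These helpers are illustrative only; they are not Ferret methods.

# Term-frequency factor: grows with the square root of the match count.
def tf(freq)
  Math.sqrt(freq)
end

# Length normalisation: shrinks with the square root of the field length.
def length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

# Combined factor for one term in one field, with an optional boost.
def field_score(freq, num_terms, boost = 1.0)
  tf(freq) * length_norm(num_terms) * boost
end

short_single = field_score(1, 100)       # 1 match in a 100-term field
long_single  = field_score(1, 10_000)    # 1 match in a 10,000-term field
long_many    = field_score(400, 10_000)  # 400 matches in a 10,000-term field

puts "short field, single match: #{short_single}"
puts "long field, single match:  #{long_single}"
puts "long field, many matches:  #{long_many}"
```

Under these assumptions the short field with one match outranks the long field with one match, while the long field with many matches outranks both.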

The Similarity class is still unavailable in the Ruby API and it isn’t
high on my list of priorities to write the bindings for it (unless
someone was willing to compensate me). However, I don’t think you need
it for what you are describing. Boosts should do the job perfectly. If
you want to make the :title field more important than the :content
field then you set the boost of the :title FieldInfo, probably like
this:

require 'ferret'

fi = Ferret::Index::FieldInfos.new
fi.add_field(:title, :boost => 10.0)

But I think you want to make the same field more important in
different documents. So you can set the boost of the field when you
add it. You can either set the boost for the whole document:

doc = Ferret::Document.new(20.0)
doc[:title] = "Braveheart"
doc[:actors] = ["Mel Gibson", "Sophie Marceau"]

This will affect all fields in the document. Or you can set the boost
of the field directly.

doc = {
    :title => Ferret::Field.new("Legally Blonde", 0.02),
    :actors => Ferret::Field.new(["Reese Witherspoon", "Luke Wilson"], 2.0)
}
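
If the aim is for short and long articles to count equally, one possible workaround is to choose the per-field boost so that it roughly cancels the length normalisation. The sketch below assumes that the norm is 1/sqrt(number of terms) and that the field boost is simply multiplied into it. The helper name is hypothetical, splitting on whitespace only approximates what an analyzer would produce, and because norms are stored with very little precision the cancellation would be approximate at best.

```ruby
# Hypothetical helper, not part of Ferret: a boost that offsets an assumed
# length norm of 1/sqrt(num_terms). Whitespace splitting stands in for the
# real analyzer's token count.
def length_compensating_boost(text)
  num_terms = text.split.size
  num_terms.zero? ? 1.0 : Math.sqrt(num_terms)
end

short_body = "ferret is fast"
long_body  = (["ferret"] + ["filler"] * 399).join(" ")

[short_body, long_body].each do |body|
  num_terms = body.split.size
  boost     = length_compensating_boost(body)
  # Effective norm under the stated assumption: boost * 1/sqrt(num_terms).
  effective = boost / Math.sqrt(num_terms)
  puts format("%4d terms: boost %.2f, effective norm %.2f",
              num_terms, boost, effective)
  # With Ferret loaded, the boost would go in as the second argument:
  #   doc = { :body => Ferret::Field.new(body, boost) }
end
```

Note that this only evens out the length factor; a field with more matches would still score higher through the term-frequency factor.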

Hope that helps,
Cheers,
Dave