Article score calculations for Boolean and MultiTerm Queries

soderic · July 10, 2007, 4:06pm

Hi,

I have some questions about the way that documents are scored by the
Boolean and MultiTerm Queries, and about possible options for custom
scoring of articles. I am working on a project experimenting with
different methods of automatically generating queries, and the scoring
mechanisms behind Lucene and Ferret have been perplexing us.

From looking at the Lucene explanation at (
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html#formula_coord)
and through using the explain function in Ferret, it seems that the
score calculation for a Boolean query is (in LaTeX)

score = (querynorm \times fieldnorm) \sum_{term \in query}{idf_{term}^{2} tf_{term} boost_{term}}

and the calculation for the score of a document matching a MultiTerm
Query is

score = (querynorm \times fieldnorm) idf_{terms \in query}^{2} \sum_{term \in query}{tf_{term} boost_{term}}
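
To make the difference between the two formulas concrete, here is a
minimal Python sketch that simply transcribes them as written above. It
is not Ferret's or Lucene's actual code; the function names and the
tuple layout are made up for illustration, and the idf values are passed
in rather than computed.

```python
def boolean_score(terms, query_norm, field_norm):
    """Boolean query: each term contributes its own idf^2 * tf * boost.

    terms is a list of (idf, tf, boost) tuples, one per matching term
    (hypothetical layout, chosen for this example only).
    """
    total = sum(idf ** 2 * tf * boost for idf, tf, boost in terms)
    return query_norm * field_norm * total


def multi_term_score(terms, shared_idf, query_norm, field_norm):
    """MultiTerm query: one shared idf^2 factor outside the sum.

    terms is a list of (tf, boost) tuples; shared_idf is the single idf
    value used for the whole set of terms.
    """
    total = sum(tf * boost for tf, boost in terms)
    return query_norm * field_norm * shared_idf ** 2 * total
```

The only structural difference is whether idf squared sits inside the
sum (one value per term) or outside it (one value for the whole query).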

I would like to implement something much simpler like

score = \sum_{term \in query}{tf_{term} boost_{term}}

However, I’m not incredibly familiar with C, and frankly looking at the
scoring calculation in C inside Ferret terrified me. Would the pure Ruby
version of Ferret be a good place to try to make these changes? The
latest version of that code that I can find is 0.9.4 or so. What would
you recommend?
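
For reference, the simpler score proposed above (the sum of tf times
boost over the query terms) can be sketched in a few lines of Python.
This is only an illustration of the formula, not a Ferret patch; the
dictionary-based representation of the query and the document is an
assumption made for the example.

```python
def simple_score(query_boosts, doc_term_freqs):
    """Sum tf * boost over the query terms.

    query_boosts:   dict mapping each query term to its boost
    doc_term_freqs: dict mapping each term in the document to its tf
    Query terms that do not occur in the document count as tf = 0.
    """
    return sum(
        doc_term_freqs.get(term, 0) * boost
        for term, boost in query_boosts.items()
    )


query = {"lucene": 2.0, "ferret": 1.0, "ruby": 1.0}
doc = {"lucene": 3, "ferret": 1, "search": 5}
score = simple_score(query, doc)  # 3*2.0 + 1*1.0 + 0*1.0
```

Note that this drops the idf factor and both normalization factors, so
long documents and very common terms are no longer penalized.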

Also, do you know why Lucene (and Ferret) use idf squared instead of
just idf? That seems like a weird choice to me. Another sticking point
is that the method of calculating idf for the MultiTerm queries (the idf
of the sum of the df for every term in the query) doesn’t seem to make
sense. For example, with a query containing many common words it is
possible that the sum of your df’s could be greater than the number of
documents in the index.
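
A small Python sketch of the concern: it uses the idf formula from
Lucene's default Similarity, idf = 1 + ln(numDocs / (docFreq + 1)), and
assumes (this is not confirmed here) that Ferret uses the same form.
The function names and the example numbers are made up for
illustration.

```python
import math


def idf(doc_freq, num_docs):
    """idf as in Lucene's default Similarity: 1 + ln(N / (df + 1))."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def multi_term_idf(doc_freqs, num_docs):
    """idf of the summed df, as described for MultiTerm queries above."""
    return idf(sum(doc_freqs), num_docs)


num_docs = 100
# Three common terms: the summed df (210) exceeds the index size (100).
common = multi_term_idf([80, 70, 60], num_docs)
# Even more common terms: the summed df (275) pushes the idf below zero.
very_common = multi_term_idf([95, 90, 90], num_docs)
```

With this formula a single term can never have an idf much below 1
(its df is at most the number of documents), but the summed df has no
such bound, so the shared idf can fall below 1 and even turn negative.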

Many Thanks!
Eric

P.S. Let me know if the LaTeX equations are too obscure, and I will try
to find another way to express sums in email.