Article score calculations for Boolean and MultiTerm Queries



I have some questions about the way that documents are scored by the Boolean
and MultiTerm Queries, and about possible options for custom scoring of
articles. I am working on a project experimenting with different methods of
automatically generating queries, and the scoring mechanisms behind Lucene
and Ferret have been perplexing us.

From looking at the Lucene explanation at (
and through using the explain function in Ferret, it seems that the score
calculation for a Boolean query is (in LaTeX)

score = ( querynorm \times fieldnorm ) \sum_{term \in query}{idf_{term}^{2} tf_{term} boost_{term}}

and the calculation for the score of a document matching a MultiTerm Query is

score = ( querynorm \times fieldnorm ) idf_{terms \in query}^{2} \sum_{term \in query}{tf_{term} boost_{term}}
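
To make sure I am reading the two formulas the same way, here is a rough
Python sketch of both. The idf function is an assumption on my part (the
Lucene-style 1 + ln(N / (df + 1))); I have not checked it against Ferret's C
source, and the function names and the (tf, df, boost) tuple layout are made
up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed Lucene-style idf; not verified against Ferret's C code.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def boolean_score(terms, num_docs, querynorm=1.0, fieldnorm=1.0):
    # terms: list of (tf, df, boost) tuples, one per matching query term.
    # Each term is weighted by its own idf squared.
    total = sum(idf(df, num_docs) ** 2 * tf * boost for tf, df, boost in terms)
    return querynorm * fieldnorm * total

def multiterm_score(terms, num_docs, querynorm=1.0, fieldnorm=1.0):
    # One shared idf, computed from the sum of every term's df.
    shared_idf = idf(sum(df for _, df, _ in terms), num_docs)
    total = sum(tf * boost for tf, _, boost in terms)
    return querynorm * fieldnorm * shared_idf ** 2 * total
```

With a single term the two functions give the same value; they only diverge
once the query has more than one term.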

I would like to implement something much simpler like

score = \sum_{term \in query}{tf_{term} boost_{term}}
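
In case the notation is unclear, this is all I mean, as a small Python sketch.
The function name and the dictionary inputs are hypothetical, and I am taking
tf to be the raw count of the term in the document, which may not be how
Ferret defines it internally.

```python
def simple_score(term_freqs, boosts):
    """Sum tf * boost over the query terms.

    term_freqs: {term: raw count of the term in the document} (assumed input)
    boosts:     {term: boost given to the term in the query}  (assumed input)
    Query terms that do not occur in the document contribute 0.
    """
    return sum(term_freqs.get(term, 0) * boost for term, boost in boosts.items())
```

For example, `simple_score({"ferret": 3, "lucene": 1}, {"ferret": 2.0, "ruby": 1.0})`
only counts "ferret", since "ruby" is absent from the document and "lucene" is
not in the query.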

However, I’m not very familiar with C, and frankly, looking at the
scoring calculation in C inside Ferret terrified me. Would the pure Ruby
version of Ferret be a good place to try to make these changes? The
version of that code that I can find is 0.9.4 or so. What would you recommend?

Also, do you know why Lucene (and Ferret) use idf squared instead of plain
idf? That seems like a weird choice to me. Another sticking point is that
the method of calculating idf for the MultiTerm queries (the idf of the sum
of the df for every term in the query) didn’t seem to make sense. For
example, with a query containing many common words, it is possible that the
sum of the df’s could be greater than the number of documents in the index.
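
Here is a small Python sketch of the case that worries me. The idf function
is again my assumption (Lucene-style 1 + ln(N / (df + 1))), and the document
counts are made-up numbers, not taken from a real index.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed Lucene-style idf; not verified against Ferret's C code.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 1000                 # hypothetical index size
doc_freqs = [900, 950, 980]     # three very common terms (made-up values)

summed_df = sum(doc_freqs)      # larger than num_docs
combined_idf = idf(summed_df, num_docs)

print("sum of df:", summed_df, "of", num_docs, "documents")
print("idf of summed df:", combined_idf)
```

Under that assumed idf, each term on its own has an idf above 1, but the idf
of the summed df comes out slightly negative, which is hard to interpret as
an inverse document frequency. Since the formula squares it, the sign would
disappear in the final score anyway.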

Many Thanks!

P.S. Let me know if the LaTeX equations are too obtuse, and I will try to find
another way to express sums in email.