Hi,

I have some questions about the way that documents are scored by the
Boolean and MultiTerm Queries, and about possible options for custom
scoring of articles. I am working on a project experimenting with
different methods of automatically generating queries, and the scoring
mechanisms behind Lucene and Ferret have been perplexing us.

From looking at the Lucene explanation at
(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html#formula_coord)
and through using the explain function in Ferret, it seems that the score
calculation for a Boolean query is (in LaTeX)

score = ( querynorm \times fieldnorm ) \sum_{term \in query}{idf_{term}^{2} tf_{term} boost_{term}}

and the calculation for the score of a document matching a MultiTerm
Query is

score = ( querynorm \times fieldnorm ) idf_{terms \in query}^{2} \sum_{term \in query}{tf_{term} boost_{term}}

I would like to implement something much simpler like

score = \sum_{term \in query}{tf_{term} boost_{term}}
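To make the comparison concrete, here is a rough Python sketch of the three formulas as I currently understand them. The function names and the sample tf, idf, and boost values are made up for illustration; this is not Ferret's or Lucene's actual code, and tf here just means whatever per-term value the formulas call tf.

```python
def boolean_score(terms, query_norm, field_norm):
    """Boolean query as I read it: each term's own idf is squared inside the sum."""
    return query_norm * field_norm * sum(
        t["idf"] ** 2 * t["tf"] * t["boost"] for t in terms
    )


def multi_term_score(terms, shared_idf, query_norm, field_norm):
    """MultiTerm query as I read it: one shared idf, squared outside the sum."""
    return query_norm * field_norm * shared_idf ** 2 * sum(
        t["tf"] * t["boost"] for t in terms
    )


def simple_score(terms):
    """The simpler score I would like: no norms and no idf at all."""
    return sum(t["tf"] * t["boost"] for t in terms)


# Hypothetical per-term values for a single matching document.
terms = [
    {"tf": 2.0, "idf": 1.5, "boost": 1.0},
    {"tf": 1.0, "idf": 3.0, "boost": 2.0},
]

print(boolean_score(terms, query_norm=1.0, field_norm=1.0))
print(multi_term_score(terms, shared_idf=2.0, query_norm=1.0, field_norm=1.0))
print(simple_score(terms))
```

With the norms set to 1, the only difference between the three is where (or whether) idf enters, which is the part I would like to remove.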

However, I’m not very familiar with C, and frankly, looking at the
scoring calculation in C inside Ferret terrified me. Would the pure Ruby
version of Ferret be a good place to try to make these changes? The
latest version of that code that I can find is 0.9.4 or so. What would
you recommend?

Also, do you know why Lucene (and Ferret) use idf squared instead of
just idf? That seems like a weird choice to me. Another sticking point is
that the method of calculating idf for the MultiTerm queries (the idf of
the sum of the df for every term in the query) doesn’t seem to make
sense. For example, with a query containing many common words, it is
possible that the sum of the dfs could be greater than the number of
documents in the index.
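Here is a small Python sketch of what worries me. It assumes the idf form documented for Lucene's default similarity, 1 + ln(numDocs / (docFreq + 1)), and that Ferret does the same; the index size and document frequencies are invented numbers.

```python
import math


def idf(doc_freq, num_docs):
    """Assumed default idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000                 # hypothetical index size
dfs = [900, 900, 900, 900]      # hypothetical dfs of four very common terms
summed_df = sum(dfs)            # 3600, more than the number of documents

per_term = [idf(df, num_docs) for df in dfs]
shared = idf(summed_df, num_docs)

print(per_term)   # each term's own idf
print(shared)     # idf computed from the summed df
```

Under that assumption the ratio inside the logarithm drops below 1 once the summed df exceeds the index size, so the shared idf falls below 1, and with these numbers it even goes negative before being squared.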

Many Thanks!

Eric

P.S. Let me know if the LaTeX equations are too obtuse, and I will try
to find another way to express sums in email.