Recalculating the score

kuzb · July 4, 2006, 3:19pm

Hey …

I’m using ferret to index various objects and i’m creating a
Ferret::Document for each of these objects. Indexing and searching is
working fine.

Each of these Ferret::Documents has a ‘relevance’ field, storing an
integer that says how relevant this object is for the search. The
‘relevance’ is in the range of 1 to 10.

Now i would like to multiply the relevance of the document with the
score, and sort the results by that.

e.g.:
A document with a score of 0.82 and a relevance of 3 should have a final
score of 2.46
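
To make the arithmetic concrete, here is a minimal sketch in plain Ruby. It makes no Ferret calls: the `ScoredDoc` struct, the `rerank` helper and the sample values are made up for illustration, and it assumes the hits (with their scores and relevance values) have already been fetched. It only re-orders the hits it is given, so anything cut off by a search limit is never re-ranked.

```ruby
# Hypothetical container for one search hit; not part of Ferret.
ScoredDoc = Struct.new(:doc_id, :score, :relevance)

# Pair every hit with score * relevance and sort by that product,
# highest first.
def rerank(hits)
  hits.map { |h| [h, h.score * h.relevance] }
      .sort_by { |_, final| -final }
end

hits = [
  ScoredDoc.new(1, 0.82, 3),
  ScoredDoc.new(2, 0.95, 1),
  ScoredDoc.new(3, 0.40, 10)
]

ranked = rerank(hits)
ranked.each { |h, final| puts "doc #{h.doc_id}: #{final.round(2)}" }
```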

I couldn’t figure out how to do this …

I’ve read the ‘Balancing relevancy and recentness’ thread…

 score = yield( doc, score ) if block_given?
This allows a block attached to a search call to adjust
document scores before documents are sorted, based on
some (possibly dynamic) numerical factors associated
with the document, e.g. the number and importance

i guess this works for the pure ruby implementation but won’t work for
the c-implementation?

As long as Ferret does what Lucene does with boosts, you could scale
document boosts at indexing time by some factor related to age and
that will factor into scoring.

Boost won’t help me here, i’ve even set the boost value for relevance to
0.0, as it should not be part of the query…

Is there any way to recalculate the score?

Thanks,
Ben

kuzb · July 6, 2006, 5:56am

On 7/4/06, Benjamin K. wrote:

Now i would like to multiply the relevance of the document with the
 score = yield( doc, score ) if block_given?
This allows a block attached to a search call to adjust
document scores before documents are sorted, based on
some (possibly dynamic) numerical factors associated
with the document, e.g. the number and importance
i guess this works for the pure ruby implementation but won’t work for
the c-implementation?

Hi Ben,
You are right, this is only possible in the pure ruby version. A more
flexible framework for sorting will be coming in the future but
currently you can only sort by integer, float, string, doc_id, and
relevance.

As long as Ferret does what Lucene does with boosts, you could scale
document boosts at indexing time by some factor related to age and
that will factor into scoring.

Boost won’t help me here, i’ve even set the boost value for relevance to
0.0, as it should not be part of the query…

Is there any way to recalculate the score?

How about setting the boost for the whole document rather than just
the :relevance field? Or do you sometimes want to sort by relevance
without taking the :relevance field into account?

Cheers,
Dave

PS: While we are on the topic, how would you like the sort API to
look? Many have complained that the sort API is too java-like but
no-one has suggested any improvements yet. I’d love to see some ideas.

kuzb · July 7, 2006, 7:23pm

Hey David,

thanks for the answer …

How about setting the boost for the whole document rather than just
the :relevance field? Or do you sometimes want to sort by relevance
without taking the :relevance field into account?

ah… you mean i should boost each field of the document? or is there a
way to set a boost level for the document as a whole? if so, i’ve missed
it …

PS: While we are on the topic, how would you like the sort API to
look? Many have complained that the sort API is too java-like but
no-one has suggested any improvements yet. I’d love to see some ideas.

i like the idea of giving a short block with a sort algorithm… i would
like to see something like that:

index.search( :query => my_query,
              :sort => lambda { |doc| new_score }, # some calculation returning new_score
              :reverse => false,
              :filter => false,
              :start => 0,
              :limit => 10 )

alternatively you should be able to give the sort param the name of a
field, like ‘:sort => :score’ or an array of fields like ‘:sort => [
:score, :title ]’ and sort by the first element and then by the 2nd if
the two or more docs share the same value for the 1st element.
I guess something like “:sort => :score” is enough for most people …

i think the other options are almost like it is implemented right now …
i don’t think you need the SortField class.

btw… i do find the filter API not really intuitive, actually i didn’t
understand it at all

i know what you want to do with filters and how you want to get there,
but i haven’t found any understandable documentation, on how to build
one …

maybe you should write a short tutorial on how to write a filter… i
would find it very intuitive, to have something like a base_query… like
having one query to filter/limit results, and have another query to do
the real search…

and btw… one feature i would definitely like to see is limiting
the search to a number of fields…

i know i can write something like

field_one:"search string" || field_two:"search string" ||
field_three:"search string" || field_four:"search string"

but i would like to be able to write something like

(field_one|field_two|field_three|field_four):"search string"

furthermore, you should be able to say something like … search in all
fields, except field_one … like

(*|!field_one):"search string"

Ben

kuzb · July 8, 2006, 1:02am

On 7/8/06, Benjamin K. wrote:

it …
doc = Ferret::Document::Document.new()
doc.boost = 100.0

           :reverse => false,
           :filter => false,
           :start => 0,
           :limit => 10 )

The way sort works at the moment is that it caches all fields that are
sorted on. If you start sorting like this, you have to load every
document in the result set, which would be a huge performance hit. I
guess I could make this feature available though.

In the pure ruby version of Ferret you can do this;

st_length = SortField::SortType.new("length", lambda{|str| str.length})
sf = SortField.new("content", {:sort_type => st_length,
                               :reverse => true,
                               :comparator => lambda{|i,j| j <=> i}})

The sort type lambda allows you to create the sort cache. Then the
comparator lets you compare those two values. This is flexible while
remaining performant, although I still think I can make it more
intuitive.
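
As a rough illustration of that two-step idea (build a cache of sort values once, then compare cached values), here is a sketch in plain Ruby. It does not use Ferret: the `CONTENT` hash stands in for stored field values, and `build_sort_cache` and `sort_hits` are hypothetical helpers, not Ferret internals.

```ruby
# Hypothetical stand-in for the stored "content" field, keyed by doc id.
CONTENT = {
  0 => "short",
  1 => "a much longer piece of content",
  2 => "medium text"
}

sort_type  = lambda { |str| str.length }  # turns a field value into a sort value
comparator = lambda { |i, j| j <=> i }    # compares two sort values (largest first)

# Run the sort-type lambda once per document and remember the result.
def build_sort_cache(field_values, sort_type)
  field_values.map { |doc_id, value| [doc_id, sort_type.call(value)] }.to_h
end

# Order doc ids using only the cached values; no document is reloaded.
def sort_hits(doc_ids, cache, comparator)
  doc_ids.sort { |a, b| comparator.call(cache[a], cache[b]) }
end

cache  = build_sort_cache(CONTENT, sort_type)
sorted = sort_hits([0, 1, 2], cache, comparator)
puts sorted.inspect
```

The point of the cache is that the lambda runs once per document rather than once per comparison.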

alternatively you should be able to give the sort param the name of a
field, like ‘:sort => :score’ or an array of fields like ‘:sort => [
:score, :title ]’ and sort by the first element and then by the 2nd if
the two or more docs share the same value for the 1st element.
I guess something like “:sort => :score” is enough for most people …

Actually, you can already do this. Have you tried it? The one catch is
that :score is treated as a field name, so to sort by relevance you’d
have to do this;

index.search_each(query, :sort => [SortField::RELEVANCE, :title, :price])
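
The tie-breaking order Ben describes (sort by the first key, fall back to the next key when two documents share a value) can be pictured in plain Ruby. This is only an illustration, not Ferret code: the `Hit` struct and the sample data are invented, and sorting highest score first is an assumption about what a relevance sort is expected to do.

```ruby
# Hypothetical hit record; not a Ferret class.
Hit = Struct.new(:title, :price, :score)

hits = [
  Hit.new("b", 5.0, 0.9),
  Hit.new("a", 7.0, 0.9),
  Hit.new("c", 1.0, 0.5)
]

# Arrays compare element by element, so later keys only matter
# when all earlier keys are equal. Negating the score puts the
# highest score first while title and price stay ascending.
sorted = hits.sort_by { |h| [-h.score, h.title, h.price] }
puts sorted.map(&:title).inspect
```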

maybe you should write a short tutorial on how to write a filter… i
would find it very intuitive, to have something like a base_query… like
having one query to filter/limit results, and have another query to do
the real search…

I will. The TermEnum and TermDocEnum are essential for using filters
and they’ve undergone major changes so I’ll hold off on this until I
get the next release out.

(field_one|field_two|field_three|field_four):"search string"

You can do this already, just get rid of the brackets;

field_one|field_two|field_three|field_four:"search string"
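
If the list of fields lives in application code, the query string in that form can be assembled with a small helper. This is a sketch only: `multi_field_query` is a hypothetical name, it just builds a string, and it assumes the phrase contains no double quotes.

```ruby
# Build a query string of the form  f1|f2|f3:"phrase"
# (hypothetical helper; assumes the phrase has no double quotes in it).
def multi_field_query(fields, phrase)
  raise ArgumentError, "need at least one field" if fields.empty?
  "#{fields.join('|')}:\"#{phrase}\""
end

fields = %w[field_one field_two field_three field_four]
puts multi_field_query(fields, "search string")
```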

furthermore, you should be able to say something like … search in all
fields, except field_one … like

(*|!field_one):"search string"

You can’t do this, but it is a nice idea. I’ll think about it. I might
also add the brackets into the syntax.
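
Until something like that exists, one possible workaround is to do the exclusion on the application side and emit the pipe-separated form, which is already supported. The sketch below assumes the application knows the full list of field names. `query_excluding` is a hypothetical helper, and the phrase is assumed to contain no double quotes.

```ruby
# Build  f1|f2:"phrase"  from every known field except the excluded ones.
# The list of all field names must be supplied by the caller (assumption).
def query_excluding(all_fields, excluded, phrase)
  remaining = all_fields - excluded
  raise ArgumentError, "no fields left to search" if remaining.empty?
  "#{remaining.join('|')}:\"#{phrase}\""
end

all_fields = %w[field_one field_two field_three]
puts query_excluding(all_fields, %w[field_one], "search string")
```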

Anyway, thanks for your feedback Ben. I will definitely use it.

Cheers,
Dave