Understanding boost?

Neville_B · September 20, 2006, 9:56am

Hi,

I’m confused about managing field boosting …

I have set the :boost for the :name field in my docs to 10, via :boost
=> 10

Then I performed a search for ‘keith’ over all fields via
*:(keith), expecting a doc with Keith in the :name field to come out on
top. But another doc with Keith mentioned in other fields (:comments,
:address) scored higher.

I viewed the explanation from the searcher, but it wasn’t clear to me
why the boost wasn’t pushing the :name = Keith document to the top.

Any help on understanding field boosting and explain would be great.

Regards

Neville

PS, the two explains are:

Doc1:
0.3352959 = product of:
  8.047102 = sum of:
    4.011141 = weight(comments:<keith|[email protected]|keithex> in 4697), product of:
      0.5685414 = query_weight(comments:<keith|[email protected]|keithex>), product of:
        28.22057 = idf(comments:<(keithex=1) + ([email protected]=1) + (keith=115) = 117>)
        0.02014635 = query_norm
      7.055143 = field_weight(comments:<keith|[email protected]|keithex> in 4697), product of:
        1.0 = The sum of:
          1.0 = tf(term_freq(comments:keithex)=1)^1.0
        28.22057 = idf(comments:<(keithex=1) + ([email protected]=1) + (keith=115) = 117>)
        0.25 = field_norm(field=comments, doc=4697)
    4.03596 = weight(address:<keith|keithex> in 4697), product of:
      0.4032613 = query_weight(address:<keith|keithex>), product of:
        20.0166 = idf(address:<(keithex=1) + (keith=8) = 9>)
        0.02014635 = query_norm
      10.0083 = field_weight(address:<keith|keithex> in 4697), product of:
        1.0 = The sum of:
          1.0 = tf(term_freq(address:keithex)=1)^1.0
        20.0166 = idf(address:<(keithex=1) + (keith=8) = 9>)
        0.5 = field_norm(field=address, doc=4697)
  0.04166667 = coord(2/48)

Doc2:
0.2977623 = product of:
  14.29259 = weight(name:<keith> in 31416), product of:
    0.2028171 = query_weight(name:<keith>), product of:
      10.06719 = idf(name:<(keith=3) = 3>)
      0.02014635 = query_norm
    70.47034 = field_weight(name:<keith> in 31416), product of:
      1.0 = The sum of:
        1.0 = tf(term_freq(name:keith)=1)^1.0
      10.06719 = idf(name:<(keith=3) = 3>)
      7.0 = field_norm(field=name, doc=31416)
  0.02083333 = coord(1/48)

Neville_B · September 20, 2006, 12:42pm

Hi!

On Wed, Sep 20, 2006 at 03:40:03PM +1000, Neville B. wrote:

I viewed the explanation from the searcher, but it wasn’t clear to me
why the boost wasn’t pushing the :name = Keith document to the top.

As you can see from the explanation, the scores for both fields that
matched the query got summed up (8… = sum of:). If ‘keith’ had only
shown up in one field, the other document would have had the higher
score.
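
The arithmetic in the two explanations can be replayed in plain Ruby. This is only a sketch: the numbers are copied from the first post, and the helper name `score` is made up for illustration, it is not a Ferret method.

```ruby
# Numbers copied from the explanations in the first post.
# Each matching field contributes query_weight * field_weight.
DOC1_FIELDS = {
  comments: { query_weight: 0.5685414, field_weight: 7.055143 },
  address:  { query_weight: 0.4032613, field_weight: 10.0083 }
}
DOC2_FIELDS = {
  name: { query_weight: 0.2028171, field_weight: 70.47034 }
}
TOTAL_CLAUSES = 48 # denominator of coord in both explanations

# Hypothetical helper (not part of Ferret): sum the per-field weights,
# then scale by coord = matching clauses / total clauses.
def score(fields, total_clauses = TOTAL_CLAUSES)
  sum = fields.values.sum { |f| f[:query_weight] * f[:field_weight] }
  sum * fields.size.to_f / total_clauses
end

doc1 = score(DOC1_FIELDS)
doc2 = score(DOC2_FIELDS)
# Counterfactual: Doc1 matches in the address field only.
doc1_address_only = score(DOC1_FIELDS.slice(:address))

puts format('Doc1: %.4f', doc1)
puts format('Doc2: %.4f', doc2)
puts format('Doc1, address only: %.4f', doc1_address_only)
```

With both fields counted, Doc1 comes out ahead of Doc2. With the address match alone it falls well below Doc2, which is the case described above.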

I don’t know of any methodology to determine the proper boost setting
for a field, imho it’s just a question of experimenting with queries and
the results you expect.

If you always want to have matches in the name ranked on the top,
regardless of how many times a term is mentioned in other parts of your
document, set the boost to 100

I don’t know what the coord value is, though, maybe someone else can
step in here ?

Jens

Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Neville_B · September 20, 2006, 2:06pm

Hi Neville,

The field’s boost value affects the field_norm value in the
Explanations above. Here is how it is calculated:

field_norm = field_info->boost * doc->boost * field->boost *
            (1 / sqrt(field->num_terms))

So as you can see from the Explanations above, field_norm is 7.0 on
the boosted field which is more than 10 times the field_norms on the
other two fields (0.25, 0.5) so at least you can see the boost is
having an effect. The address field probably has a higher field_norm
value than the comments field because the comments field is longer
(see that last part of the field_norm equation). Note that the reason
the field_norm is 7.0 and not 10.0 is that the field_norm gets stored in a
single byte so there is quite a large loss of precision.
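
As a rough illustration of the formula above, here is a small Ruby sketch. The term counts are hypothetical, since the thread does not say how many terms each field holds, and the method and argument names are mine rather than Ferret's. The sketch also leaves out the single-byte storage step, so it shows the value before any precision is lost.

```ruby
# Sketch of the field_norm formula quoted above, before the value is
# squeezed into a single byte. Names are illustrative, not Ferret's API.
def field_norm(field_info_boost:, num_terms:, doc_boost: 1.0, field_boost: 1.0)
  field_info_boost * doc_boost * field_boost * (1.0 / Math.sqrt(num_terms))
end

# Hypothetical term counts, NOT taken from the actual index.
name_one_term  = field_norm(field_info_boost: 10.0, num_terms: 1)
name_two_terms = field_norm(field_info_boost: 10.0, num_terms: 2)
address_norm   = field_norm(field_info_boost: 1.0,  num_terms: 4)
comments_norm  = field_norm(field_info_boost: 1.0,  num_terms: 16)

puts format('name (1 term):  %.3f', name_one_term)
puts format('name (2 terms): %.3f', name_two_terms)
puts format('address:        %.3f', address_norm)
puts format('comments:       %.3f', comments_norm)
```

Under these assumed counts, the longer comments field ends up with a smaller norm than the address field, and the boosted name field stays far above both.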

Having said all this, there does seem to be a problem with the
calculations. I don’t think I’ve calculated the idf value correctly
for MultiTermQueries. I’ve rectified this in subversion so the next
version should give your results in an order that you’d expect.

For information on tf and idf, check out this page:

http://en.wikipedia.org/wiki/Tf-idf

Hope that helps. I’d love to give a better explanation of the scoring
but I don’t have time right now.

Cheers,
Dave

Neville_B · September 20, 2006, 3:40pm

On 9/20/06, Jens K. wrote:

I don’t know what the coord value is, though, maybe someone else can
step in here ?

Jens

The coord factor is the number of clauses in a BooleanQuery that
matched, divided by the total number of clauses. In this example
there were 48 clauses. When you submit a query over all fields (i.e.
“*:term”), the query is rewritten as a boolean query with a clause for
every field in your index. So it would seem that Neville has 48 fields
in his index.
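
To make the coord factor concrete, here is a minimal Ruby sketch using the clause counts from the two explanations. The method name `coord` is only for illustration here and is not a call into Ferret.

```ruby
# coord = matching clauses / total clauses in the rewritten BooleanQuery.
def coord(matching_clauses, total_clauses)
  matching_clauses.to_f / total_clauses
end

doc1_coord = coord(2, 48) # comments and address matched
doc2_coord = coord(1, 48) # only name matched

puts format('Doc1 coord: %.8f', doc1_coord)
puts format('Doc2 coord: %.8f', doc2_coord)
puts format('ratio: %.1f', doc1_coord / doc2_coord)
```

Doc1's coord is twice Doc2's because it matched in two fields. Doc2's sum of weights (14.29) is less than twice Doc1's (8.05), so that factor of two is enough to put Doc1 ahead.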

Hope that makes sense,

Dave

PS: This might be a good time to mention that if you have an index
with a lot of fields like this, it is probably worth thinking about
what to set the :default_field and :all_fields parameters to.
:all_fields is what “*:#{query}” expands to. It doesn’t necessarily
have to be all fields in the index. Usually you only want “*” to
expand to all text fields, not actually all fields. For example, you’d
probably want date fields to be excluded. And I’ve only just fixed
this so it will work when you use a Ferret::Index::Index object.
Previously the QueryParser had all fields in the index added to the
:all_fields parameter. Now that only happens if :all_fields isn’t set
explicitly.