Sorting performance

Pedro_SSISO-8859-1 · July 31, 2006, 12:06pm

I’m using acts_as_ferret to index one of my rails models. Right after I
start the app the first request that orders by some ferret field will
take very long. Subsequent ones seem to be fast. I guess some caching is
going on. Any tips on solving this?

Pedro.

Pedro_SSISO-8859-1 · July 31, 2006, 12:22pm

On 7/31/06, Pedro Côrte-Real wrote:

I’m using acts_as_ferret to index one of my rails models. Right after I
start the app the first request that orders by some ferret field will
take very long. Subsequent ones seem to be fast. I guess some caching is
going on. Any tips on solving this?

Pedro.

You guessed correctly. The sort fields are cached. You can easily
preload the cache by running a search when you start up your app. You
should also be careful what fields you sort on. You should only sort
on untokenized fields. You can also speed up sorting by dates by
lowering the precision that you use. For example, if you are storing
the date with time to the nearest second, eg 2006-08-01 10:13:24 you
may get a much faster sort by only storing up to the nearest day, ie
2006-08-01. By the way, what kind of times are we talking about here?

Cheers,
Dave
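To make the date-precision point concrete, here is a small sketch in plain Ruby (no Ferret involved). It counts how many distinct terms a sort field ends up with when the same timestamps are stored to the second versus to the day. The helper name `date_term` and the sample data are made up for illustration.

```ruby
# Format a timestamp as an index term at a chosen precision.
# Both formats are zero-padded, so the terms also sort correctly as plain strings.
def date_term(time, precision = :day)
  case precision
  when :second then time.strftime('%Y%m%d%H%M%S')
  when :day    then time.strftime('%Y%m%d')
  else raise ArgumentError, "unknown precision: #{precision.inspect}"
  end
end

# Hypothetical sample data: 1,000 timestamps, one every 7 minutes.
start = Time.utc(2006, 8, 1, 10, 13, 24)
times = Array.new(1000) { |i| start + i * 7 * 60 }

second_terms = times.map { |t| date_term(t, :second) }.uniq
day_terms    = times.map { |t| date_term(t, :day) }.uniq

puts "second precision: #{second_terms.size} distinct terms"
puts "day precision:    #{day_terms.size} distinct terms"
```

Fewer distinct terms means less work when the sort cache is built, which is presumably where the speed-up described above comes from.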

Pedro_SSISO-8859-1 · July 31, 2006, 12:28pm

On Mon, 2006-07-31 at 19:17 +0900, David B. wrote:

should also be careful what fields you sort on. You should only sort
on untokenized fields.

Is it ok if the field isn’t stored in the index?

Anyone know how to set a field to be untokenized in acts_as_ferret?

You can also speed up sorting by dates by
lowering the precision that you use. For example, if you are storing
the date with time to the nearest second, eg 2006-08-01 10:13:24 you
may get a much faster sort by only storing up to the nearest day, ie
2006-08-01.

I’m only using dates so it should be alright.

By the way, what kind of times are we talking about here?

300 seconds for a 100MB index.

Pedro.

Pedro_SSISO-8859-1 · July 31, 2006, 6:09pm

On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:

Anyone know how to set a field to be untokenized in acts_as_ferret?

I forgot that I was actually supplying my own #to_doc so it was a matter
of changing it to not tokenize the fields I want. When using
acts_as_ferret the regular way I don’t know if this is possible.

Pedro.

Pedro_SSISO-8859-1 · July 31, 2006, 6:10pm

On Mon, 2006-07-31 at 19:17 +0900, David B. wrote:

By the way, what kind of times are we talking about here?

I added a preloading of this at the start of my app and it takes 14
minutes for a 100MB index with 4 fields I order by. Any way to speed
this up? Shouldn’t this be cached in the on-disk structure?

Don’t think I’m being critical, ferret is great software, many thanks
for it.

Pedro.
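When several sort fields are preloaded at startup, timing each one separately shows which field dominates the wait. The sketch below is plain Ruby and deliberately does not call Ferret: the block is where the real sorted search would go, and the helper name `warm_sort_caches` and the field names are hypothetical.

```ruby
# Run one warm-up action per sort field and record how long each took.
# The block is expected to run any search sorted by the given field;
# how that search is issued depends on the Ferret version in use.
def warm_sort_caches(sort_fields)
  sort_fields.each_with_object({}) do |field, timings|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield field
    timings[field] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

# Usage with a stand-in block (no real index involved).
timings = warm_sort_caches(%w[date title_sort author_sort]) do |field|
  sleep 0.01 # placeholder for: run a query sorted by `field`
end

# Print the slowest field first.
timings.sort_by { |_, secs| -secs }.each do |field, secs|
  puts format('%-12s %.3fs', field, secs)
end
```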

Pedro_SSISO-8859-1 · July 31, 2006, 7:40pm

On Mon, Jul 31, 2006 at 04:10:03PM +0100, Pedro Côrte-Real wrote:

On Mon, 2006-07-31 at 11:26 +0100, Pedro Côrte-Real wrote:

Anyone know how to set a field to be untokenized in acts_as_ferret?

I forgot that I was actually supplying my own #to_doc so it was a matter
of changing it to not tokenize the fields I want. When using
acts_as_ferret the regular way I don’t know if this is possible.

it is, just provide a hash with the desired options to each field name:

acts_as_ferret(
  :fields => {
    'title'       => { :boost => 2 },
    'description' => { :boost => 1,
                       :index => Ferret::Document::Field::Index::UNTOKENIZED }
  })

options that can be set this way are (with their defaults given):

:store => Ferret::Document::Field::Store::NO
:index => Ferret::Document::Field::Index::TOKENIZED
:term_vector => Ferret::Document::Field::TermVector::NO
:binary => false
:boost => 1.0

Jens

–
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66

Pedro_SSISO-8859-1 · August 1, 2006, 11:30am

On Mon, 2006-07-31 at 19:36 +0200, Jens K. wrote:

                   :index => Ferret::Document::Field::Index::UNTOKENIZED 
                 }
})

options that can be set this way are (with their defaults given):

:store => Ferret::Document::Field::Store::NO
:index => Ferret::Document::Field::Index::TOKENIZED
:term_vector => Ferret::Document::Field::TermVector::NO
:binary => false
:boost => 1.0

Cool. Didn’t know about this. I started reading the code to understand
how it worked but then remembered I was doing my own to_doc so I should
just change that. I’ll be sure to remember that for any future projects.

By the way, does storing the TermVectors only increase the size of the
index or does it alter performance in any way?

Pedro.

Pedro_SSISO-8859-1 · August 1, 2006, 2:27am

On 8/1/06, Pedro Côrte-Real wrote:

On Mon, 2006-07-31 at 19:17 +0900, David B. wrote:

By the way, what kind of times are we talking about here?

I added a preloading of this at the start of my app and it takes 14
minutes for a 100MB index with 4 fields I order by. Any way to speed
this up? Shouldn’t this be cached in the on-disk structure?

How many documents and what is the date range (eg 2001-01-01 →
2006-08-01). These are the critical variables for sort performance.
Once I know these numbers I’ll be able to replicate the task here and
I’ll see what I can do.

Don’t think I’m being critical, ferret is great software, many thanks
for it.

No offence taken. I’d definitely like to be able to help. I’m guessing
I’ll probably have to optimize the C code to rectify this.

Cheers,
Dave

Pedro_SSISO-8859-1 · August 1, 2006, 11:40am

On 8/1/06, Pedro Côrte-Real wrote:

By the way, does storing the TermVectors only increase the size of the
index or does it alter performance in any way?

It increases the size of the index and affects indexing performance
since a lot of extra data needs to be written and merged during the
indexing process. Search performance won’t be affected.

Dave

Pedro_SSISO-8859-1 · August 1, 2006, 11:37am

On Tue, 2006-08-01 at 09:24 +0900, David B. wrote:

How many documents and what is the date range (eg 2001-01-01 ->
2006-08-01). These are the critical variables for sort performance.
Once I know these numbers I’ll be able to replicate the task here and
I’ll see what I can do.

I have around 600_000 documents and the date range is rather large,
something like from year 1000 to now. I don’t know for sure but I can
check if it makes a difference.

But not all my sort fields are dates. I also have regular text fields
that I have now made untokenized (by using separate fields for sorting
and searching). Got to check if that made them faster.

Don’t think I’m being critical, ferret is great software, many thanks
for it.

No offence taken. I’d definitely like to be able to help. I’m guessing
I’ll probably have to optimize the C code to rectify this.

That would be great,

Thanks,

Pedro.
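The "separate fields for sorting and searching" approach could look roughly like this when building the values for one document. This is a sketch in plain Ruby: the field names (`:title`, `:title_sort`, `:date`) are hypothetical, and a whitespace split stands in for a real analyzer.

```ruby
# Build the field values for one document, keeping a tokenized field for
# searching and a separate single-term field for sorting.
def document_fields(title, date)
  {
    :title      => title,                  # analysed into several terms for searching
    :title_sort => title.strip.downcase,   # one term per document, meant to be indexed untokenized
    :date       => date.strftime('%Y%m%d') # zero-padded, so string order equals date order
  }
end

title  = 'The History of Lisbon'
fields = document_fields(title, Time.utc(1755, 11, 1))

# A tokenized field gives several terms per document, so there is no
# single value to sort by; the untokenized copy gives exactly one.
puts "tokenized terms:  #{title.downcase.split.inspect}"
puts "untokenized term: #{fields[:title_sort].inspect}"
puts "date term:        #{fields[:date].inspect}"
```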

Pedro_SSISO-8859-1 · August 1, 2006, 12:02pm

On 8/1/06, Pedro Côrte-Real wrote:

But not all my sort fields are dates. I also have regular text fields
that I have now made untokenized (by using separate fields for sorting
and searching). Got to check if that made them faster.

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you’ll still have the long wait occasionally. I’ll
look into it anyway at a later stage.

Another idea that I can implement now is to add a BYTES sort type
which would basically sort by the order the terms appear in the index.
Let’s say you index dates in the format “YYYYMMDD” and you sort by
INTEGER. Every time you load the sort index you need to go through
every single date and convert it from string to integer. But this is
unnecessary since the dates are already in order in the index. A BYTES
sort type would take advantage of this. You’d get an even bigger
benefit for ASCII strings. strcoll is used to sort strings but this is
unnecessary for ASCII strings as they are already correctly ordered in
the index. Also, the index needs to keep each string in memory which
would also be unnecessary.

Sorry if this isn’t very clear. I’m not sure how much it will help.
We’ll have to wait and see.

Dave
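The observation behind the BYTES idea can be checked in plain Ruby, without Ferret: for fixed-width, zero-padded terms such as "YYYYMMDD", byte-wise string order and integer order agree, so the string-to-integer conversion adds nothing. The helper name below is made up for illustration.

```ruby
# True when sorting the terms byte-wise gives the same order as
# converting each term to an integer first.
def byte_order_matches_integer_order?(terms)
  terms.sort == terms.sort_by { |t| t.to_i }
end

# Dates indexed as fixed-width "YYYYMMDD" strings.
dates = %w[20060801 10000101 19991231 20010101 17551101]
puts byte_order_matches_integer_order?(dates)

# The equivalence breaks as soon as the width varies.
mixed = %w[9 10 200]
puts byte_order_matches_integer_order?(mixed)
puts mixed.sort.inspect                   # byte order
puts mixed.sort_by { |s| s.to_i }.inspect # numeric order
```

So a byte-order sort is only safe for numeric fields whose terms are padded to a constant width.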

Pedro_SSISO-8859-1 · August 1, 2006, 11:56am

On Tue, 2006-08-01 at 18:39 +0900, David B. wrote:

By the way, does storing the TermVectors only increase the size of the
index or does it alter performance in any way?

It increases the size of the index and affects indexing performance
since a lot of extra data needs to be written and merged during the
indexing process. Search performance won’t be affected.

Ah, but the default when creating a new field is already not to store it
so I’m already doing it.

Pedro.

Pedro_SSISO-8859-1 · August 1, 2006, 12:09pm

On Tue, 2006-08-01 at 18:59 +0900, David B. wrote:

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you’ll still have the long wait occasionally. I’ll
look into it anyway at a later stage.

For my application this wouldn’t really be a problem since data is only
loaded maybe once a week. But does the cache need to be recalculated
completely? Database indexes work incrementally.

Another idea that I can implement now is to add a BYTES sort type
which would basically sort by the order the terms appear in the index.
Let’s say you index dates in the format “YYYYMMDD” and you sort by
INTEGER. Every time you load the sort index you need to go through
every single date and convert it from string to integer. But this is
unnecessary since the dates are already in order in the index. A BYTES
sort type would take advantage of this.

For my date fields this would work.

You’d get an even bigger
benefit for ASCII strings. strcoll is used to sort strings but this is
unnecessary for ASCII strings as they are already correctly ordered in
the index. Also, the index needs to keep each string in memory which
would also be unnecessary.

One of my text order fields should have nothing but ASCII. The other is
a title and can include arbitrary UTF-8, so I guess it wouldn’t work for
that one.
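A quick plain-Ruby illustration of why byte order is a problem for UTF-8 titles: Ruby's `String#<=>` compares bytes, so accented letters land after all ASCII letters. The `ascii_sort_key` helper is a hypothetical workaround (deriving an ASCII key at index time and sorting on that), not something Ferret is known to provide.

```ruby
titles = ['Zebra', 'Évora', 'Eclipse', 'apple']

# Byte-wise order: uppercase ASCII, then lowercase ASCII, then multi-byte characters.
puts titles.sort.inspect

# Hypothetical workaround: fold each title to a lowercase ASCII sort key.
def ascii_sort_key(title)
  title.unicode_normalize(:nfd) # split accented letters into base letter + combining mark
       .gsub(/\p{Mn}/, '')      # drop the combining marks
       .downcase
end

puts titles.sort_by { |t| ascii_sort_key(t) }.inspect
```

The folded key is only an approximation of locale-aware collation, but it keeps the sort field pure ASCII so a byte-order sort would give a sensible result.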

Sorry if this isn’t very clear. I’m not sure how much it will help.
We’ll have to wait and see.

The BYTES ordering would probably speed it up but for my specific case,
storing it on disk would be perfect. It would probably be a very good
thing in case someone uses ferret to code command line tools that access
a common index. Without storing the sorting on disk it will get
recreated every time a command is run.

Pedro.
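For the command-line case, the general shape of an on-disk sort cache can be sketched in plain Ruby. This is not how Ferret works internally; it only illustrates the idea discussed above. `index_version` is a hypothetical stamp that changes whenever documents are added, and the block stands in for the expensive step of building the sort order.

```ruby
require 'tmpdir'

# Load a cached sort order from disk, or rebuild it when the index has changed.
def cached_sort_order(cache_path, index_version)
  if File.exist?(cache_path)
    cached = File.open(cache_path, 'rb') { |f| Marshal.load(f) }
    return cached[:order] if cached[:version] == index_version
  end

  order = yield # expensive: compute the sort order from scratch
  File.open(cache_path, 'wb') do |f|
    Marshal.dump({ :version => index_version, :order => order }, f)
  end
  order
end

# Toy "index": document number => sort term.
terms = { 0 => '20060801', 1 => '10000101', 2 => '19991231' }

Dir.mktmpdir do |dir|
  path = File.join(dir, 'date.sortcache')
  2.times do |run|
    order = cached_sort_order(path, 1) do
      puts "run #{run}: rebuilding sort order"
      terms.keys.sort_by { |doc| terms[doc] }
    end
    puts "run #{run}: #{order.inspect}"
  end
end
```

The whole cache is rebuilt whenever the version changes, which matches the limitation Dave describes; an incremental update would need more bookkeeping.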

Pedro_SSISO-8859-1 · August 2, 2006, 4:14am

On 8/1/06, Pedro Côrte-Real wrote:

On Tue, 2006-08-01 at 18:59 +0900, David B. wrote:

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you’ll still have the long wait occasionally. I’ll
look into it anyway at a later stage.

For my application this wouldn’t really be a problem since data is only
loaded maybe once a week. But does the cache need to be recalculated
completely? Database indexes work incrementally.

Have you tried optimizing your index? I found an order of magnitude
difference in speed here with an optimized index. Even with 1,000,000
unique documents though sorting is taking less than 10 seconds for an
unoptimized index and less than 1 second for optimized index. What
kind of system are you running on?

Dave

Pedro_SSISO-8859-1 · August 3, 2006, 12:17pm

On Wed, 2006-08-02 at 11:13 +0900, David B. wrote:

completely? Database indexes work incrementally.

Have you tried optimizing your index? I found an order of magnitude
difference in speed here with an optimized index. Even with 1,000,000
unique documents though sorting is taking less than 10 seconds for an
unoptimized index and less than 1 second for optimized index. What
kind of system are you running on?

I was guessing acts_as_ferret did that. But apparently only on
rebuild_index. I’ll try adding an optimize call at the start of the app.

I’m running this on a 2.66 GHz Celeron with 1GB ram.

Pedro.

Pedro_SSISO-8859-1 · August 1, 2006, 12:34pm

On 8/1/06, Pedro Côrte-Real wrote:

On Tue, 2006-08-01 at 18:59 +0900, David B. wrote:

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you’ll still have the long wait occasionally. I’ll
look into it anyway at a later stage.

For my application this wouldn’t really be a problem since data is only
loaded maybe once a week. But does the cache need to be recalculated
completely? Database indexes work incrementally.

Sure it’s possible but it’s a fair bit of work. Lucene doesn’t have
anything like this yet (not that that has stopped me adding features
before). I’ll think about it.

Dave

Pedro_SSISO-8859-1 · August 3, 2006, 5:28pm

On Wed, 2006-08-02 at 11:13 +0900, David B. wrote:

Have you tried optimizing your index? I found an order of magnitude
difference in speed here with an optimized index. Even with 1,000,000
unique documents though sorting is taking less than 10 seconds for an
unoptimized index and less than 1 second for optimized index. What
kind of system are you running on?

You were right. I benchmarked it at about 10x faster to preload the
indexes, even counting the time to run #optimize.

Thanks for the tip.

Pedro.