Hitting Files per Directory Limits with Ferret?

Hey all!

We’ve been using Ferret to great success these past six months. But
recently we’ve tried adding many new ContentItems (the only thing being
indexed by Ferret at the moment), and things came crashing to a halt.

ferret gem: 0.10.9
acts_as_ferret plugin (not sure which version)

How we’re using the plugin:

class ContentItem < ActiveRecord::Base
  acts_as_ferret :fields => { 'title'       => {},
                              'description' => {} }
end

In the directory (on production) 'index/production/content_item', there
are now 45,812 files. (This is on Fedora Core 5, by the way.)

This leads me to believe the file count could be the culprit… and if it
isn’t the culprit now, it will be soon.

ContentItem.count
=> 19603

Any ideas?

Any help would be mucho appreciated. Thanks!

Ferret can optimize its index, which will collapse the files in an index
directory. Sadly enough, acts_as_ferret does not call it unless you
choose to rebuild its entire index. This could solve your problem:
ContentItem.rebuild_index.

This might take a while…
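In case it helps, a minimal console sketch. ContentItem.rebuild_index is
what acts_as_ferret provides; the direct Ferret::Index::Index call is
based on my reading of the 0.10.x API, so check it against your
installed gem:

# From script/console on the production box:
ContentItem.rebuild_index  # re-adds every record, then optimizes

# Or, to optimize the existing index in place without a full rebuild
# (path taken from the directory you mentioned):
require 'ferret'
index = Ferret::Index::Index.new(:path => 'index/production/content_item')
index.optimize             # merges the many segment files down to a few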

Regards,
Ewout

Just a heads up… rebuilding the index did the trick.
http://www.ruby-forum.com/topic/89245

I’m curious though, how many items can Ferret reasonably be expected to
scale to?

And, if anyone has hit Ferret’s natural limits, are there any solutions
(e.g. partitioning the index into manageable chunks) that still use
Ferret as the base search indexer / engine?

Fez Bojangles wrote:

We’ve been using Ferret to great success these past six months. But
recently we’ve tried adding many new ContentItems (the only thing being
indexed by Ferret at the moment), and things came crashing to a halt. […]

Actually, it does not. The only call to index.optimize is in the
rebuild_index method. A possible extension for aaf would be to call
index.optimize automatically every C insertions, where C is some
constant (1,000 seems reasonable).

I can only agree with Jan on scalability; at the moment I’m keeping an
index of over 700,000 bibliographic records. Searches are instant.

Regards,
Ewout

Hey Fez,

the limit of indexed items for Ferret (and Lucene) shouldn’t be in the
thousands but in the millions. I’ve indexed hundreds of thousands of
documents myself with Ferret as well as with Lucene, and 20,000 is not
even near the limit. Regarding the file count in the index directory: it
seems as if the index was never optimized. Optimization defragments the
chunks into one big index file. You should investigate why this didn’t
happen. I haven’t looked into the aaf code for some time, but I think it
should do index optimization from time to time.

Cheers,
Jan

Ferret itself does not automatically optimize after a certain number of
document insertions?

Lucene does, but maybe Ferret does not? With Lucene, that optimization
certainly causes indexing hiccups when it kicks in, so care has to be
taken to account for the possible optimization delay, or to tune the
parameters so you know when to expect it.
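For Ferret, I believe the knobs in question are the IndexWriter options,
which Ferret::Index::Index passes through. Something like the sketch
below; the option names are from the 0.10.x documentation as I remember
it, so treat this as an assumption rather than gospel:

require 'ferret'

# :merge_factor controls how many segments accumulate before Ferret
# merges them: lower values merge more aggressively (fewer files, more
# frequent pauses), higher values defer that work.
# :max_buffered_docs is how many documents are buffered in memory before
# a new segment is flushed to disk.
index = Ferret::Index::Index.new(:path              => 'index/production/content_item',
                                 :merge_factor      => 100,
                                 :max_buffered_docs => 1000)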

Erik

I created a patch for acts_as_ferret that will optimize the index every
100 insertions (experience will have to show whether this constant is
adequate).

The only prerequisite is that your model has an id attribute that
automatically increases by 1, since the id is used to determine when to
optimize.

Just apply this patch to instance_methods.rb of acts_as_ferret to try
it.

Hope this will be of use.
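(The patch itself was attached to the original post and isn’t reproduced
here. Purely as an illustration of the idea, a hypothetical version done
as a model callback could look like the following; the ferret_index
class accessor is an assumption about the aaf version in use, so adjust
to taste.)

class ContentItem < ActiveRecord::Base
  acts_as_ferret :fields => { 'title' => {}, 'description' => {} }

  OPTIMIZE_EVERY = 100  # the constant from the patch; tune as needed

  after_create :maybe_optimize_ferret_index

  private

  # Relies on ids increasing by 1, as described above.
  def maybe_optimize_ferret_index
    self.class.ferret_index.optimize if id % OPTIMIZE_EVERY == 0
  end
end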

If Ferret were to implement automatic optimization, it should indeed be
optional and parameterizable.

For example: suppose you are indexing 500,013 documents. After indexing,
you would naturally call index.optimize. But suppose Ferret
automatically optimizes every 1,000 insertions. Obviously, there’s a lot
of overhead here (optimizing 501 times instead of just once).

The ideal solution would be parallel:

  • index optimization happens in a separate process
  • while optimizing, the old index is still available

Is this possible now? Is Ferret safe enough to allow one process to
optimize the index while another is using it?

Also, does anyone have data about the duration of an optimization run? I
don’t think it takes too long, but I haven’t got any concrete data on
that (yet).
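If someone wants to measure it, a quick sketch with Benchmark from the
standard library (using the index path from the original post):

require 'benchmark'
require 'ferret'

index = Ferret::Index::Index.new(:path => 'index/production/content_item')
seconds = Benchmark.realtime { index.optimize }
puts "optimizing took #{'%.1f' % seconds} seconds"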

Ewout