Advise on slowness in bootstrapping?

I am looking at using ferret/aaf to supplement my querying against a
medium and a large table with lots of columns. Some facts first:

Ferret 0.11.4
AAF 0.4.0
Ruby 1.8.6
Rails 1.2.3

Medium table:
105,464 rows
168 columns (mostly varchar(20))
11 actual columns indexed in aaf, plus
40 virtual columns indexed in aaf (a virtual column is the concat of two
physical columns, e.g. cast_first_name_1 + cast_last_name_1 through
cast_first_name_20 + cast_last_name_20)
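
To make the virtual columns concrete, here is a minimal sketch of one way such fields could be built in plain Ruby. The class name ProgramRow and the method names cast_name_1 through cast_name_20 are hypothetical, invented for illustration; the thread does not show how the virtual columns are actually defined.

```ruby
# Hypothetical stand-in for one table row (not an ActiveRecord model).
class ProgramRow
  CAST_SLOTS = 1..20

  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
  end

  # Define cast_name_1 .. cast_name_20. Each joins the matching first-name
  # and last-name columns with a space; missing (nil) parts are skipped.
  CAST_SLOTS.each do |n|
    define_method("cast_name_#{n}") do
      [attributes["cast_first_name_#{n}"],
       attributes["cast_last_name_#{n}"]].compact.join(' ')
    end
  end
end

row = ProgramRow.new('cast_first_name_1' => 'Jane', 'cast_last_name_1' => 'Doe')
puts row.cast_name_1          # prints "Jane Doe"
puts row.cast_name_2.inspect  # prints "" (both columns missing)
```

Each such method is a candidate for an indexed field, so every row produces 20 extra string concatenations of this kind before it reaches the index.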

Large table:
1,244,716 rows
same column/index structure

These tables are not updated via Ruby, only read. I am trying to use
rebuild_index to bootstrap the medium sized table and it is taking a
very long time (running for about 4 hours, indicates 50% complete with
4 hours remaining) and creating a massive number of files in the index
directory (currently about 65k, was 90k earlier).

I have not done any tuning of ferret/aaf so far, and I fear what it will
look like to do the big table. Does anyone have any advise on how to
speed this process up? Because the tables are updated by an external
batch process, if I were to continue down this ferret/aaf path, I’d have
to be looking at running this rebuild_index a couple of times per week
which would be rather painful given the present time and might not be
possible if the large table took more than 48 hours…
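
As a rough back-of-the-envelope check on that 48-hour worry, the sketch below extrapolates the medium-table timing to the large table. It assumes indexing time scales linearly with row count, which is only an assumption (segment merging and database paging may not behave that way), and it takes the medium-table total as 8 hours (4 elapsed plus 4 remaining).

```ruby
MEDIUM_ROWS  = 105_464
LARGE_ROWS   = 1_244_716
MEDIUM_HOURS = 8.0 # about 4 hours elapsed plus 4 hours reported remaining

# Linear extrapolation: same time per row for both tables (an assumption).
def estimated_hours(rows, reference_rows, reference_hours)
  reference_hours * rows / reference_rows
end

large_hours = estimated_hours(LARGE_ROWS, MEDIUM_ROWS, MEDIUM_HOURS)
puts format('%.1f hours (about %.1f days)', large_hours, large_hours / 24)
# prints "94.4 hours (about 3.9 days)"
```

Under that assumption the large table would land well past the 48-hour mark, so the rebuild rate itself has to improve before a twice-weekly schedule is realistic.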

p.s. Please forgive my lack of attention to the changes I let the
spell checker make. All instances of the verb advise should be
mentally replaced with the noun advice. :slight_smile:

On Thu, Jun 07, 2007 at 05:19:26PM +0000, Daniel Einspanjer wrote:

> 168 columns (mostly varchar(20))
> rebuild_index to bootstrap the medium sized table and it is taking a very long
> time (running for about 4 hours, indicates 50% complete with 4 hours remaining)
> and creating a massive number of files in the index directory (currently about
> 65k, was 90k earlier)

Strange. Ferret is faster than that: I have a test script that builds
an index of 100,000 documents, with 50 fields each containing a single
random word, in under 10 minutes here on standard hardware.

Maybe the problem is something else? For starters, change line 220
of local_index.rb from

    index << rec.to_doc if rec.ferret_enabled?(true)

to

    doc = rec.to_doc if rec.ferret_enabled?(true)

so nothing is added to the index. How long does that take?
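
The idea behind that change, timing document construction separately from the index write, can be sketched in isolation. Everything below is a hypothetical stand-in: FakeRecord and the plain Array used as an "index" are not part of Ferret or aaf, so the timings only illustrate the method of comparison, not real indexing cost.

```ruby
require 'benchmark'

# Hypothetical record that builds a 50-field document, loosely mirroring
# the "50 fields, one word each" test mentioned above.
FakeRecord = Struct.new(:id) do
  def to_doc
    doc = { 'id' => id }
    50.times { |i| doc["field_#{i}"] = "word#{id + i}" }
    doc
  end
end

# Time one pass over the records. With index = nil the document is built
# but discarded, which is what the suggested one-line change does.
def time_rebuild(records, index = nil)
  Benchmark.realtime do
    records.each do |rec|
      doc = rec.to_doc
      index << doc if index
    end
  end
end

records = Array.new(1_000) { |i| FakeRecord.new(i) }
index = [] # stand-in for a real index; only collects documents

build_only    = time_rebuild(records)
build_and_add = time_rebuild(records, index)
puts format('build only: %.4fs, build and add: %.4fs', build_only, build_and_add)
```

If the two numbers are close, the time is going into loading records and building documents; if they differ widely, the index write is the bottleneck.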

Jens


Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
[email protected] | www.webit.de

Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

On Fri, Jun 08, 2007 at 10:25:07AM -0400, Daniel Einspanjer wrote:

> when it hit 100%, the following lines appeared:
> reindex model CurrentProgram : 99.56% complete : 219.29 secs to finish
> Created Ferret index in:
> ./script/…/config/…/config/…/index/production/current_program
> rebuild index: [[“CurrentProgram”]]
> reindexing model CurrentProgram
> reindex model CurrentProgram : 0.00% complete : 25740.65 secs to finish
> reindex model CurrentProgram : 0.95% complete : 26065.95 secs to finish
>
> So it looks like for some reason, it performed the rebuild twice. :frowning:

damn, that bug seems to come back from time to time, I’ll try to fix
this over the weekend.

> When I looked at it this morning, it had over 116k files in the
> current_program directory. Not the most healthy thing. I ran
> CurrentProgram.aaf_index.ferret_index.optimize and it took a few
> minutes and fully optimized down to three files.

It should optimize the index automatically after re-indexing.

> I made the testing patch suggested and am running now. I did not
> delete the index directory. The ferret_index.log started out with
> these lines:
> rebuild index: [[“CurrentProgram”]]
> reindexing model CurrentProgram
> reindex model CurrentProgram : 0.00% complete : 3540.78 secs to finish
> reindex model CurrentProgram : 0.95% complete : 3510.69 secs to finish
>
> So it is a significantly shorter time when it isn’t actually adding
> the doc to the index.

Yeah, looks like it’s really the indexing that takes the time. Can you
make sure for your testing that nothing else accesses the index while
the rebuild runs (i.e. shut down any running mongrels)?

Or try aaf trunk and the DRb server, which ensures that by design and,
for performance measurements, is the more realistic scenario anyway.

> If you have any further ideas on things to try or any other
> information you’d like to collect, please let me know. In the
> meantime, I’m going to try out the acts_as_solr plugin since I’ve had
> a bit more experience with tuning solr and see what the indexing
> performance on that looks like.

From what I’ve heard it should be on par with aaf when things are
working normally (I guess they don’t for some reason in your case).

btw, what platform do you run on?

Jens



On 6/8/07, Jens K. wrote:

> On Fri, Jun 08, 2007 at 10:25:07AM -0400, Daniel Einspanjer wrote:
>
> damn, that bug seems to come back from time to time, I’ll try to fix
> this over the weekend.

I saw a couple of other threads mentioning something similar to this
so I figured it either wasn’t fixed in the version I was working with
or it might have been a regression.

> > When I looked at it this morning, it had over 116k files in the
> > current_program directory. Not the most healthy thing. I ran
> > CurrentProgram.aaf_index.ferret_index.optimize and it took a few
> > minutes and fully optimized down to three files.
>
> It should optimize the index automatically after re-indexing.

I see in the rebuild_index method where it calls optimize, but it
certainly didn’t seem to fully optimize it at that time. Maybe there
was something specific to the case of a newly created index instead of
opening an existing one?

> Yeah, looks like it’s really the indexing that takes the time. Can you
> make sure for your testing that nothing else accesses the index while
> the rebuild runs (i.e. shut down any running mongrels)?

Since this was a bootstrapping test, I had no processes running other
than the script\console production from which I issued the
rebuild_index command.

> Or try aaf trunk and the DRb server, which ensures that by design and,
> for performance measurements, is the more realistic scenario anyway.

I’m currently planning on running this as a single instance
application because the index will be read only at run time and only
used by one or two people at a time.

> From what I’ve heard it [aas] should be on par with aaf when things are
> working normally (I guess they don’t for some reason in your case).

I’ve heard the same. The only reason I thought to try it out was
because of my prior experience with Solr.

> btw, what platform do you run on?

This is a Windows box connecting to an MSSQL server. (I know… ick. :wink:)
I did some preliminary testing to make sure that the pagination was
working properly since I saw in the list that other people had some
difficulties with it.

Daniel

The bootstrap indexing actually ended up taking twice the amount of
time listed below. When there was no index directory and I made the
call to rebuild_index, the ferret_index.log file had these lines in
it:

Logfile created on Thu Jun 07 08:46:34 -0400 2007 by logger.rb/1.5.2.9

rebuild index: []
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 25658.57 secs to finish

when it hit 100%, the following lines appeared:
reindex model CurrentProgram : 99.56% complete : 219.29 secs to finish
Created Ferret index in:
./script/…/config/…/config/…/index/production/current_program
rebuild index: [[“CurrentProgram”]]
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 25740.65 secs to finish
reindex model CurrentProgram : 0.95% complete : 26065.95 secs to finish

So it looks like for some reason, it performed the rebuild twice. :frowning:
When I looked at it this morning, it had over 116k files in the
current_program directory. Not the most healthy thing. I ran
CurrentProgram.aaf_index.ferret_index.optimize and it took a few
minutes and fully optimized down to three files.
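
For keeping an eye on that file count while a rebuild runs, a small helper along these lines may be handy. This is only a sketch: index_file_count is a hypothetical helper name, and the demonstration uses a throwaway temporary directory with placeholder file names rather than a real index directory.

```ruby
require 'tmpdir'
require 'fileutils'

# Count the regular files directly inside a directory (subdirectories and
# the "." / ".." entries are not counted).
def index_file_count(dir)
  Dir.entries(dir).count { |name| File.file?(File.join(dir, name)) }
end

# Demonstration on a temporary directory with three placeholder files.
Dir.mktmpdir do |dir|
  %w[a.tmp b.tmp c.tmp].each { |name| FileUtils.touch(File.join(dir, name)) }
  puts index_file_count(dir) # prints 3
end
```

Pointed at the index directory before and after the optimize call, it would show the drop from the six-figure count to a handful of files.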

I made the testing patch suggested and am running now. I did not
delete the index directory. The ferret_index.log started out with
these lines:
rebuild index: [[“CurrentProgram”]]
reindexing model CurrentProgram
reindex model CurrentProgram : 0.00% complete : 3540.78 secs to finish
reindex model CurrentProgram : 0.95% complete : 3510.69 secs to finish

So it is a significantly shorter time when it isn’t actually adding
the doc to the index.
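
To put a number on "significantly shorter", the two initial estimates from the log can be compared directly. Note that these are the rebuild's own projected times from its first log line, not measured totals, and the second run reused the existing index directory, so the result is only indicative.

```ruby
WITH_INDEXING_SECS    = 25_740.65 # first estimate with documents added
WITHOUT_INDEXING_SECS = 3_540.78  # first estimate with the add disabled

indexing_secs  = WITH_INDEXING_SECS - WITHOUT_INDEXING_SECS
indexing_share = indexing_secs / WITH_INDEXING_SECS
slowdown       = WITH_INDEXING_SECS / WITHOUT_INDEXING_SECS

puts format('projected time with indexing:    %.2f hours', WITH_INDEXING_SECS / 3600)
puts format('projected time without indexing: %.2f hours', WITHOUT_INDEXING_SECS / 3600)
puts format('share attributable to indexing:  %.1f%%', indexing_share * 100)
puts format('slowdown factor:                 %.2fx', slowdown)
```

On these figures roughly 86% of the projected time goes to adding documents to the index, while loading records and building documents accounts for about an hour.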

If you have any further ideas on things to try or any other
information you’d like to collect, please let me know. In the
meantime, I’m going to try out the acts_as_solr plugin since I’ve had
a bit more experience with tuning solr and see what the indexing
performance on that looks like.

Daniel