Forum: Ferret performance bottleneck

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Ad53a28016efc40a5d9d7008ac1a6f3c?d=identicon&s=25 golak Sarangi (Guest)
on 2007-07-14 13:14
(Received via mailing list)
I have got my database in Mysql. I used ferret to index a table with 10
million rows. On limiting the selection of data to 1000 initial
retrieval,
it takes 200 seconds but for the whole table it took more than four
hours
and after which i had to close my indexing application. I used the
StandardAnalyser for it. There is no problem from the database side as
retrieval of all the data in the table into a document takes very less
time.
So please suggest me ways to overcome it.
B6a4a3523b3053cb138ee3b986128f15?d=identicon&s=25 Andreas Korth (Guest)
on 2007-07-14 16:41
(Received via mailing list)
Hi,

here's what I did to speed up index rebuilds:

1. Database
------------------
Use raw SQL to retrieve objects from the DB with
ModelClass.connection.select_all() instead of Model.find_by_sql().
This will prevent ActiveRecord objects from being instantiated for
every row, which is fairly expensive.

Instead of ActiveRecord objects, select_all() returns hashes with
string keys. You need to access the values with object['name']
instead of object.name. Another consequence is that helper methods
such as

def full_name
   last_name + ' ' + first_name
end

are not available anymore. You have to use SQL for doing this:

"SELECT ..., CONCAT(first_name, ' ', last_name) AS full_name, ...".

Another important point are associations. If you index objects with a
has_many association, the objects themselves are loaded in a batch
statement but the associateded objects are loaded separately.

Assume, for example, you index Articles which have many Comments. If
the Articles are loaded in batches of 1000, there will be another
1000 statements for loading the comments.

With ActiveRecord find() you can use the :include option as a remedy.
With raw SQL you have to use joins and grouping to get all the data
you need in one row. Hint: have a look at the group functions that
your RDBMS supports.

Select only the columns that you need for indexing. Instead of
"SELECT products.*" use "SELECT product.name,
product.description,...". This is especially important if your table
includes BLOB or CLOB columns.

Bottom line: Brush up your SQL skills and do as much work as possible
inside the database. ActiveRecord is definitely not made for batch
processing, especially when dealing with millions of rows.


2. Analyzers
------------------
Use fast analyzers if possible. StandardAnalyzer is fine for most
purposes but it includes HyphenFilter and StopFilter which are slow.
Consider using lightweight LetterAnalyzer or WhiteSpaceAnalyzer or
even no analyzer at all for particular fields. You can set individual
analyzers for different fields using PerFieldAnalyzer.

Don't be reluctant to write your own custom analyzer. I'm using an
analyzer written in pure Ruby based on regular expressions. For its
particular purpose, it's as fast as StandardAnalyzer but yields
better results. Text processing in Ruby is actually pretty fast.


3. Ferret
------------
Tweak the Index parameters for rebuilds. See the Ferret documentation
on what you can do. Here's an example

i = Ferret::Index::Index.new(
    ...,
    :max_buffer_memory => 0x8000000,
    :max_buffered_docs => 0x10000,
    :merge_factor => 0x10,
    :term_index_interval => 0x40,
    :doc_skip_interval => 0x8
)

Experiment with these values, but be careful! Setting them
arbitrarily can cause Ferret to crash. Start with the values above
and try to tweak one at time.

This is the last measure in optimization. Don't bother fiddling with
these parameters unless you get steps 1. and 2. right.

Hope this helps.

Cheers.
Andreas
This topic is locked and can not be replied to.