I have got my database in Mysql. I used ferret to index a table with 10 million rows. On limiting the selection of data to 1000 initial retrieval, it takes 200 seconds but for the whole table it took more than four hours and after which i had to close my indexing application. I used the StandardAnalyser for it. There is no problem from the database side as retrieval of all the data in the table into a document takes very less time. So please suggest me ways to overcome it.
on 2007-07-14 13:14
on 2007-07-14 16:41
Hi, here's what I did to speed up index rebuilds: 1. Database ------------------ Use raw SQL to retrieve objects from the DB with ModelClass.connection.select_all() instead of Model.find_by_sql(). This will prevent ActiveRecord objects from being instantiated for every row, which is fairly expensive. Instead of ActiveRecord objects, select_all() returns hashes with string keys. You need to access the values with object['name'] instead of object.name. Another consequence is that helper methods such as def full_name last_name + ' ' + first_name end are not available anymore. You have to use SQL for doing this: "SELECT ..., CONCAT(first_name, ' ', last_name) AS full_name, ...". Another important point are associations. If you index objects with a has_many association, the objects themselves are loaded in a batch statement but the associateded objects are loaded separately. Assume, for example, you index Articles which have many Comments. If the Articles are loaded in batches of 1000, there will be another 1000 statements for loading the comments. With ActiveRecord find() you can use the :include option as a remedy. With raw SQL you have to use joins and grouping to get all the data you need in one row. Hint: have a look at the group functions that your RDBMS supports. Select only the columns that you need for indexing. Instead of "SELECT products.*" use "SELECT product.name, product.description,...". This is especially important if your table includes BLOB or CLOB columns. Bottom line: Brush up your SQL skills and do as much work as possible inside the database. ActiveRecord is definitely not made for batch processing, especially when dealing with millions of rows. 2. Analyzers ------------------ Use fast analyzers if possible. StandardAnalyzer is fine for most purposes but it includes HyphenFilter and StopFilter which are slow. Consider using lightweight LetterAnalyzer or WhiteSpaceAnalyzer or even no analyzer at all for particular fields. You can set individual analyzers for different fields using PerFieldAnalyzer. Don't be reluctant to write your own custom analyzer. I'm using an analyzer written in pure Ruby based on regular expressions. For its particular purpose, it's as fast as StandardAnalyzer but yields better results. Text processing in Ruby is actually pretty fast. 3. Ferret ------------ Tweak the Index parameters for rebuilds. See the Ferret documentation on what you can do. Here's an example i = Ferret::Index::Index.new( ..., :max_buffer_memory => 0x8000000, :max_buffered_docs => 0x10000, :merge_factor => 0x10, :term_index_interval => 0x40, :doc_skip_interval => 0x8 ) Experiment with these values, but be careful! Setting them arbitrarily can cause Ferret to crash. Start with the values above and try to tweak one at time. This is the last measure in optimization. Don't bother fiddling with these parameters unless you get steps 1. and 2. right. Hope this helps. Cheers. Andreas