Writing to ferret index from multiple processes


#1

Hi,

what do I have to do to be able to write a ferret index from multiple
processes at the same time?

I was indexing a lot of documents with a script when another process
made a change to the index; suddenly all of the imported data was gone
from the index, and the import script quit with the exception
“Errno::ENOENT: No such file or directory - ./ferret_index/_1ah.fnm”.

auto_flush => true didn’t help. Is there something else?

Andreas


#2

Hi Andreas,

Can you show me some more code? How are you creating the index?
Perhaps you are setting :create => true in which case it will
overwrite the old index.
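
If that guess is right, it would also explain the Errno::ENOENT: one process recreates the index directory while the other still expects its segment files to be there. Here is a minimal sketch of that effect using only Ruby's standard library. It does not call Ferret at all; `open_index_dir` and `simulate_overwrite` are hypothetical helpers, and wiping the directory is only an assumption about what "overwrite" amounts to on disk.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for opening an index directory (NOT Ferret's API).
# With create: true the directory is wiped first, mimicking an index that
# is overwritten when it is opened with :create => true.
def open_index_dir(path, create: false)
  FileUtils.rm_rf(path) if create
  FileUtils.mkdir_p(path)
  path
end

def simulate_overwrite
  Dir.mktmpdir do |root|
    index_dir = File.join(root, "ferret_index")

    # "Process A" (the import script) opens the index and writes a segment file.
    open_index_dir(index_dir)
    segment = File.join(index_dir, "_1ah.fnm")
    File.write(segment, "field names")

    # "Process B" opens the same path with create: true and wipes it.
    open_index_dir(index_dir, create: true)

    # "Process A" goes back to a segment it still believes exists.
    begin
      File.read(segment)
      :still_there
    rescue Errno::ENOENT
      :missing
    end
  end
end

puts simulate_overwrite
```

Without `create: true` in the second call the segment file survives, which is why dropping the option is the first thing to try.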

Dave


#3

David B. wrote:

Hi Andreas,

Can you show me some more code? How are you creating the index?
Perhaps you are setting :create => true in which case it will
overwrite the old index.

Dave

Oops. I am indeed using :create => true. I forgot that I set it because
create_if_missing did not work.

I removed it, but now there is a different problem. When I change the
index while the indexing script is running, it quits, but with another
error message:

316
317
318
RuntimeError: docs out of order curent doc = 9 and previous doc = 17
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:276:in `append_postings'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:262:in `append_postings'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:240:in `merge_term_info'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:215:in `merge_term_infos'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:176:in `merge_terms'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/segment_merger.rb:48:in `merge'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:403:in `merge_segments'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:371:in `maybe_merge_segments'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:161:in `add_document'
from /usr/local/lib/ruby/1.8/monitor.rb:229:in `synchronize'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:159:in `add_document'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:270:in `<<'
from /usr/local/lib/ruby/1.8/monitor.rb:229:in `synchronize'
from /usr/local/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `<<'
from ./app/models/search_ferret.rb:38:in `update'
from (irb):1


#4

I’m not too sure about this one. Are you by any chance explicitly
deleting the lock files when your app starts up? I’ve seen a few
people do that. The only way I can see doc numbers getting out of
order is if you delete the lock files. Any chance I could look at more
of your code? Is this for RForum? Perhaps I could check it out of svn.
Anyway, I hope I can help you out with this.

Dave

PS: If you are interested you should join the Ferret mailing list. You
seem to be doing some more advanced stuff judging from the bugs you’re
finding. :wink:


#5

David B. wrote:

I’m not too sure about this one. Are you by any chance explicitly
deleting the lock files when your app starts up?

No.

I’ve seen a few
people do that. The only way I can see doc numbers getting out of
order is if you delete the lock files. Any chance I could look at more
of your code? Is this for RForum? Perhaps I could check it out of svn.

It is for RForum. You can see the code here:
http://rforum.andreas-s.net/trac/file/trunk/app/models/search_ferret.rb

My indexing script simply fetches all the posts from the database and
calls Post.search_handler.update(post) for each one. If another process
calls the update method while this script is running, I am getting the
exception. If you need more information to reproduce the problem, please
let me know.

PS: If you are interested you should join the Ferret mailing list. You
seem to be doing some more advanced stuff judging from the bugs you’re
finding. :wink:

I didn’t know there was a list. I will definitely join it.

Thanks for fixing the other bugs so quickly.

Andreas


#6

David B. wrote:

Hey Andreas,

The latest version of RForum still has :create => true so I’m guessing
you haven’t checked in your latest changes. Could you let me know when
you have?

I have checked it in.


#7

Andreas S. wrote:

David B. wrote:

Hey Andreas,

The latest version of RForum still has :create => true so I’m guessing
you haven’t checked in your latest changes. Could you let me know when
you have?

I have checked it in.

Btw, I tried it again on another machine, and couldn’t reproduce the
“docs out of order” exception, but instead I got
RuntimeError: could not obtain lock:
./ferret_index/ferret-f62496686e637eca67e933a9cdc5eb21write.lock
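
For anyone following along, an error like this is what a writer reports when another writer still holds the write lock once the waiting writer gives up. The sketch below reproduces that behaviour with a plain lock file and `File#flock` from Ruby's standard library. It is not Ferret's locking code; `with_write_lock`, `demo_lock_contention`, the lock file name and the timeout values are all assumptions made for the illustration.

```ruby
require "tmpdir"

# Hypothetical helper (NOT Ferret's implementation): take an exclusive lock
# on a lock file, retrying until `timeout` seconds have passed.
def with_write_lock(lock_path, timeout: 5.0, retry_every: 0.05)
  File.open(lock_path, File::RDWR | File::CREAT) do |f|
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    # flock returns false when LOCK_NB is set and the lock is already held.
    until f.flock(File::LOCK_EX | File::LOCK_NB)
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
        raise "could not obtain lock: #{lock_path}"
      end
      sleep(retry_every)
    end
    begin
      yield
    ensure
      f.flock(File::LOCK_UN)
    end
  end
end

# One holder keeps the lock while a second, independent open of the same
# file (standing in for the other process) tries to take it and gives up.
def demo_lock_contention
  Dir.mktmpdir do |dir|
    lock_path = File.join(dir, "write.lock")
    with_write_lock(lock_path) do
      begin
        with_write_lock(lock_path, timeout: 0.2) { :acquired }
      rescue RuntimeError => e
        e.message.start_with?("could not obtain lock") ? :timed_out : :other
      end
    end
  end
end

puts demo_lock_contention
```

The second caller only succeeds if the first one releases the lock before the timeout runs out, so a long-running writer starves everyone else.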


#8

Hey Andreas,

The latest version of RForum still has :create => true so I’m guessing
you haven’t checked in your latest changes. Could you let me know when
you have?

Cheers,
Dave


#9

Hi Andreas,

This is what I would expect to happen. What machine were you running
it on the first time? Whatever it was, Ferret’s locking mechanism must
not be working there.

Anyway, to avoid this problem you need to make sure the batch process
doesn’t keep the lock for too long (about 5 seconds). I would change
the rebuild index method to use an IndexWriter or switch auto_flush to
false. This should speed the reindexing up. I’d also add a pause in
there so other processes can get a hold of the lock if they need to.
Since you are flushing explicitly you may as well set auto_flush to
false anyway.

def index
  @index ||= Index::Index.new(:path => @path,
                              # :auto_flush => true  <= don't use this anymore
                              :default_search_field => ['subject'],
                              :key => ['id', 'class'])
end

# update will continue to work, handling the flushing explicitly
def update(post)
  index << create_doc(post)
  index.flush
end

# batch_update will keep the IndexWriter open between updates,
# so it will run much faster
def batch_update(post)
  index << create_doc(post)
end

# define a flush method for use with the batch_update method
def flush
  index.flush
end

Then in your process that is doing the reindex I’d use the
batch_update method and I might even add some pauses in there.
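Something like this:

MAX_ADDS_BEFORE_FLUSH = 10

def rebuild_index
  i = 0
  Post.find_all_by_deleted(0).each do |post|
    self.batch_update(post)
    i += 1
    if (i % MAX_ADDS_BEFORE_FLUSH) == 0
      self.flush
      sleep(0.5)  # give other processes a chance to take the lock
    end
  end
  self.flush  # flush whatever is left from the last partial batch
end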
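
The flush cadence of a loop like this can be checked without Ferret installed. The following is a rough, self-contained sketch in which `RecordingIndex` is a made-up stand-in that only counts calls, and `rebuild_in_batches` is a hypothetical helper rather than anything in RForum or Ferret. It uses keyword arguments, so it assumes a current Ruby rather than the 1.8 from the stack trace.

```ruby
# Stand-in for an index object: it only records what was added and how
# often flush was called. This is NOT Ferret's Index class.
class RecordingIndex
  attr_reader :docs, :flushes

  def initialize
    @docs = []
    @flushes = 0
  end

  def <<(doc)
    @docs << doc
    self
  end

  def flush
    @flushes += 1
  end
end

# Add docs in batches: flush (and optionally pause) after every full batch,
# then flush once more if a partial batch is left over.
def rebuild_in_batches(index, docs, batch_size: 10, pause: 0.5)
  pending = 0
  docs.each do |doc|
    index << doc
    pending += 1
    next if pending < batch_size

    index.flush
    pending = 0
    sleep(pause) if pause > 0  # window for other writers to take the lock
  end
  index.flush if pending > 0
  index
end

index = rebuild_in_batches(RecordingIndex.new, (1..25).to_a, pause: 0)
puts "#{index.docs.size} docs, #{index.flushes} flushes"  # 25 docs, 3 flushes
```

A larger batch size means fewer flushes and a faster rebuild, but each batch then holds the lock longer, so the batch size and the pause need tuning together.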

These are just ideas. You’ll probably come up with something better. I
think the best solution is just to keep the Ferret index in sync with
the database so that you don’t need to reindex everything.

Let me know what kind of system you were running it on the first time
to get the documents out of order error. I’ll see if I can find out
why the locking wasn’t working.

Cheers,
Dave