Ferret progress update

david · February 22, 2007, 6:15am

Hi folks,

Just thought I’d better let you all know that I’m still working on the
next release of Ferret. I’ve spent the last 7 days doing nothing but
Ferret development. The last iteration generated a diff of almost 5000
lines, so there are some pretty major changes. Most people won’t notice
these changes, however, as the API remains unchanged. But if you were
having problems with FileNotFound errors or segmentation faults, the
next version should fix most of them.

I’m now going to go through the mailing list and the Trac bug reports
to fix any other small problems lying around before I release the
next version. Coming soon…

david · February 22, 2007, 10:29am

On Thu, Feb 22, 2007 at 04:05:05PM +1100, David B. wrote:

Hi folks,

Just thought I’d better let you all know that I’m still working on the
next release of Ferret. I’ve spent the last 7 days doing nothing but
Ferret development. The last iteration generated a diff of almost 5000
lines, so there are some pretty major changes. Most people won’t notice
these changes, however, as the API remains unchanged. But if you were
having problems with FileNotFound errors or segmentation faults, the
next version should fix most of them.

You rock

cheers,
Jens


david · February 22, 2007, 1:01pm

Thanks Dave! Looking forward to it.

Can you tell us a bit more about what led to the segfault errors
cropping up? Have they been in the 0.10 branch all along? 0.9 too? Or
did some new work break something?

Maybe it will help others debug problems in future.

John.

http://johnleach.co.uk

david · February 23, 2007, 12:58pm

Hi Dave,

interesting stuff. Apparently you can tell the GC not to mess with your
stuff using rb_gc_register_address (and rb_gc_unregister_address when/if
you’re done).

Looking at gc.c, all it does is add the pointer to the GC’s list of
things that are being used, so it won’t free it.

An example from the Ruby source (showing it being used before object
creation), ext/iconv/iconv.c:

rb_gc_register_address(&charset_map);
charset_map = rb_hash_new();
rb_define_singleton_method(rb_cIconv, "charset_map", charset_map_get, 0);

I guess you can register before filling the array, set the length, then
unregister. Not sure if this actually locks all the values in the array
though. If not, perhaps you could override the mark function for the
array and restore it afterwards, heh. Perhaps not worth the fiddling.
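
To make the "list of things that are being used" idea concrete, here is a toy model in plain C. It is only a sketch of the mechanism described above, not Ruby's gc.c: the names toy_register_address, toy_unregister_address and toy_mark_roots, the Value type and the fixed-size tables are all made up for illustration. The point it shows is that the collector stores the address and reads it at mark time, which is why registering before the object is created works.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types: a Value is just an object id, 0 means "nothing yet". */
typedef int Value;
enum { MAX_ROOTS = 8, MAX_OBJS = 8 };

static Value *roots[MAX_ROOTS];  /* addresses the toy collector must keep alive */
static int    root_count;

/* Add an address to the root list (models the idea of rb_gc_register_address). */
static void toy_register_address(Value *addr)
{
    assert(root_count < MAX_ROOTS);
    roots[root_count++] = addr;
}

/* Remove an address from the root list again. */
static void toy_unregister_address(Value *addr)
{
    int i;
    for (i = 0; i < root_count; i++) {
        if (roots[i] == addr) {
            roots[i] = roots[--root_count];  /* swap in the last entry */
            return;
        }
    }
}

/* Mark phase: dereference every registered address at mark time, so a
   value assigned after registration is still seen. */
static void toy_mark_roots(int marked[MAX_OBJS])
{
    int i;
    for (i = 0; i < MAX_OBJS; i++) marked[i] = 0;
    for (i = 0; i < root_count; i++) {
        Value v = *roots[i];
        if (v > 0 && v < MAX_OBJS) marked[v] = 1;
    }
}
```

Note that this only models protecting the registered object itself. Whether the things it contains are reached still depends on how that object gets marked, which is the open question about array elements raised above.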

I’m no Ruby extension expert though, so beware.

John.

http://johnleach.co.uk


david · February 23, 2007, 9:46am

On 2/22/07, John L. wrote:

Thanks Dave! Looking forward to it.

Can you tell us a bit more about what led to the segfault errors
cropping up? Have they been in the 0.10 branch all along? 0.9 too? Or
did some new work break something?

Maybe it will help others debug problems in future.

Well, the main problem I fixed was due to an error introduced in 0.10.
I wasn’t locking the commit log in all the places I should have. This
actually would have been very easy to fix if someone had supplied a
repeatable test case. In the end though I decided to implement
lock-less commits, a new feature that has recently been added to
Lucene. The main advantages of this are that you can open IndexReaders
while an IndexWriter is committing and you can open multiple
IndexReaders at a time without them interrupting each other. It also
makes it much easier to recover after a crash. If your system crashes
in the middle of a commit then Ferret will be able to open the
previously committed version of the index.
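
A rough way to picture why this helps with crashes is a write-once, generation-numbered commit record. The sketch below is an in-memory toy under that assumption, not Ferret's actual on-disk format; ToyDirectory, Commit, begin_commit, finish_commit and open_reader are hypothetical names. A reader never takes a lock: it just opens the newest commit that was completely written.

```c
#include <assert.h>
#include <string.h>

enum { MAX_GENS = 8 };

/* Hypothetical commit record: one per generation, never rewritten. */
typedef struct {
    int  complete;      /* 1 once the record has been fully written */
    char segments[64];  /* stand-in for the list of segment names */
} Commit;

typedef struct {
    Commit gens[MAX_GENS];  /* gens[g] models a write-once commit file */
    int    count;
} ToyDirectory;

/* Writer: start a new generation. Older generations are left untouched. */
static int begin_commit(ToyDirectory *d, const char *segments)
{
    Commit *c;
    assert(d->count < MAX_GENS);
    c = &d->gens[d->count];
    c->complete = 0;
    strncpy(c->segments, segments, sizeof c->segments - 1);
    c->segments[sizeof c->segments - 1] = '\0';
    return d->count++;
}

/* Writer: the commit becomes visible only at this point. */
static void finish_commit(ToyDirectory *d, int gen)
{
    d->gens[gen].complete = 1;
}

/* Reader: open the newest complete generation, or NULL if there is none.
   An unfinished commit (in progress or interrupted by a crash) is skipped. */
static const Commit *open_reader(const ToyDirectory *d)
{
    int g;
    for (g = d->count - 1; g >= 0; g--) {
        if (d->gens[g].complete) return &d->gens[g];
    }
    return NULL;
}
```

If begin_commit has run for a new generation but finish_commit has not, open_reader still returns the previous generation, which is the "open the previously committed version" behaviour described above.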

As for the segfaults, I think I finally found the problem today. To
improve the performance of Ferret’s bindings I was adding objects to
Ruby’s Array directly instead of using the rb_ary_push method. Some of
these arrays are quite large so using rb_ary_push was a lot of
overhead which I didn’t think was really necessary … but I didn’t
quite get it right. For example, I had:

rterms = rb_ary_new2(term_cnt);
rts = RARRAY(rterms)->ptr;
RARRAY(rterms)->len = term_cnt;
for (i = 0; i < term_cnt; i++) {
    rts[i] = frt_get_tv_term(&terms[i]);
}

So, in this example, the number of terms in a field can be very large
and we save a lot of time[1] by setting the C array directly rather
than using rb_ary_push. The problem occurs when the garbage collector
gets called in the middle of filling the array, which can happen
whenever a new Ruby object is allocated inside the loop. It will try
to mark all of the objects contained by the array, but the array isn’t
filled yet, so many of its elements still hold uninitialized memory.
What I should have done was increment the array length as I went.

rterms = rb_ary_new2(term_cnt);
rts = RARRAY(rterms)->ptr;
for (i = 0; i < term_cnt; i++) {
    rts[i] = frt_get_tv_term(&terms[i]);
    RARRAY(rterms)->len++;
}

This is a touch slower than the original code but it now works, so
that’s all that matters. You may be thinking I could have just set the
length after the loop.

rterms = rb_ary_new2(term_cnt);
rts = RARRAY(rterms)->ptr;
for (i = 0; i < term_cnt; i++) {
    rts[i] = frt_get_tv_term(&terms[i]);
}
RARRAY(rterms)->len = term_cnt;

But the problem here is that the elements that have been added to the
array won’t actually get marked by the garbage collector because the
array’s length is still 0, so they could incorrectly be collected, thus
also causing a segfault. One alternative method that will work would be
to use rb_mem_clear():

rterms = rb_ary_new2(term_cnt);
rts = RARRAY(rterms)->ptr;
rb_mem_clear(rts, term_cnt);  // initialize all elements to nil
RARRAY(rterms)->len = term_cnt;
for (i = 0; i < term_cnt; i++) {
    rts[i] = frt_get_tv_term(&terms[i]);
}

This makes sure all elements are set to nil before they are set to the
term vector, so they are therefore safe from the garbage collector.
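
The four variants above can be compared with a small stand-alone simulation. This is a toy model in plain C, not Ruby's collector: POISON stands in for uninitialized slots, object ids stand in for VALUEs, and simulate, Strategy and GcReport are names invented for this sketch. It assumes the collector interrupts after gc_at elements have been stored and that marking follows every slot below the array's current length.

```c
#include <assert.h>

/* Toy slot contents: NIL is safe to mark, POISON models uninitialized memory. */
enum { NIL = -1, POISON = -2, MAX_SLOTS = 32 };

typedef enum { LEN_FIRST, LEN_LAST, LEN_INCREMENT, NIL_PREFILL } Strategy;

typedef struct {
    int bad_slots;  /* uninitialized slots the mark phase tried to follow */
    int unmarked;   /* stored elements the mark phase never reached */
} GcReport;

/* Fill gc_at of n slots using strategy s, then run one toy mark phase. */
static GcReport simulate(Strategy s, int n, int gc_at)
{
    int slots[MAX_SLOTS];
    int marked[MAX_SLOTS] = {0};
    int len, i;
    GcReport r = {0, 0};

    assert(n >= 0 && n <= MAX_SLOTS && gc_at >= 0 && gc_at <= n);

    /* Fresh array: capacity n, contents undefined unless cleared to nil. */
    for (i = 0; i < n; i++) slots[i] = (s == NIL_PREFILL) ? NIL : POISON;
    len = (s == LEN_FIRST || s == NIL_PREFILL) ? n : 0;

    /* Store elements 0..gc_at-1, then the collector interrupts. */
    for (i = 0; i < gc_at; i++) {
        slots[i] = i;  /* object id i */
        if (s == LEN_INCREMENT) len++;
    }

    /* Mark phase: follow every slot below the current length. */
    for (i = 0; i < len; i++) {
        if (slots[i] == POISON) r.bad_slots++;
        else if (slots[i] != NIL) marked[slots[i]] = 1;
    }

    /* Stored but unmarked elements are the ones that could be swept. */
    for (i = 0; i < gc_at; i++) {
        if (!marked[i]) r.unmarked++;
    }
    return r;
}
```

With n = 8 and gc_at = 3, LEN_FIRST reports 5 bad slots, LEN_LAST reports 3 unmarked elements, and both LEN_INCREMENT and NIL_PREFILL report zero for each. In a real extension an unmarked element is only at risk rather than certain to be collected, since other references may still keep it alive.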

Anyway, sorry for the long and boring post. I guess the point is to
think about how the garbage collector works when developing Ruby
bindings.

Cheers,
Dave

[1] How much faster? About 20% faster according to a simple benchmark
I just ran. Was it worth the segfaults? Of course not, but in a library
like this you take the optimizations where you can get them.