Term vector blues

pandemic · February 17, 2007, 12:01am

I get a lot of crashes when I try to use term vectors. Here’s an
example, which crashes pretty consistently. The problem seems to be
somewhat sensitive to platform: people on other OSes and Ruby versions
have reported no error. I have seen this with Ferret 0.10.13 and 0.10.14
on Debian stable using Ruby 1.8.2, but I have observed the same problem
on various other systems as well. I’ve reported this issue here before,
but that was while David was away.

program:

require 'rubygems'
require 'ferret'
#require 'zlib'

fields=Ferret::Index::FieldInfos.new
fields.add_field :text, :store=>:no#, :index=>:omit_norms
i = Ferret::I.new :field_infos=>fields #:path=>'temp_index'

20.times{
  i << {:text=>`man gcc`[0...135000]}
}
#i.close_writer
r=i.reader
#r.term_docs_for(:text, "example")

r.term_vector(0,:text)

example output:

$ ruby tvtest.rb
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
Reformatting gcc(1), please wait…
tvtest.rb:16: [BUG] Segmentation fault
ruby 1.8.2 (2005-04-11) [i386-linux]

Aborted

pandemic · February 19, 2007, 9:58am

On Fri, Feb 16, 2007 at 02:52:28PM -0800, Caleb C. wrote:

require 'rubygems'
}
#i.close_writer
r=i.reader
#r.term_docs_for(:text, "example")

r.term_vector(0,:text)

[…]

tvtest.rb:16: [BUG] Segmentation fault
ruby 1.8.2 (2005-04-11) [i386-linux]

Aborted

same here with Ubuntu 6.10 / Ruby 1.8.4.

Jens

–
Jens Krämer

pandemic · February 19, 2007, 10:47am

tvtest.rb:16: [BUG] Segmentation fault
ruby 1.8.2 (2005-04-11) [i386-linux]

Aborted

same here with Ubuntu 6.10 / Ruby 1.8.4.

No problem on Mac OS X 10.4, Ruby 1.8.5, Ferret 0.10.14.

pandemic · February 19, 2007, 10:04pm

Benjamin K. wrote:

tvtest.rb:16: [BUG] Segmentation fault
ruby 1.8.2 (2005-04-11) [i386-linux]

Aborted

same here with Ubuntu 6.10 / Ruby 1.8.4.

Same on cygwin / Ruby 1.8.5, BUT if I turn off garbage collection
(GC.disable) it doesn’t crash.
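
For anyone who wants to try the GC.disable observation without turning garbage collection off for the whole process, here is a minimal sketch. The helper name without_gc is hypothetical (it is not part of Ferret), and the block body is a stand-in for the crashing call. This only narrows down where the problem shows up; it does not fix the underlying bug, and memory will grow while GC is off.

```ruby
# Hypothetical helper (not part of Ferret): run a block with Ruby's GC
# switched off, then restore the previous GC state.
def without_gc
  was_disabled = GC.disable  # returns true if GC was already disabled
  yield
ensure
  # Only re-enable GC if this call was the one that disabled it.
  GC.enable unless was_disabled
end

# Stand-in for the crashing call. In the test script from the first post
# this would be something like:
#   without_gc { r.term_vector(0, :text) }
result = without_gc { [1, 2, 3].map { |n| n * 2 } }
p result
```

The ensure clause means GC is switched back on even if the block raises, so a failed call does not leave the rest of the program running without garbage collection.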

I think this is related to:

http://rubyforge.org/pipermail/ferret-talk/2007-February/002504.html

and others… which David said he is working on.

The following script always seems to die at the same point on my machine
and may provide some extra insight.

require 'rubygems'
require 'ferret'

fields = Ferret::Index::FieldInfos.new
fields.add_field :text, :store => :no

#GC.disable

s = `man gcc`

ix = 0
s.scan(/./m) do |c|
  puts "#{ix}: #{c}"
  i = Ferret::I.new :field_infos => fields
  i << {:text => s[0...ix+=1]}
  tv = i.reader.term_vector(0, :text)
end

Dies on character 357 on my machine…
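
Since a segfault kills the whole interpreter, one way to automate this kind of "which length first fails" search is to run each trial in a child process and look at its exit status. The following is only a sketch under assumptions: survives? and first_failing_length are hypothetical names, the child script here is a stand-in that simply aborts at a chosen length (it does not load Ferret), and it uses a newer Ruby than the 1.8.x versions discussed in this thread.

```ruby
require 'rbconfig'

# Run `script` in a fresh Ruby process, passing `n` as its only argument.
# Returns true only if the child exited with status 0; a non-zero exit,
# a crash, or a failure to start all count as "did not survive".
def survives?(script, n)
  system(RbConfig.ruby, '-e', script, n.to_s,
         out: File::NULL, err: File::NULL) == true
end

# Linear scan, like the script above: the failure need not be monotonic
# in the input length, so a binary search could skip over it.
def first_failing_length(script, max)
  (1..max).find { |n| !survives?(script, n) }
end

# Stand-in child: pretends to crash once the length reaches 7. A real
# child would index the first ARGV[0] characters and call term_vector.
STAND_IN = 'abort if ARGV[0].to_i >= 7'

p first_failing_length(STAND_IN, 20)
```

The driver process stays alive across crashes, so it can report the first failing length instead of dying along with the test.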

Cheers!
Patrick

pandemic · September 25, 2007, 11:07pm

On 2/25/07, Caleb C. wrote:

I’m afraid this is one of those really nasty C pointer bugs that is
difficult to solve because the problem is very far away from the code
that actually fails. Perhaps this is a situation that calls for
valgrind, or something like that? Some kind of library or tool for
debugging memory allocation problems could help prove that the code
really does operate the way the programmers think it should. I’ve
played with using valgrind to debug Ferret before; I think it was in
connection with another bug. I couldn’t get very far, in part because
Ruby pre-allocates pools of objects which it manages itself. That
behavior would have to be disabled to permit valgrind to do its magic.

I use valgrind all the time. It is a godsend. Unfortunately it doesn’t
work too well with Ruby, as Ruby’s garbage collector raises so many
errors. I plan to make Ferret 2.0 scriptable so that more people can
get involved with Ferret development, and I would have liked to use
Ruby except for this problem with valgrind. Lua looks like a
better alternative and is much lighter weight. Anyway, it is a long way
in the future so probably not worth mentioning.

Sticking lots of assertions into all the code that gets executed by the
failing test script is another thing to try, I guess. That’s really more
of a shotgun approach… I’m sorry I can’t offer anything but these
really generic suggestions.

This is the method I end up having to use when debugging Ferret’s Ruby
bindings. Anyway, I probably should have posted it here, but I did end
up solving the problem:

http://www.ruby-forum.com/topic/98709#210591

So term-vectors will be fine in the next release without introducing
any memory leaks.

Cheers,
Dave

pandemic · September 25, 2007, 11:04pm

Dave Balmain wrote:

After reading your first email I found the same behavior as you
describe here. This is very frustrating because in this case I create
completely independent Ruby objects. They don’t reference the Ferret
data space at all so this was the last place I expected to have
garbage collection problems. It makes no sense to me at all that not

I’m so glad to hear it’s not just me!

freeing the offsets and positions arrays should make any difference at
all. If you have any more ideas with regard to this problem I’d love
to hear them as it has me a little stumped.

I’ve long thought this problem is just a pointer gone rampaging, not
really a garbage collector issue as such; that’s just where it
shows up… The fact that disabling memory de-allocation code causes
it to go into remission suggested to me that it was in fact a memory
management issue. But it sounds like we both think the code that manages
those variables is correct. (It’s not code that I’m familiar with, so
it’s good that we concur on this point.)

A workaround, even one that causes a memory leak, is important progress
on this issue, as far as I’m concerned. It sounds like you came up with
an even better workaround; I’ll have to try it.

I’m afraid this is one of those really nasty C pointer bugs that is
difficult to solve because the problem is very far away from the code
that actually fails. Perhaps this is a situation that calls for
valgrind, or something like that? Some kind of library or tool for
debugging memory allocation problems could help prove that the code
really does operate the way the programmers think it should. I’ve
played with using valgrind to debug Ferret before; I think it was in
connection with another bug. I couldn’t get very far, in part because
Ruby pre-allocates pools of objects which it manages itself. That
behavior would have to be disabled to permit valgrind to do its magic.

Sticking lots of assertions into all the code that gets executed by the
failing test script is another thing to try, I guess. That’s really more
of a shotgun approach… I’m sorry I can’t offer anything but these
really generic suggestions.