Memory leak in index build?

I have a script (below) which attempts to make an index out of all the
man pages on my system. It takes a while, mostly because it runs man
over and over… but anyway, as time goes on the memory usage goes up
and up and never comes down. Eventually it runs out of RAM and starts
thrashing swap, pretty much grinding to a halt.

The workaround would seem to be to index documents in batches in the
background, shutting down the index process every so often to recover
its memory. I’m about to try that, because I’m really hunting a
different bug… however, the memory problem concerns me.
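For what it's worth, that workaround can be sketched like so. The method name and batch size are mine, and the writer is created by a caller-supplied block, so the loop itself doesn't depend on Ferret:

```ruby
# Index documents in fixed-size batches, closing the writer between
# batches so its buffers can be reclaimed before the next one starts.
def index_in_batches(docs, batch_size = 500)
  docs.each_slice(batch_size) do |batch|
    writer = yield                     # e.g. Ferret::Index::IndexWriter.new(:path => dir)
    batch.each { |doc| writer << doc }
    writer.close                       # free this batch's memory
  end
end
```

With Ferret, the block would build an IndexWriter, taking care to pass :create => true only for the first batch so later batches append to the index instead of clobbering it.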

require 'rubygems'
require 'ferret'
require 'set'

dir = "temp_index"

if ARGV.first == "-p"
  ARGV.shift
  prefix = ARGV.shift
end

fi = Ferret::Index::FieldInfos.new
fi.add_field :name,
  :index => :yes, :store => :yes, :term_vector => :with_positions

%w[data field1 field2 field3].each{|fieldname|
  fi.add_field fieldname.to_sym,
    :index => :yes, :store => :no, :term_vector => :with_positions
}

i = Ferret::Index::IndexWriter.new(:path => dir, :create => true,
                                   :field_infos => fi)

list = Dir["/usr/share/man/**/#{prefix}*.gz"]
numpages = (ARGV.last || list.size).to_i

list[0...numpages].each{|manfile|
  all, name, section = /\A(.*)\.([^.]+)\Z/.match(File.basename(manfile,
    ".gz")).to_a
  tttt = `man #{section} #{name}`.gsub(/.[\b]/m, '')

  i << {
    :data => tttt.to_s,
    :name => name,
    :field1 => name,
    :field2 => name,
    :field3 => name,
  }
}

i.close

i = Ferret::Index::IndexReader.new dir

i.max_doc.times{|n|
  i.term_vector(n, :data).terms.
    inject(0){|sum,tvt| sum + tvt.positions.size } > 1_000_000 and
    puts "heinous term count for #{i[n][:name]}"
}

seenterms = Set[]
begin
  i.terms(:data).each{|term,df|
    seenterms.include? term and next
    i.term_docs_for(:data, term)
    seenterms << term
  }
rescue Exception
  raise
end
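Incidentally, the gsub in the script above is there to strip man's overstrike formatting, where bold is rendered as character, backspace, same character again. Assuming that's the intent, the step looks like this in isolation (helper name mine). Note the backspace must be written [\b] inside the regexp; a bare \b outside a character class is a word boundary:

```ruby
# man renders bold as "b\bb" (character, backspace, character) and
# underline as "_\bc". Deleting "any character followed by a backspace"
# recovers the plain text.
def strip_overstrike(text)
  text.gsub(/.[\b]/m, '')
end

strip_overstrike("b\bbo\bol\bld\bd")  # => "bold"
```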

On 3/13/07, Caleb C. [email protected] wrote:

1.8.5, I think) didn’t seem to have the same memory leak.

What version do you run, by the way?

I’m on 1.8.5.

Incidentally, I’m not sure what the other bug you are chasing is but
it may have something to do with the encoding of the man pages. I

I know the man output is some encoding I don’t understand; I’m just
trying to generate a lot of data to feed into ferret. I don’t care if
it’s correct. I’m still having quite a few crashes with ferret, though
the situation has improved. I’m trying to reproduce those without
handing you my entire codebase. So far, without success. :(

Let me know when you do find the problem. It is possible that it has
something to do with a mismatch of encodings. Feeding ISO-8859-1 data
(which is what my man pages are encoded in) to a UTF-8 analyzer might
cause Ferret to crash. I’ve tried to fix this so that it doesn’t
happen but I might have missed something.

don’t think they are UTF-8 so if your locale is set to UTF-8 it will
cause some problems in the analysis.

I know I’m not on the UTF-8 locale. Actually, I’ve been trying to figure
out how to set my locale to UTF-8. I don’t suppose you’d know? I’m using
Debian stable.

It’s not too hard. Something like;

$ sudo apt-get install debconf
$ sudo dpkg-reconfigure locales

Cheers,
Dave

On 3/10/07, Caleb C. [email protected] wrote:

I have a script (below) which attempts to make an index out of all the
man pages on my system. It takes a while, mostly because it runs man
over and over… but anyway, as time goes on the memory usage goes up
and up and never comes down. Eventually it runs out of RAM and starts
thrashing swap, pretty much grinding to a halt.

Hey Caleb,

Running your test for 15 minutes my memory usage climbed to 30Mb. It
was still slowly climbing which is not a good sign but not enough to
bring my system to a halt. Anyway, I tried using valgrind’s memcheck
on it and I couldn’t find a leak in the Ferret code. Perhaps it is a
leak in your version of Ruby, although I doubt it. Here is the most
significant output from valgrind with --show-reachable=yes set;

==7636== 110,880 bytes in 6,930 blocks are still reachable in loss
record 15 of 20
==7636== at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636== by 0x40C175F: st_insert (st.c:288)
==7636== by 0x40D1E55: rb_ivar_set (variable.c:1056)
==7636== by 0x40D1FC2: rb_iv_set (variable.c:1959)
==7636== by 0x40D2003: rb_name_class (variable.c:282)
==7636== by 0x408BCBB: boot_defclass (object.c:2462)
==7636== by 0x408D020: Init_Object (object.c:2549)
==7636== by 0x40798A0: rb_call_inits (inits.c:54)
==7636== by 0x4061E5C: ruby_init (eval.c:1382)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 187,248 bytes in 11,703 blocks are still reachable in loss
record 16 of 20
==7636== at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636== by 0x40C184F: st_init_table_with_size (st.c:154)
==7636== by 0x40C18B6: st_init_strtable_with_size (st.c:193)
==7636== by 0x4095FBD: Init_sym (parse.y:5885)
==7636== by 0x4079896: rb_call_inits (inits.c:52)
==7636== by 0x4061E5C: ruby_init (eval.c:1382)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 514,228 bytes in 11,687 blocks are still reachable in loss
record 17 of 20
==7636== at 0x401F6D5: calloc (vg_replace_malloc.c:279)
==7636== by 0x40C1870: st_init_table_with_size (st.c:158)
==7636== by 0x40C1914: st_init_table (st.c:167)
==7636== by 0x40C196F: st_init_numtable (st.c:173)
==7636== by 0x40CFEB6: Init_var_tables (variable.c:28)
==7636== by 0x407989B: rb_call_inits (inits.c:53)
==7636== by 0x4061E5C: ruby_init (eval.c:1382)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 965,584 bytes in 60,349 blocks are still reachable in loss
record 18 of 20
==7636== at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636== by 0x40C1692: st_add_direct (st.c:307)
==7636== by 0x4095D1A: rb_intern (parse.y:6067)
==7636== by 0x40CFED7: Init_var_tables (variable.c:30)
==7636== by 0x407989B: rb_call_inits (inits.c:53)
==7636== by 0x4061E5C: ruby_init (eval.c:1382)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 1,088,800 bytes in 50,609 blocks are still reachable in loss
record 19 of 20
==7636== at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636== by 0x4074E50: ruby_xmalloc (gc.c:121)
==7636== by 0x40CF72F: ruby_strdup (util.c:634)
==7636== by 0x4095CFF: rb_intern (parse.y:6066)
==7636== by 0x40CFED7: Init_var_tables (variable.c:30)
==7636== by 0x407989B: rb_call_inits (inits.c:53)
==7636== by 0x4061E5C: ruby_init (eval.c:1382)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)
==7636==
==7636==
==7636== 2,374,520 bytes in 4 blocks are still reachable in loss record
20 of 20
==7636== at 0x4020396: malloc (vg_replace_malloc.c:149)
==7636== by 0x40737F9: add_heap (gc.c:351)
==7636== by 0x4061D74: ruby_init (eval.c:1372)
==7636== by 0x8048600: main (in /usr/bin/ruby1.8)

As you can see, none of this has anything to do with Ferret. If you
haven't used valgrind before and you want to try it yourself, here is
how:

valgrind --leak-check=yes ruby calebs_test.rb 2> res

You’ll probably want to capture the output (like I have here) as it is
very long for ruby scripts. Lots of warnings from the ruby
internals. Let me know if you try this and you find anything unusual.

Incidentally, I’m not sure what the other bug you are chasing is but
it may have something to do with the encoding of the man pages. I
don’t think they are UTF-8 so if your locale is set to UTF-8 it will
cause some problems in the analysis.

Cheers,
Dave

On 3/13/07, Jonathan W. [email protected] wrote:

I think that a lot of people have been bitten by this and an explicit
configuration option would IMHO make a lot of sense. With acts_as_ferret
it might look like this:

class A < ActiveRecord::Base
  acts_as_ferret :encoding => 'utf8'
end

The problem is that this may give people the false impression that
Ferret will handle UTF-8 even when they don’t have a UTF-8 locale
installed. For example, adding this configuration option wouldn’t have
helped Caleb.

I guess one possibility would be to raise an exception if the locale
isn’t available. You could also automatically convert all text to
UTF-8 using iconv. I don’t know how much this would help but I would
certainly commit a patch along these lines if anyone is up for it.
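A minimal sketch of that conversion (the helper name is mine; Ruby 1.8 would use Iconv.conv('UTF-8', 'ISO-8859-1', text), while current Rubies spell it String#encode):

```ruby
# Convert ISO-8859-1 (Latin-1) text to UTF-8 before handing it to a
# UTF-8 analyzer. The :replace options keep a stray byte from raising
# an exception mid-index.
def to_utf8(text, from = 'ISO-8859-1')
  text.encode('UTF-8', from, :invalid => :replace, :undef => :replace)
end

to_utf8("caf\xE9")  # => "café" (0xE9 is e-acute in Latin-1)
```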

Cheers,
Dave

It’s not too hard. Something like;

$ sudo apt-get install debconf
$ sudo dpkg-reconfigure locales

On the notion of the locale stuff, would it be possible to create a
configuration option that explicitly sets Ferret to UTF-8 mode?

I think that a lot of people have been bitten by this and an explicit
configuration option would IMHO make a lot of sense. With acts_as_ferret
it might look like this:

class A < ActiveRecord::Base
  acts_as_ferret :encoding => 'utf8'
end

Cheers,
Dave

Regards,
Jonathan

Dave Balmain said:

Running your test for 15 minutes my memory usage climbed to 30Mb. It
was still slowly climbing which is not a good sign but not enough to
bring my system to a halt. Anyway, I tried using valgrind’s memcheck
on it and I couldn’t find a leak in the Ferret code. Perhaps it is a
leak in your version of Ruby, although I doubt it. Here is the most
significant output from valgrind with --show-reachable=yes set;

Ok, so my ruby is version 1.8.2, kinda old, so maybe there is an old bug
in it. Recent experiments on another machine (running a newer ruby,
1.8.5, I think) didn’t seem to have the same memory leak.

What version do you run, by the way?

Incidentally, I’m not sure what the other bug you are chasing is but
it may have something to do with the encoding of the man pages. I

I know the man output is some encoding I don’t understand; I’m just
trying to generate a lot of data to feed into ferret. I don’t care if
it’s correct. I’m still having quite a few crashes with ferret, though
the situation has improved. I’m trying to reproduce those without
handing you my entire codebase. So far, without success. :(

don’t think they are UTF-8 so if your locale is set to UTF-8 it will
cause some problems in the analysis.

I know I’m not on the UTF-8 locale. Actually, I’ve been trying to figure
out how to set my locale to UTF-8. I don’t suppose you’d know? I’m using
Debian stable.