I have a script that aggregates data from multiple files, stores it all
in a hash, and then emits a summary on standard output. The input files
(text files) are fairly big: 4 of about 50 MB and 4 of about 350 MB.
The hash grows to about 500,000 keys. The memory footprint of the
Ruby process as reported by top is above 2 GB.
When the script starts, it processes the files at a speed of 10 KB/s or
so. Not lightning fast, but it will get the job done. As time goes on,
the speed drops to 100 bytes/s or less, while still taking 100%
CPU time. Unbearable. The machine it is running on is pretty good:
4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drives.
Does the performance of Ruby collapse past a certain memory usage,
e.g. with the GC kicking in all the time?
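One way to test that hypothesis on a bounded subset of the data: time the same work with and without the GC. The benchmark below is a stand-in workload of my own (a hash of arrays, vaguely like the script), not the real loop:

```ruby
require 'benchmark'

# Stand-in workload: build a hash of arrays from text lines.
data = Array.new(50_000) { |i| "key#{i % 5_000} field#{i}\n" }

build = lambda do
  h = {}
  data.each { |l| parts = l.split; (h[parts[0]] ||= []) << parts[-1] }
  h
end

with_gc = Benchmark.realtime { build.call }

GC.disable   # heap grows unchecked -- only safe on a bounded subset
without_gc = Benchmark.realtime { build.call }
GC.enable

puts "with GC: %.3fs, GC disabled: %.3fs" % [with_gc, without_gc]
```

If the GC-disabled run is dramatically faster, the collector is the bottleneck; if not, the slowdown is elsewhere.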
Any clue on how to speed this up? Any help appreciated.
Guillaume.
The code is as follows:
delta and snps are IOs. reads is a hash. max is an integer (4 in my
case).
The method expects a line starting with a '>' on delta. It then reads
some information from delta (and discards the rest) and some more
information from snps (if present). All this is then recorded in the
reads hash. Each entry in the hash is an array holding the 4 best
matches found so far.
require 'bsearch'   # Array#bsearch_upper_boundary, from the ruby-bsearch library

def delta_reorder(delta, snps, reads, max = nil)
  l = delta.gets or return
  snps_a = nil
  loop do
    l =~ /^>(\S+)\s+(\S+)/ or break
    contig_name, read_name = $1, $2
    read = (reads[read_name] ||= [])
    loop do
      l = delta.gets or break
      l[0] == ?> and break
      cs, ce, rs, re, er = l.scan(/\d+/)
      cs && ce && rs && re && er or break
      er_i = er.to_i
      l = delta.gets while l && l != "0\n"
      if snps
        snps_a = []
        er_i.times { l = snps.gets or break; snps_a << l.split[-1] }
      end
      score = (re.to_i - rs.to_i).abs - 6 * er_i
      if max
        # read stays sorted by descending score; insert in place,
        # then drop anything past the max best entries.
        i = read.bsearch_upper_boundary { |x| score <=> x[1] }
        read.insert(i, [contig_name, score, cs, ce, rs, re, er, snps_a])
        read.slice!(max..-1) if read.size > max
      else
        if !read[0] || score > read[0][1]
          read[0] = [contig_name, score, cs, ce, rs, re, er, snps_a]
        end
      end
    end
  end
end
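For what it's worth, the "keep only the best max matches per read" bookkeeping can also be sketched in isolation with a plain linear scan (no bsearch dependency); keep_best and the toy scores below are mine, not from the script:

```ruby
# Minimal sketch: cap `read` at the `max` best-scoring entries.
# Entries mirror the script's format: [name, score, ...].
def keep_best(read, entry, max)
  if read.size < max
    read << entry
  else
    min = read.min { |x, y| x[1] <=> y[1] }   # current worst entry
    min.replace(entry) if entry[1] > min[1]   # overwrite it in place
  end
  read
end

read = []
[["a", 10], ["b", 3], ["c", 7], ["d", 12], ["e", 5]].each do |name, score|
  keep_best(read, [name, score], 4)
end
p read.map { |e| e.first }.sort   # → ["a", "c", "d", "e"]
```

The linear scan is O(max) per insertion, which is cheap for max = 4; the sorted bsearch insertion only pays off for larger max.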
Example of the data (comments after # are mine, not present in the file):
read_name (the hash key) is gnl|ti|379331986
gi|56411835|ref|NC_004353.2| gnl|ti|379331986 1281640 769
246697 246940 722 479 22 22 0 # Keep this info. Collect 22 lines from the snps IO.
0 # Skip
440272 440723 156 617 41 41 0 # Keep this info. Collect 41 lines from the snps IO.
147 # Skip 'til 0
-22
-206
-1
-1
-1
-1
-1
-1
-1
-1
-1
0
441263 441492 384 152 17 17 0 # Keep. Collect lines from snps.
-44 # Skip 'til 0
-1
-1
-1
37
0
gi|56411835|ref|NC_004353.2| gnl|ti|379331989 1281640 745 # and so forth…
453805 453934 130 1 8 8 0
0
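To make the format concrete, here is a toy, self-contained walk over a two-record input (a simplification of delta_reorder; the scoring formula is the one from the code above, and the sample input is abbreviated):

```ruby
require 'stringio'

# Toy walk of the delta format: a header line starts with '>', each
# alignment record is a 7-number line, and its indel deltas run until
# a lone 0. (Delta lists abbreviated; snps not simulated.)
delta = StringIO.new(<<EOS)
>contig1 read1 1000 700
246697 246940 722 479 22 22 0
0
440272 440723 156 617 41 41 0
147
-22
0
EOS

records = []
l = delta.gets
while l && l =~ /^>(\S+)\s+(\S+)/
  contig, read = $1, $2
  while (l = delta.gets) && l[0, 1] != ">"
    nums = l.scan(/\d+/)
    break unless nums.size >= 5
    cs, ce, rs, re, er = nums
    l = delta.gets while l && l != "0\n"   # skip the delta list
    records << [read, (re.to_i - rs.to_i).abs - 6 * er.to_i]
  end
end
p records   # → [["read1", 111], ["read1", 215]]
```

The first record scores |479 - 722| - 6*22 = 111 and the second |617 - 156| - 6*41 = 215, so with max = 4 both would be kept for read gnl|ti|379331986-style keys.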