I have a script that aggregates data from multiple files, stores it all
in a hash, and then emits a summary on standard output. The input files
(text files) are fairly big: four of about 50 MB and four of about 350 MB.
The hash grows to about 500,000 keys. The memory footprint of the
Ruby process as reported by top is above 2 GB.

When the script starts, it processes the files at a speed of 10 KB/s or
so. Not lightning fast, but it gets the job done. As time goes on,
the speed drops to 100 bytes/s or less, while still taking 100%
CPU time. Unbearable. The machine it runs on is pretty good:
4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drives.

Does Ruby's performance collapse past a certain memory usage, e.g.
because the GC kicks in all the time?
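To get a rough idea of whether the GC is the problem, I suppose one could
time a chunk of comparable work with the GC enabled and then disabled
(GC.disable/GC.enable and Benchmark are standard Ruby; the workload below
is just a stand-in for the real parsing loop):

require 'benchmark'

# Stand-in workload: lots of short-lived strings and hash entries,
# roughly the allocation pattern of the real script.
def chunk_of_work(h)
  10_000.times { |i| (h["key#{i % 500}"] ||= []) << ("x" * 50) }
end

h = {}
with_gc = Benchmark.realtime { chunk_of_work(h) }
GC.disable
without_gc = Benchmark.realtime { chunk_of_work(h) }
GC.enable
puts "with GC: #{with_gc}s, without GC: #{without_gc}s"

If the gap is large, the time is going to collection rather than to the
parsing itself.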

Any clue on how to speed this up? Any help appreciated.

Guillaume.


The code is as follows:

delta and snps are IOs. reads is a hash. max is an integer (4 in my
case).
The function expects a line starting with '>' on delta. It then reads
some information from delta (and discards the rest) and some more
information from snps (if present). All of this is recorded in the
reads hash. Each entry in the hash is an array of the max best matches
found so far.
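So after a run, each value in reads looks roughly like this (a sketch;
score and snps_a stand for the computed values, the other fields are the
strings scanned from the delta line):

reads["gnl|ti|379331986"]
# => [["gi|56411835|ref|NC_004353.2|", score,
#      "246697", "246940", "722", "479", "22", snps_a],
#     ...]  # up to max entries, one per match kept so far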

def delta_reorder(delta, snps, reads, max = nil)
  l = delta.gets or return
  snps_a = nil
  loop do
    # Header line: ">contig_name read_name ..."
    l =~ /^>(\S+)\s+(\S+)/ or break
    contig_name, read_name = $1, $2
    read = (reads[read_name] ||= [])
    loop do
      l = delta.gets or break
      l[0] == ?> and break    # reached the next header
      cs, ce, rs, re, er = l.scan(/\d+/)
      cs && ce && rs && re && er or break
      er_i = er.to_i
      # Skip the rest of the record, up to the terminating "0" line.
      l = delta.gets while l && l != "0\n"
      if snps
        # Collect er_i lines from the snps IO, keeping the last field of each.
        snps_a = []
        er_i.times do
          sl = snps.gets or break
          snps_a << sl.split[-1]
        end
      end
      score = (re.to_i - rs.to_i).abs - 6 * er_i
      if max
#       i = read.bsearch_upper_boundary { |x| score <=> x[1] }
#       read.insert(i, [contig_name, score, cs, ce, rs, re, er, snps_a])
#       read.slice!(max..-1) if read.size > max
        if read.size >= max
          # Array is full: replace the current minimum if this match beats it.
          min = read.min { |x, y| x[1] <=> y[1] }
          if score > min[1]
            min.replace([contig_name, score, cs, ce, rs, re, er, snps_a])
          end
        else
          read << [contig_name, score, cs, ce, rs, re, er, snps_a]
        end
      else
        # No max: keep only the single best match.
        if !read[0] || score > read[0][1]
          read[0] = [contig_name, score, cs, ce, rs, re, er, snps_a]
        end
      end
    end
  end
end
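For reference, the call site looks roughly like this (the file names are
made up for the example):

reads = {}
File.open("all.delta") do |delta|
  File.open("all.snps") do |snps|
    delta_reorder(delta, snps, reads, 4)
  end
end
# reads now maps each read name to its 4 best matches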


Example of data (comments after # are mine, not present in the file):

# read_name (hash key) is gnl|ti|379331986
>gi|56411835|ref|NC_004353.2| gnl|ti|379331986 1281640 769
246697 246940 722 479 22 22 0    # Keep this info. Collect 22 lines from snps IO
0                                # Skip
440272 440723 156 617 41 41 0    # Keep this info. Collect 41 lines from snps IO
147                              # Skip 'til 0
-22
-206
-1
-1
-1
-1
-1
-1
-1
-1
-1
0
441263 441492 384 152 17 17 0   # Keep. Collect lines from snps.
-44                             # Skip 'til 0
-1
-1
-1
37
0
>gi|56411835|ref|NC_004353.2| gnl|ti|379331989 1281640 745 # and so forth...
453805 453934 130 1 8 8 0
0