On 17.08.2006 15:54, Guillaume Marcais wrote:
> I have a script that aggregates data from multiple files, stores it all in 
> a hash, and then emits a summary on standard output. The input files (text 
> files) are fairly big, like 4 of about 50Mb and 4 of about 350Mb. The 
> hash will grow to about 500 000 keys. The memory footprint of the ruby 
> process as reported by top is above 2 Gigs.
> 
> When the script starts, it processes the files at a speed of 10K/s or so. 
> Not lightning fast, but will get the job done. As time goes on, the 
> speed drops down to 100 bytes/s or less, while still taking 100% CPU 
> time. Unbearable. The machine it is running on is pretty good: 4xAMD 
> Opteron 64bit, 32G memory, local scsi raided drive.
> 
> Does the performance of Ruby collapse past a certain memory usage? Like 
> the GC kicks in all the time.
> 
> Any clue on how to speed this up? Any help appreciated.
> 
> Guillaume.
> 
> 
> The code is as follows:
> 
> delta and snps are IOs. reads is a hash. max is an integer (4 in my case).
> It expects a line starting with a '>' on delta. Then it reads some 
> information on delta (and discards the rest) and some more information on 
> snps (if present). All this is then recorded in the reads hash.
> Each entry in the hash is an array with the 4 best matches found so far.
> 
> def delta_reorder(delta, snps, reads, max = nil)
>   l = delta.gets or return
>   snps_a = nil
>   loop do
>     l =~ /^>(\S+)\s+(\S+)/ or break
>     contig_name, read_name = $1, $2

Small optimization, which will help only if delta_reorder is called often:

     read = (reads[read_name.freeze] ||= [])

Background: a Hash will dup a non frozen string to avoid nasty effects 
if the original changes.
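
To illustrate, here is a minimal sketch of that behaviour (the key 
names "read_1" and "read_2" are made up for the example, and this is 
not taken from your script):

```ruby
# Hash#[]= stores a frozen copy of an unfrozen String key,
# but stores an already frozen String key as-is.
h = {}

# .dup guarantees an unfrozen string even with frozen string literals.
unfrozen = "read_1".dup
h[unfrozen] = []
stored = h.keys.first
unfrozen_copied = !stored.equal?(unfrozen)  # a different object was stored
stored_frozen   = stored.frozen?            # and that object is frozen

frozen = "read_2".dup.freeze
h[frozen] = []
frozen_reused = h.keys.last.equal?(frozen)  # the very same object was stored

puts "unfrozen key copied: #{unfrozen_copied}"
puts "stored key frozen:   #{stored_frozen}"
puts "frozen key reused:   #{frozen_reused}"
```

With some 500 000 keys, freezing saves one String allocation per newly 
inserted key. That is probably a small gain compared to the rest of 
the processing.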

<snip/>

To make life easier for people who want to play with this, you could 
provide a complete test set (original script + data files).

I don't fully understand your processing, but maybe there's an option to 
improve this algorithmically.

Kind regards

	robert