On 17.08.2006 15:54, Guillaume Marcais wrote:
> I have a script that aggregates data from multiple file, store it all in
> a hash, and then emit a summary on standard input. The input files (text
> files) are fairly big, like 4 of about 50Mb and 4 of about 350Mb. The
> hash will grow to about 500 000 keys. The memory footprint of the ruby
> process as reported by top is above 2 Gigs.
>
> When the script start, it processes the files at a speed of 10K/s or so.
> Not lightening fast, but will get the job done. As time goes on, the
> speed drops down to 100 bytes/s or less, while still taking 100% CPU
> time. Unbearable. The machine it is running on is pretty good: 4xAMD
> Opteron 64bit, 32G memory, local scsi raided drive.
>
> Does the performance of Ruby collapse past a certain memory usage? Like
> the GC kicks in all the time.
>
> Any clue on how to speed this up? Any help appreciated.
>
> Guillaume.
>
>
> The code is as followed:
>
> delta and snps are IOs. reads is a hash. max is an integer (4 in my case).
> It expects a line starting with a '>' on delta. Then it reads some
> information on delta (and discard the rest) and some more information on
> snps (if present). All this is then recorded in the reads hash file.
> Each entry entry in the hash are arrays with the 4 best match found so far.
>
> def delta_reorder(delta, snps, reads, max = nil)
>   l = delta.gets or return
>   snps_a = nil
>   loop do
>     l =~ /^>(\S+)\s+(\S+)/ or break
>     contig_name, read_name = $1, $2

Small optimization, which will help only if delta_reorder is called often:

    read = (reads[read_name.freeze] ||= [])

Background: a Hash dups a non-frozen String key on insertion to avoid
nasty effects if the original changes later. A key that is already
frozen is stored as is, which saves the copy.

<snip/>

To make life easier for people who want to play with this, you could
provide a complete test set (original script + data files). I don't
fully understand your processing, but maybe there's an option to improve
this algorithm-wise.

Kind regards

robert
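
PS: A minimal sketch of the key copying mentioned above, in case someone wants to check it on their own interpreter. The helper name key_copied? and the sample strings are made up for illustration and are not part of the original script. It only shows whether the Hash stores the very same String object or a copy; it does not measure how much time the copy costs.

```ruby
# Returns true when the Hash stored a copy of +key+ instead of the object itself.
# key_copied? is a hypothetical helper used only for this illustration.
def key_copied?(key)
  h = {}
  h[key] = []
  # equal? compares object identity, not string content
  !h.keys.first.equal?(key)
end

mutable = String.new("read_1")  # explicitly unfrozen String
frozen  = "read_2".freeze       # frozen before it is used as a key

puts key_copied?(mutable)  # unfrozen key: the Hash keeps its own frozen copy
puts key_copied?(frozen)   # frozen key: the Hash keeps the original object
```

With a key that is already frozen, the extra String allocation per new entry goes away, which is why the gain only shows up when many keys are inserted.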