On 17 Aug 06, at 10:45, Robert Klemme wrote:

> On 17.08.2006 15:54, Guillaume Marcais wrote:
>> I have a script that aggregates data from multiple files, stores it all 
>> in a hash, and then emits a summary on standard output. The input files 
>> (text files) are fairly big: 4 of about 50 MB and 4 of about 
>> 350 MB. The hash will grow to about 500,000 keys. The memory footprint 
>> of the ruby process as reported by top is above 2 GB.
>> When the script starts, it processes the files at a speed of 10K/s or 
>> so. Not lightning fast, but it will get the job done. As time goes on, 
>> the speed drops to 100 bytes/s or less, while still taking 100% 
>> CPU time. Unbearable. The machine it is running on is pretty good: 
>> 4x AMD Opteron 64-bit, 32 GB memory, local SCSI RAID drive.
>> Does the performance of Ruby collapse past a certain memory usage? 
>> Like the GC kicking in all the time?
>> Any clue on how to speed this up? Any help appreciated.
>> Guillaume.
>> The code is as follows:
>> delta and snps are IOs. reads is a hash. max is an integer (4 in my 
>> case).
>> It expects a line starting with a '>' on delta. Then it reads some 
>> information on delta (and discards the rest) and some more information 
>> on snps (if present). All this is then recorded in the reads 
>> hash.
>> Each entry in the hash is an array with the 4 best matches found 
>> so far.
>> def delta_reorder(delta, snps, reads, max = nil)
>>   l = delta.gets or return
>>   snps_a = nil
>>   loop do
>>     l =~ /^>(\S+)\s+(\S+)/ or break
>>     contig_name, read_name = $1, $2
>
> Small optimization, which will help only if delta_reorder is called 
> often:
>
>     read = (reads[read_name.freeze] ||= [])
>
> Background: a Hash will dup a non-frozen string key to avoid nasty 
> effects if the original changes.
>
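
The key-copy behaviour described above can be checked with a minimal sketch. The names `reads`, `name`, and `frozen_name` and the sample strings are made up for illustration and are not taken from the original script.

```ruby
# Illustrative only: hypothetical names and sample data.
reads = {}

# An unfrozen String used as a key: Hash#[]= stores a frozen copy,
# so the stored key is a different object from the original.
name = "read_001".dup
reads[name] = []
stored_key = reads.keys.first
key_copied = !stored_key.equal?(name)   # expected: true
key_frozen = stored_key.frozen?         # expected: true

# An already frozen String used as a key: it is stored as is,
# so no extra String is allocated for the key.
frozen_name = "read_002".dup.freeze
reads[frozen_name] = []
key_shared = reads.keys.last.equal?(frozen_name)  # expected: true

puts "copied=#{key_copied} frozen=#{key_frozen} shared=#{key_shared}"
```

The saving is one String allocation per new key, so it only adds up when many distinct keys are inserted.
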
> <snip/>
>
> To make life easier for people who want to play with this, you could 
> provide a complete test set (original script + data files).

Will do, when I get to my office.

Guillaume.

>
> I don't fully understand your processing, but maybe there's a way 
> to improve this algorithm-wise.
>
> Kind regards
>
> 	robert
>
>