On Aug 17, 2006, at 11:16, Francis Cianfrocca wrote:

> On 8/17/06, Guillaume Marcais <guslist / free.fr> wrote:
>>
>> I have a script that aggregates data from multiple files, stores it
>> all in a hash, and then emits a summary on standard output. The
>> input files (text files) are fairly big: 4 of about 50 MB and 4 of
>> about 350 MB. The hash will grow to about 500,000 keys. The memory
>> footprint of the ruby process as reported by top is above 2 GB.
>
> This is a perfect example of what I've noticed many times: Ruby's
> performance is perfectly fast and acceptable until the working set
> gets to a certain (not terribly large) size, then it falls off a
> cliff. GC perhaps has something to do with it, but I suspect that's
> only a small part of the problem.
>

Is there any tuning of the GC so it kicks in less frequently when the 
memory consumption is large? Or could it be that the Hash algorithm 
chokes when the number of keys gets large (like > 100_000)?

I guess I could re-implement it in C or C++, but it saddens me because 
it is a quick script for (probably) a one-time task, and string 
processing is so much easier in Ruby. I'll try Perl first.

> Before I spend a lot of time understanding your algorithm: is there a
> specific reason why you need to keep the entire set in memory? You say
> you're generating summary output at the end of the run, but can you
> accumulate your summary in an object of fixed size?

No, I can't (at least I don't see how). I have n reads split into ns 
files and m contigs split into ms files. Then all ns read files are 
matched against all ms contig files, which leads to ns * ms match (or 
delta) files. And each read may have one or many matches with any of 
the contigs.

I want to extract, for each read, the 4 "best" matches. The score is 
defined as the length of the match minus 6 times the number of errors 
in that match (we are talking about DNA alignment, so a match can have 
a small number of mismatches, due to sequencing errors, mutations, 
etc.). The SNPs here are just added information, not crucial, but they 
probably add to the problem because they increase the size of the data 
set: they are the exact positions of all the single mismatches inside 
a match. The number of lines (one mismatch per line) is given by the 
sub-header in the delta file.
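
In Ruby terms the score of a single match is just the following (the 
variable names are made up for the example, they are not the ones in 
my script):

    score = match_length - 6 * n_errors  # n_errors: mismatches/indels reported for this match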

So for every set of reads, I parse the corresponding ms delta files 
and extract the best 4 matches for each read. The routine shown before 
parses only one file; the same hash is used for all ms files.
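
Roughly, the outer loop looks like the sketch below (parse_delta and 
the field names are placeholders, not my actual code). The idea is 
that the hash only ever keeps the 4 best entries per read, whichever 
delta file they came from:

    best_by_read = {}            # read_id => up to 4 [score, contig_id, match] entries
    delta_files.each do |file|   # the ms delta files for this set of reads
      parse_delta(file) do |read_id, contig_id, match_length, n_errors, match|
        score = match_length - 6 * n_errors
        best = (best_by_read[read_id] ||= [])
        best << [score, contig_id, match]
        # keep only the 4 highest scores so each read stays bounded in size
        best_by_read[read_id] = best.sort { |a, b| b[0] <=> a[0] }[0, 4]
      end
    end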

>
> Or does your summary depend on some kind of transformation on the
> entire set (like a numeric sort)? If so, there are things you can do
> to improve the algorithm so you're not keeping a large working set
> in memory.
>

Maybe. I haven't found one yet. And I went with the dumb algorithm 
first, as it is an ad-hoc script, not part of some great piece of 
software.

Thanks,
Guillaume.