On Fri, 2008-08-08 at 20:40 +0900, tobyclemson / gmail.com wrote:
> Hi all,
> 
> I'm having a really odd memory problem with a small ruby program I've
> written. It basically takes in lines from input files (which represent
> router flows), deduplicates them (based on elements of the line) and
> outputs the unique flows to file. The input file often contains over
> 300,000 lines of which about 25-30% are duplicates. The trouble I'm
> having is that the program (which is intended to be long-running) does
> not seem to release any memory back to the system and in fact just
> increases in memory footprint from iteration to iteration. It should
> use about 150 MB by my estimates but sails through this and yesterday
> slowed to a halt at about 1.6GB (due to the GC, at my guess). This
> makes no sense, as at times I am deleting data structures that are 50MB
> each, which should show some decrease in memory usage.
> 
> The codebase is slightly too big to pastie, but it is available
> here: http://svn.tobyclemson.co.uk/public/trunk/flow_deduplicator .
> There are actually only 2 classes of importance and 1 script, but I
> don't know if pastie can handle that.
> 
> Any help would be greatly appreciated, as the alternative (pressure
> from above) is to rewrite in Python (which involves me learning
> Python).
> 
> Thanks in advance,
> Toby Clemson

Are you on a platform that has GNU "sort" available? GNU "sort" can do
the duplicate removal for you a *lot* more efficiently than a program in
*any* scripting language. Then you can use Ruby to do the "interesting"
part of the problem. :)
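
For instance, GNU sort with key fields can do the whole deduplication
in one pass, and you can drive it from Ruby. Here's a minimal sketch,
assuming comma-separated flow lines where the first three fields
identify a flow, and hypothetical file names; adjust -t and -k to match
your real format:

  # Hypothetical input/output file names for illustration.
  input  = 'flows.txt'
  output = 'unique_flows.txt'

  # GNU sort: -t sets the field separator, -k1,3 makes fields 1-3 the
  # sort key, -u keeps one line per unique key, -o writes to a file.
  ok = system('sort', '-t', ',', '-k1,3', '-u', input, '-o', output)
  raise 'sort failed' unless ok

  # Ruby then only ever sees already-unique flows:
  File.foreach(output) do |line|
    # ... the "interesting" per-flow work goes here ...
  end

Because GNU sort does an external merge sort, spilling to temporary
files as needed, its memory use stays bounded no matter how large the
input file gets, which is exactly the property your in-process hash
approach is missing.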
-- 
M. Edward (Ed) Borasky
ruby-perspectives.blogspot.com

"A mathematician is a machine for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős