Sorry, I don't quite understand the problem - I can see that there
probably is one, but I think it's a matter of terminology. What do you
mean when you say destructively modified? Am I modifying the value of
the timestamp in place, so that any reference to that timestamp is
modified too? Should I be duplicating the string that is used to key
the buffer in the buffers hash? I didn't think that the actual object
was passed in when an argument is supplied; I thought a copy of it was
passed in.
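
[A quick check answers the pass-by-copy question: Ruby passes copies of
object *references*, not copies of the objects themselves, so a destructive
(in-place) method called on an argument is visible through every other
reference to that object. A minimal sketch:]

```ruby
# Ruby passes object references; an in-place ("bang") method mutates
# the one shared object that every reference points at.
def shout!(s)
  s.upcase!   # destructive: modifies the receiver in place
end

# Nondestructive variant: returns a new String, original untouched.
def shout(s)
  s.upcase
end

name      = "toby"
alias_ref = name        # a second reference to the SAME String object
shout!(name)
puts name               # => "TOBY"
puts alias_ref          # => "TOBY" - only the reference was copied

original = "toby"
puts shout(original)    # => "TOBY"
puts original           # => "toby" - unchanged
```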

How would I make Timestamp#advance nondestructive?
If it is easier than pasting here, I can give you commit privileges on
that repository.

Thanks very much for your help,
Toby

On Aug 8, 1:31 pm, "Stefan Lang" <perfectly.normal.hac... / gmail.com>
wrote:
> 2008/8/8 tobyclem... / gmail.com <tobyclem... / gmail.com>:
>
>
>
> > Hi all,
>
> > I'm having a really odd memory problem with a small ruby program I've
> > written. It basically takes in lines from input files (which represent
> > router flows), deduplicates them (based on elements of the line) and
> > outputs the unique flows to file. The input file often contains over
> > 300,000 lines of which about 25-30% are duplicates. The trouble I'm
> > having is that the program (which is intended to be long running) does
> > not seem to release any memory back to the system and in fact just
> > increases in memory footprint from iteration to iteration. It should
> > use about 150 MB by my estimates but sails through this and yesterday
> > slowed to a halt at about 1.6GB (due to the GC by my guess). This
> > doesn't make any sense to me as at times I am deleting data structures
> > that occupy at least 50MB of memory.
>
> > The codebase is slightly too big to pastie, but it is available
> > here: http://svn.tobyclemson.co.uk/public/trunk/flow_deduplicator.
> > There are actually only 2 classes of importance and 1 script but I
> > don't know if pastie can handle that.
>
> > Any help would be greatly appreciated, as the alternative (pressure
> > from above) is to rewrite in Python (which involves me learning
> > Python).
>
> I _think_ I have found a problem. In the main loop (in bin/dedupe),
> you use a single Timestamp instance, which is destructively
> modified by calling advance.
>
> Now this single Timestamp instance is used as a key for _all_
> calls to checksum_buffer.add(). As a result, the @buffers hash
> will always have only one entry and this single entry will hold _all_
> flow.checksum/flow.timestamp pairs ever. Since the retention threshold
> is 1, this single @buffers entry that holds _all_ the data will never
> be deleted.
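
[The failure mode Stefan describes can be reproduced in miniature. This
Timestamp is a stand-in sketched from the description, not the real class
from the repository:]

```ruby
# One mutable object reused as the key for every insertion: the hash
# ends up with a single entry that accumulates everything.
class Timestamp
  attr_reader :seconds

  def initialize(seconds)
    @seconds = seconds
  end

  # Destructive advance: mutates this instance in place.
  def advance
    @seconds += 60
  end
end

buffers = Hash.new { |h, k| h[k] = [] }
ts = Timestamp.new(0)

3.times do |i|
  buffers[ts] << "flow-#{i}"  # the SAME object is the key every time
  ts.advance                  # mutates the key already stored in the hash
end

# Default Object#hash is identity-based, so all three adds hit one slot:
puts buffers.size                  # => 1
puts buffers.values.first.inspect  # => ["flow-0", "flow-1", "flow-2"]
```

With a retention threshold of 1 that lone entry is always "current", so it is never deleted and just grows without bound.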
>
> The solution should be to make Timestamp#advance nondestructive
> and change the line
>
>   timestamp.advance
>
> in the main loop to
>
>   timestamp = timestamp.advance
>
> Stefan
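
[One way to make advance nondestructive, sketched against an assumed
seconds-based Timestamp rather than the real class in the repository, is to
return a fresh instance instead of mutating self:]

```ruby
class Timestamp
  attr_reader :seconds

  def initialize(seconds)
    @seconds = seconds
  end

  # Nondestructive: builds and returns a new Timestamp, leaving self
  # untouched, so hash entries keyed on the old instance are unaffected.
  def advance(step = 60)
    Timestamp.new(@seconds + step)
  end
end

ts  = Timestamp.new(0)
ts2 = ts.advance
puts ts.seconds      # => 0  (original unchanged)
puts ts2.seconds     # => 60
puts ts.equal?(ts2)  # => false: distinct objects, distinct hash keys
```

Then `timestamp = timestamp.advance` in the main loop rebinds the variable to the new object, and each iteration keys the buffers hash with a different instance, so old entries can actually age out and be deleted.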