The "initialization holes" that leave potential pointers on the stack occur
in the interpreter, any system libraries and the GC itself.  Thus clearing
some stack words before and *after* allocation/GC helps, but at an obvious cost.

  Keeping stack frames small helps, perhaps moving some data structures out of
the C stack into explicit stacks would help there?  A call/cc implemenation
that copies less C stack might also reduce leaks and overhead:

http://github.com/kstephens/ll/tree/master/src/ccont

  Recompiling Ruby with flags to reduce initialization holes will not help
leaks from appearing in initialization holes in system libraries.  We have
some Ruby processes (> 375 MB) that we'd like to keep running longer, but are
unable to do so because of leaks.

  I'll help test your patches on 1.8.6.

Kurt

Brent Roman wrote:
> After a couple weeks of long nights and false starts, I feel I may have come
> up with
> a fix for a large class of Ruby memory leak.  The basic technique is a
> refinement of the
> one Kurt Stephens suggested.  It not only eliminates the leaks in this one
> liner:
> 
>   loop {@x=callcc{|c|c}}
> 
> but also in our multi-threaded robotics application.  Our Ruby process used
> to grow
> to 20+ MB during a day long run.  The same run now stays smaller than 10MB.
> On an embedded ARM Linux machine with only 32MB of DRAM, this is a great
> result!
> 
> The central problem is that gcc (and other compilers) tend to create
> sparse stack frames such that, when a new frame is pushed onto the stack, it
> does not 
> completely overwrite the one that had been previously stored there.  The new
> frame gets
> activated with old VALUE pointers preserved inside its holes.  These become
> "live" again
> as far as any conservative garbage collector is concerned.  And, viola, a
> leak is born!
> 
> I implemented a scheme for recording the maximum depth of the C stack in
> xmalloc and during garbage collection itself.  However, I realized that 
> there was no point in clearing the stack when it is near its maximum depth.
> Instead, stack clearing is deferred until CHECK_INTS, as this tends to
> happen
> between evaluation of nodes, when the stack is likely to be shallower. 
> 
> At this point
> a tight loop quickly zeros the region between the current top of stack, as 
> returned by alloca(0), and the maximum recorded stack extent.  It also
> updates
> the stack extent so no memory is cleared repeatedly if the stack contracts
> further.
> 
> This paper discusses this and similar techniques:
> http://www.hpl.hp.com/personal/Hans_Boehm/gc/papers/pldi93.ps.Z
> 
> Another related issue is that the style of rb_eval() in eval.c in the
> 1.8 and 1.6 series causes gcc to emit a especially large and sparse stack
> frames.
> Consider that gcc allocates two pair of stack slots for r and l in
> constructs like this:
> 
>     switch (nd_type(node)) {
>  	/* nodes for speed-up(literal match) */
>       case NODE_MATCH2:
> 	{
> 	    VALUE l = rb_eval(self,node->nd_recv);
> 	    VALUE r = rb_eval(self,node->nd_value);
> 	    result = rb_reg_match(l, r);
> 	}
> 	break;
> 
> 	/* nodes for speed-up(literal match) */
>       case NODE_MATCH3:
> 	{
> 	    VALUE r = rb_eval(self,node->nd_recv);
> 	    VALUE l = rb_eval(self,node->nd_value);
> ....
> 
> By the time the compiler's optimizer is allocating stack frame slots, all
> the block structure
> of the original code has been lost in various transformations.
> As a result, each rb_eval() call ends up pushing about 4k bytes onto the C
> stack, 
> of which less than 20% is even initialized.    This means that:
> 
> 1)  There is a high probability that old VALUEs from previous frames 
>      will be resurrected as the stack grows,
> 
> 2)  The GC must scan a sparse, large stack and mark the many dead object
> pointers it contains.
> 
> 3)  callcc and thread context switches must copy needlessly large stacks
> 
> 4)  recursive Ruby programs run out of stack space much earlier than than
> they might otherwise.
> 
> When I simply re-factored rb_eval() such that it calls a (non-inline)
> function
> for each node type it encounters, the total observed C stack size for my
> application
> was reduced by more than two thirds.  Not surprisingly, threading and
> continuation
> micro benchmarks and run about 3 - 4 times faster.  However, I expect that
> benchmarks that
> operate repeatedly on a few large, long lived objects will run slower.
> 
> Keep in mind that these techniques should improve the performance of *any*
> garbage
> collector that scans the unstructured C stack for valid object pointers.  It
> may
> even be relevant for the 1.9 series Ruby, but I'll leave that for those more
> qualified to determine.
> 
> Today, this is implemented only in my heavily patched version of Ruby 1.6.8.
> In the short term, if there's interest,
> I can quickly post my hacked 1.6.8 Ruby to an FTP site for others to test.
> 
> Longer term,
> The stack clearing could be supplied as a small patch to the 1.8 series,
> however the
> refactoring of rb_eval() is probably too large to be attached to an email
> message
> on this list.  I will take the time to produce these patches only if at
> least a few people
> commit to testing them,  reporting detailed results and suggestions for
> improvement here.
> 
> - brent
> 
>