I just had a quick play with the gcc option "-minline-all-stringops".
It was definitely a step in the right direction.

Because it inlined the memset, I could safely remove the offset kludge
(as there was no longer a memset() stack frame to preserve).

But the compiler still emitted (useless) longword-alignment code after the
main block of the memset operation.  This reformulation of the macro
eliminates that (and removes the offset):

#define __stack_zero_down(end,sp) \
  if (sp > end) memset(end, 0, (sp-end)*sizeof(VALUE))
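
As an aside, for anyone who wants to poke at the macro outside of ruby,
here is a minimal stand-alone harness; the VALUE typedef and the test
values are mine, not ruby's:

#include <assert.h>
#include <string.h>

typedef unsigned long VALUE;    /* stand-in for ruby's VALUE on 32-bit x86 */

#define __stack_zero_down(end,sp) \
  if (sp > end) memset(end, 0, (sp-end)*sizeof(VALUE))

int main(void)
{
    VALUE buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    VALUE *end = &buf[2], *sp = &buf[6];

    __stack_zero_down(end, sp);          /* zeroes buf[2]..buf[5] */
    assert(buf[1] == 2 && buf[2] == 0 && buf[5] == 0 && buf[6] == 7);

    __stack_zero_down(sp, end);          /* sp <= end: nothing cleared */
    return 0;
}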

Now the generated code looks quite clean:

	movl	%edx, %ecx	;ecx = sp (end is in edi)
	subl	%edi, %ecx	;ecx = sp - end, in bytes
	andl	$-4, %ecx	;round down to whole longwords
	cmpl	$4, %ecx
	jb	.L1508  	;skip if sp<=end
	shrl	$2, %ecx	;byte count -> longword count
	xorl	%eax, %eax	;fill value = 0
	rep stosl		;store ecx zero longwords starting at end
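
(If anyone wants to see what their own gcc emits here, something like
the following should do it: I was looking at 32-bit x86 output, and
stack_zero.c is just the stand-alone harness above, with a name of my
own choosing.)

	gcc -O2 -m32 -minline-all-stringops -S stack_zero.c -o stack_zero.s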

However, I still don't see any improvement on my little benchmarks.
If someone comes up with an app or test case where these patches appear 
to slow things down, then I'll ask them to try this alternative and 
perhaps we'll see an improvement.

I'm leery of this technique because, if -minline-all-stringops is omitted,
one must offset the stack pointer by the size of the memset() frame to
preserve it; otherwise the memset causes a segfault.
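
To make the kludge concrete, without the inlining the macro has to stop
short of sp by some margin, roughly along these lines (the
MEMSET_FRAME_SLOTS constant is purely illustrative, not the value from
the actual patch):

#define MEMSET_FRAME_SLOTS 16   /* guess at room for memset()'s frame, in VALUEs */

#define __stack_zero_down(end,sp) \
  if ((sp) - MEMSET_FRAME_SLOTS > (end)) \
    memset(end, 0, ((sp) - MEMSET_FRAME_SLOTS - (end))*sizeof(VALUE))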

This optimization is very machine- and compiler-dependent, and the gain has
not yet been demonstrated.
But it's reassuring to have worked it out.  Thanks for the tip!

- brent


Michael Selig wrote:
> 
> 
> Try using the gcc option "-minline-all-stringops". I think that should  
> force memset (and other stuff) to be inlined.
> 
> 
>> On the other hand...
>> Very recently, folks who've looked into this far more intensively than
>> I concluded that an unrolled 'C' loop was better than the venerable
>>
>>   rep stosl
>>
>> assembly instructions used by x86 gcc's __builtin_memset().  See:
>>
>> http://sourceware.org/ml/newlib/2008/msg00286.html
>>
>> They note that microcoded instructions are slower than simple ones for
>> the modern x86 (RISC-ish) execution cores.  The fastest way to clear
>> memory these days is supposedly to use MMX instructions.
>> (I'm not going there, but I welcome others to explore where that might  
>> lead.)
> 
> Thanks for this reference. I got the impression that he was saying that:
> - Memset on GCC 3.4 could be slower than his C tight loop when working on  
> unaligned data. However I think that this may be fixed in GCC 4.
> - "rep stosl" was fastest when working on 8-byte aligned data on some x86  
> platforms. His assembly patch seems to set the first few bytes until it  
> gets to an address divisible by 8, then uses "rep stosl" from there. I  
> think GCC 4.3.2 seems to do 4 byte aligned copies using "rep stosl" when  
> inlined.
> However his code ALWAYS did a function call to memset or a version of it,  
> so it is not clear whether the function call overhead makes much  
> difference compared to inlining the memset call.
> 
> The fact that you didn't notice much difference between the C loop and a  
> function call to memset() seems to imply that this optimization may not be  
> all that important to ruby stack clearing. It really depends on how often  
> it is called, and how much it is clearing at a time. It is probably worth  
> benchmarking a little more, but I may be barking up the wrong tree here!
> 
> Cheers
> Mike
> 
> 
> 
