On Wed, 24 Dec 2008 18:32:57 +1100, Brent Roman <brent / mbari.org> wrote:

> As an experiment, I tried substituting memset for my tight stack clearing
> loop...
>
> and discovered that memset() is actually quite a large function,
> and gcc does not inline it.  It is large because, in this context, the
> compiler cannot tell that the pointers are already long-word aligned and
> that we are copying an integer number of long words.  So it emits code
> to copy bytes on either end.

Try the gcc option "-minline-all-stringops". I think that should force
memset (and the other string operations) to be inlined.
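
To make that concrete, here is a rough sketch of the two variants being compared. The function names are made up for illustration and this is not the actual ruby stack clearing code; it just assumes the region is an array of long words.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: clear nwords long words via memset().
   The pointer type already implies long-word alignment, and the
   byte count is a whole number of long words by construction. */
static void clear_words_memset(unsigned long *p, size_t nwords)
{
    memset(p, 0, nwords * sizeof *p);
}

/* Hypothetical helper: the plain C loop doing the same job. */
static void clear_words_loop(unsigned long *p, size_t nwords)
{
    while (nwords--)
        *p++ = 0;
}
```

Compiling with something like "gcc -O2 -minline-all-stringops -S clear.c" (the file name is just an example) and reading the assembly should show whether memset was really inlined on your gcc version.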


> On the other hand...
> Very recently, folks who've looked into this far more intensively than
> I concluded that an unrolled 'C' loop was better than the venerable
>
>   rep stosl
>
> assembly instructions used by x86 gcc's __builtin_memset().  See:
>
> http://sourceware.org/ml/newlib/2008/msg00286.html
>
> They note that microcoded instructions are slower than simple ones for
> the modern x86 (RISC-ish) execution cores.  The fastest way to clear
> memory these days is supposedly to use MMX instructions.
> (I'm not going there, but I welcome others to explore where that might
> lead.)

Thanks for this reference. I got the impression that he was saying that:
- memset on GCC 3.4 could be slower than his tight C loop when working on
unaligned data. However, I think this may be fixed in GCC 4.
- "rep stosl" was fastest when working on 8-byte aligned data on some x86
platforms. His assembly patch seems to set the first few bytes until it
reaches an address divisible by 8, then uses "rep stosl" from there. I
think GCC 4.3.2 does 4-byte aligned stores using "rep stosl" when memset
is inlined.
However, his code ALWAYS made a function call to memset or a version of it,
so it is not clear how much difference the function call overhead makes
compared to inlining the memset call.
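
As I understand it, the "align first, then store in bulk" idea looks roughly like the sketch below. This is my own illustration in plain C, not his patch: the name zero_align8 is made up, and I use ordinary 8-byte stores where his assembly uses "rep stosl".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: zero n bytes starting at dst. */
static void *zero_align8(void *dst, size_t n)
{
    unsigned char *p = dst;

    /* Head: single bytes until p is 8-byte aligned (or n runs out). */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = 0;
        n--;
    }

    /* Body: 8-byte stores over the aligned middle part. */
    uint64_t *w = (uint64_t *)(void *)p;
    for (; n >= 8; n -= 8)
        *w++ = 0;

    /* Tail: whatever bytes are left over (fewer than 8). */
    p = (unsigned char *)w;
    while (n--)
        *p++ = 0;

    return dst;
}
```

The head and tail loops are exactly the extra code Brent saw in memset(). They are needed whenever the compiler cannot prove the alignment and length up front.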

The fact that you didn't notice much difference between the C loop and a  
function call to memset() seems to imply that this optimization may not be  
all that important to ruby stack clearing. It really depends on how often  
it is called, and how much it is clearing at a time. It is probably worth  
benchmarking a little more, but I may be barking up the wrong tree here!
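
If anyone wants to try, a crude timing harness might look like the following. It is only a sketch: the buffer size and repeat count are arbitrary guesses rather than measurements of what ruby actually clears, and the barrier macro uses gcc's inline asm syntax.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { WORDS = 256, REPS = 200000 };     /* arbitrary values, tune to taste */

static unsigned long stack_area[WORDS];  /* stand-in for the cleared region */

/* Keep the compiler from dropping stores that nobody reads (gcc syntax). */
#define BARRIER() __asm__ volatile("" : : "r"(stack_area) : "memory")

/* Time REPS clears with a plain C loop; returns CPU seconds.
   Note: gcc may turn this loop into a memset call at higher -O levels,
   so check the generated assembly before trusting the numbers. */
static double time_c_loop(void)
{
    clock_t t0 = clock();
    for (long r = 0; r < REPS; r++) {
        for (size_t i = 0; i < WORDS; i++)
            stack_area[i] = 0;
        BARRIER();
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Time REPS clears with memset(); returns CPU seconds. */
static double time_memset(void)
{
    clock_t t0 = clock();
    for (long r = 0; r < REPS; r++) {
        memset(stack_area, 0, sizeof stack_area);
        BARRIER();
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling both from main() and printing the results, with and without "-minline-all-stringops", would give a first idea of whether the difference is even measurable.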

Cheers
Mike