On Wed, 24 Dec 2008 18:32:57 +1100, Brent Roman <brent / mbari.org> wrote:

> As an experiment, I tried substituting memset for my tight stack clearing
> loop...
>
> and discovered that memset() is actually quite a large function,
> and gcc does not inline it. It is large because, in this context, the
> compiler cannot tell that the pointers are already long-word aligned and
> that we are copying an integer number of long words. So it emits code to
> copy bytes on either end.

Try using the gcc option "-minline-all-stringops". I think that should force memset (and the other string operations) to be inlined.

> On the other hand...
> Very recently, folks who've looked into this far more intensively than
> I concluded that an unrolled 'C' loop was better than the venerable
>
>   rep stols
>
> assembly instructions used by x86 gcc's __built_in_memset(). See:
>
>   http://sourceware.org/ml/newlib/2008/msg00286.html
>
> They note that microcoded instructions are slower than simple ones for
> the modern x86 (RISC-ish) execution cores. The fastest way to clear
> memory these days is supposedly to use MMX instructions.
> (I'm not going there, but I welcome others to explore where that might
> lead

Thanks for this reference. I got the impression that he was saying that:

- Memset on GCC 3.4 could be slower than his tight C loop when working on unaligned data. However, I think that this may be fixed in GCC 4.
- "rep stosl" was fastest when working on 8-byte aligned data on some x86 platforms.

His assembly patch seems to set the first few bytes until it gets to an address divisible by 8, then uses "rep stosl" from there. I think GCC 4.3.2 does 4-byte aligned stores using "rep stosl" when memset is inlined. However, his code ALWAYS made a function call to memset or a version of it, so it is not clear how much the function call overhead matters compared to inlining the memset call.

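To make the "set the first few bytes, then bulk-fill" idea concrete, here is a rough C sketch. It is my own illustration, not the code from the newlib patch, and the names clear_words, clear_bytes and same_as_memset are made up for this example. It uses plain word stores rather than "rep stosl", so it only shows the head/body/tail structure, not the instruction choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the "tight loop" case. The pointer type already
   tells the compiler the data is word aligned and a whole number of
   words long, so no byte handling is needed at either end. */
static void clear_words(unsigned long *p, size_t nwords)
{
    while (nwords-- > 0)
        *p++ = 0;
}

/* Hypothetical helper: the general case. Clear single bytes until the
   address is word aligned, clear whole words, then clear leftover bytes.
   Assumes the middle of the buffer may legally be accessed as
   unsigned long (true for the unsigned long arrays used below). */
static void clear_bytes(unsigned char *p, size_t n)
{
    const size_t mask = sizeof(unsigned long) - 1;

    /* Head: byte stores up to the next word boundary. */
    while (n > 0 && ((uintptr_t)p & mask) != 0) {
        *p++ = 0;
        n--;
    }

    /* Body: whole-word stores on the now aligned pointer. */
    size_t nwords = n / sizeof(unsigned long);
    clear_words((unsigned long *)(void *)p, nwords);
    p += nwords * sizeof(unsigned long);
    n -= nwords * sizeof(unsigned long);

    /* Tail: fewer than one word's worth of bytes remain. */
    while (n > 0) {
        *p++ = 0;
        n--;
    }
}

/* Returns 1 if clear_bytes() and memset() leave two identically
   pre-filled buffers in the same state, 0 otherwise. */
static int same_as_memset(size_t offset, size_t n)
{
    unsigned long a[16], b[16];

    assert(offset + n <= sizeof a);
    memset(a, 0xAA, sizeof a);
    memset(b, 0xAA, sizeof b);

    clear_bytes((unsigned char *)a + offset, n);
    memset((unsigned char *)b + offset, 0, n);

    return memcmp(a, b, sizeof a) == 0;
}
```

Calling same_as_memset(3, 50), for example, exercises all three phases and should return 1. When the caller already holds an unsigned long pointer and a word count, as in the stack clearing case, only clear_words() is needed, which is presumably why the hand-written loop stays so small.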
The fact that you didn't notice much difference between the C loop and a function call to memset() seems to imply that this optimization may not be all that important to Ruby stack clearing. It really depends on how often the clearing code is called, and how much it clears at a time. It is probably worth benchmarking a little more, but I may be barking up the wrong tree here!

Cheers

Mike