M. Edward (Ed) Borasky wrote:
> Lionel Bouton wrote:
>> I've run the QUIZ benchmark on several systems to check their 
>> relative performance and had quite a surprise.
>> I'll take Frank's algorithm as reference for simplicity as it spends 
>> roughly the same time for each point distribution in 
>> 157_benchmark_2.rb but this is true for all of them.
>>
>> This is on Gentoo Linux, all systems are compiled with gcc 4.1.2
>> on 32bit : -O3 -march=i686 -fomit-frame-pointer -pipe
>> on 64bit : -O2 -pipe
>
> Hold on!! Try recompiling the 32-bit version with "-march=pentium4" and 
> the 64-bit version with "-O3 -march=athlon64" and *then* compare the 
> timings! You've saddled the 64-bit version with some default 
> architecture and one less optimization level than the 32-bit version.

Given that there's only one architecture to choose from, I don't think 
there would be any benefit in telling gcc to use it... I could have 
specified -mcpu but I don't want to rely on the exact CPU used (putting 
disks in another system can be handy).
-O2 is recommended for 64bit, as -O3 is often slower and doesn't give 
much benefit even when it does help.

Anyway, I can live with the fact that Ruby is 25% slower on data 
crunching when PostgreSQL can fly. In the future I'll simply use 32bit 
systems for web frontends if my benchmarks confirm this trend.

> On the Athlon, you can figure out what's going on with CodeAnalyst. I 
> would guess it's something to do with cache thrashing or lack thereof. 
> On the Intel, you might be able to get some results from CodeAnalyst 
> -- it's basically a wrapper around "oprofile". But you might end up 
> needing Intel's VTune. If you do this for a living, it's worth 
> spending the money. :)
>

These are CPU-level profiling tools. Since:
- the CPU is the same in and out of UML,
- I certainly don't have access to the performance counters from 
user-mode-linux,
they won't be of much use to me (profiling the behavior of 
user-mode-linux is not what I'm after).

Eventually, when I have time to narrow down the problem myself, I'll 
run strace on the benchmark, study the differences and submit the 
list of system calls to the UML coders, asking why some can be faster on 
UML than on the host kernel.
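
A rough sketch of how that comparison could go, assuming the benchmark is 
run once on the host and once inside UML under "strace -c -f -o <file>", 
which writes a per-syscall summary table to the file. The method names 
parse_strace_summary and compare_summaries are made up for illustration, 
and the parser assumes the usual column layout of that table (% time, 
seconds, usecs/call, calls, optional errors, syscall); other strace 
versions may lay it out differently.

```ruby
# Parse the summary table written by `strace -c` into
# { syscall_name => { seconds: Float, calls: Integer } }.
# Assumed column layout: % time, seconds, usecs/call, calls, [errors], syscall.
def parse_strace_summary(text)
  stats = {}
  text.each_line do |line|
    fields = line.split
    # Data rows start with a numeric "% time" value; this skips the
    # header, the dashed rules and any blank lines.
    next unless fields.size >= 5 && fields[0] =~ /\A\d+\.\d+\z/
    name = fields.last
    next if name == "total" # summary row, not a syscall
    stats[name] = { seconds: fields[1].to_f, calls: fields[3].to_i }
  end
  stats
end

# Return [syscall, host_seconds - uml_seconds, host_calls, uml_calls] rows,
# sorted so the syscalls where the host loses the most time come first.
def compare_summaries(host, uml)
  missing = { seconds: 0.0, calls: 0 }
  rows = (host.keys | uml.keys).map do |name|
    h = host.fetch(name, missing)
    u = uml.fetch(name, missing)
    [name, h[:seconds] - u[:seconds], h[:calls], u[:calls]]
  end
  rows.sort_by { |row| -row[1] }
end
```

With the two summaries saved as, say, host.txt and uml.txt (hypothetical 
file names), compare_summaries(parse_strace_summary(File.read("host.txt")), 
parse_strace_summary(File.read("uml.txt"))) gives a list whose first rows 
are the system calls worth asking the UML coders about. Note that the 
summary only covers time attributed to system calls, so a difference in 
pure user-space time would not show up here.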

Lionel