Cool.  I didn't expect the improvement for largely single-threaded
workloads.  I'm not sure if it's feasible, but it might be
better to detect the cache line size at runtime with:

	sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

That's available at least on glibc-based systems (the constant is a
glibc extension, not POSIX).  But 64 bytes is a good default nowadays.
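
A rough sketch of what I mean, falling back to 64 when the constant
is missing or the lookup doesn't give a usable answer.  The names
cache_line_size() and DEFAULT_CACHE_LINE_SIZE are made up for
illustration and aren't from the patch; I'm also assuming sysconf()
may return 0 or -1 when the value is unknown:

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical fallback; 64 bytes matches most current hardware. */
#define DEFAULT_CACHE_LINE_SIZE 64

/*
 * Return the L1 data cache line size in bytes, or the default
 * if it cannot be determined at runtime.
 */
static size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
	/* glibc extension; may yield 0 or -1 if the size is unknown */
	long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

	if (sz > 0)
		return (size_t)sz;
#endif
	return DEFAULT_CACHE_LINE_SIZE;
}
```

The #ifdef keeps it building on non-glibc systems, where it just
uses the default.  This only helps if the padding can be sized at
runtime; anything relying on a compile-time alignment attribute
would still need the constant.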

I seem to recall encountering some P4-based Xeons with 128-byte cache
lines, but those are probably obsolete/rare enough to not matter.