"gabriele renzi" <surrender_it / remove.yahoo.it> wrote in message
news:9ed46v0r2nn4r7ehchcl6s4ci8t7ccl23h / 4ax.com...
> il Sun, 2 Mar 2003 17:12:28 +0100, "MikkelFJ"
> <mikkelfj-anti-spam / bigfoot.com> ha scritto::

> >This is because a cache-miss is expensive (many hundred
> >instruction cycles). An extra memory access in a small hash table is
likely
> >to happen in memory that is already conveniently in cache.
>
> I don't grok what you mean talking about the cache :(

I'll try to explain although I'm not sure what part you are missing. You are
probably aware of most that I cover:

1) how expensive memory access can be
2) when memory access happens.

1) I'm no expert on cache systems, but there are several layers of cache
starting from virtual on-disk memory through TLB hits/misses (the virtual
memory addressing system) to 2nd level cache, 1st level cache and actual CPU
registers. The further down the pipeline, the more costly the access. I
believe an in-memory cache miss can range from hundreds to thousands of
missed instructions and paged memory is out of the scale. In comparioson, a
major cost of process context switching seems to be the flush of the TLB
cache. Memory in 1st level cache may be as fast as CPU registers.

2) There are two major ways of avoiding cache penalties: a) keeping you data
smaller such that more fits into a cache closer to the CPU. b) locality of
reference - i.e. the next access is cheap of close to a previous access and
therefore will be loaded into fast cache along with the first access.

There are differnt hash table implementation strategies, but a hash works by
spreading data all over the place.

First a bucket is located in a bucket table. This is a once per lookup
operation. If the bucket table is small and there are many lookups, there is
a good chance that the next lookup will find a bucket in fast cache and
avoid many hundres worth of instructions missed on failed cache.

Once the bucket is located, there is a list of possible matches, all with
same hash key. The better the hash, the shorter the list. The list is
typically searched from head to tail. Each visit risks a cache miss. The
good thing is that all entries can be stored close to each other in an
array, so the risk of a cache miss is proportional to the actual data
stored. This is contrary to the bucket table where the risk is proportional
to the total number of buckets. Fortunately the bucket table can be as small
as a single pointer. It therefore makes sense to use about as much memory on
buckets as on stored hash entries. If the entry table is allocated in one
block, it will on average be 50%-75% depending on expansion scheme. Thus an
easy option is to set the bucket table to 50% of entry table, but this can
only be tuned by careful performance measures as there are many factors. A
bucket size the same as the average load of the entry table will on average
give 1 entry per bucket and have equal load on the cache as the entry
table - a larger bucket table may just consume too much memory and reduce
cache effiency - but this also depends on the quality of the hash key.

The next cache problem, is the number of key comparison operations. If the
bucket size is chosen as mentioned above, we will only expect slightly more
than one entry per bucket on average, although collisions will occur -
especially with a bad hash. For the purpose of illustration, assume a
collision list on average is 4 entries long, then on average 2 keys must be
examined for each lookup. If the key is small enough to be stored in the
hash entry itself, this comparison is cheap. If the key is stored in
external data, each key may potentially be a cache miss as well. An external
key may also be slow to compare for other reasons than potential cache miss:
long string comparions, function calls to compare function etc., so
collisions should be avoided.

One technique to avoid accessing external keys more often than necessary is
to compare the full hash key before visiting the key: Typically a hash key
is 32 bit and the crunched to bucket table size either using modulo prime or
in the case of Jenkins, the much faster modulo power of two. By storing the
full hash key in the hash entry, the collision list is effectively made much
shorter because the risc of collisions in the full hash is much smaller than
the risc of collisions in the full hash. It is very cheap to compare the
full hash if it is the original output of the hash operation and fits a
machineword (or two). We still may need to visit all entries in the
collision list, but probably only compare a single external key.
Storing the hash key in the hash entry makes the entry larger and therefore
increases the risc of cache misses. Storing the hash key has another
benefit: growing the hash table can avoid recalculating the hash keys. A
typical hash table entry would therefore be:

<hash-key, key or link to key, data, next-collison link>

or the minimum: <link to key and data, next-collision link>

Or to put a long story short: CPU is much faster than memory, so any
performance optimization should first and foremost minimize the number of
memory accesses that are far apart. If the typical case is access to a
single location, or access to data in the same cache, a complex algorithm
reducing memory accesses may be too expensive. Notably a tight loop is
faster than most things. Code is also subject to cache misses in the
instruction pipeline so an algorithm can get too complex.

> It seems to me that 1-at-a-time didn't depend on accessing memory more
> than jenkins.

This is true. As soon as the first bit of the key is peeked at, the entire
key is probably already in fast cache. This is an example of locality of
reference. Therefore it probably doesn't really matter that much. You are
looping more frequently but then there are predictive branching and Jenkins
has other overhead. Loop unrolling is difficult on variable length keys, but
the compiler might unroll one-at-time to handle an entire word per loop.
Thus the result is blurred. As long as the end result is a high quality hash
 and the hash isn't too slow, it may be more important how many collisions
can be avoided.

You simply have to performance test.

> Using a big hash would stress the caching system, but we won't be
> measuring the hash performance .

Not in a major way, perhaps you are missing the point?
I'm talking about using the 32bit (machine word) that is typically resulting
from a hash key, not a full MD5 signature (although in some cases that might
be relevant). It is true that storing the hash key takes up more memory and
it may not necessarily pay off as discussed above.

> Probably I misunderstood something, would you please explain me what
> problem this algorithm could have with the cache stuff?

The quality of the hash key affects the number of memory accesses and
therefore the risc of cache miss.

> I agree that maybe jenkins could exploit modern CPU superscalar arch,
> but we miss the ability to inline the code reducing the number of
> operations from  9n to 5n.

Perhaps - but you could specialize the hash for short keys as I did for 32
bit. A lot of keywords are short. Jenkins did create the core Mix operation
as a macro. But your argument may be valid and performance tests are, as
usual, in place. Indeed I have had a very fast datastructure loose 50% or
more of its performance on a lost inline opportunity in the lookup method
(new processors may change this - IA64 have fast function calls using
rolling stack).

One thing that a typical performance test will not capture is the number of
cache misses in real life operations because the test is constanly banging
against the same datastructure and therefore downplays the risk of cache
misses. This will give a worse but faster hash key an advantage, but it is
equally likely to overestimate the need for large hash tables where the
quality probably matters the most.

> And,
> (I didn't passed passed my CPU-related exam so good, so maybe I'm
> wrong) having the work on 1 char per time won't damage the
> parallelism, the cpu could work on more than one char at a time
> anyway.

I'm not up to speed on the latest CPU architectures, so I can only produce a
qualified guess. I believe your point is not an unrealistic assumption for
modern achitectures. Some older processors perform much worse on unaligned
memory access (and some do not allow it, forcing the compiler to insert
shift and mask instructions in the code). And some operate only on 8 bit
anyway (microcontrollers, still going strong). The faster the processor, the
less important the parallelism is (and hash function performance generally)
because it is going to wait on the memory.

By the way, speaking of achitectures, someone made the point that DDR RAM
sometimes result in poor performance because it supposedly isn't very
effective at filling meeting random access requests. That is, streamlined
access as in video streaming is fast, but hash table lookups may be poor.

Also, don't forget to add the cost of taking modulo prime on the final hash
result in the non-Jenkins case. The modulo prime can be expensive. If the
search key is short and the lookup hits perfect cache conditions, the
modulos may take significant time - but then the latest processors may
already have changed that.

> It seems that 1-at-a-time scales well, and that it is actually better
> than what we have in current ruby,  at least it got less collision.

I thought Jenkins carefully chose his function to produce fewer collisions,
but that may be a tradeoff with avoding the use of modulo prime.

> And well, the perl guys won't put it in 5.8 if it was'nt someway
> better than the old :)

Again, the hash key calculation cost is probably only the significant factor
significant for small frequently used small hash tables.

In conclusion there are so many different operating conditions, that it is
difficult to predict what hash function will be best, or even if a hash
table is the best overall.

I have been working a lot with variants of B-Trees because they are more
than adequately fast in most scenarios and scale pretty well when cache
constraints starts to be significant, even (or especially) on-disk if
allocation is done appropriately. A lot of people claim that the much
simpler skip-list datastructure perform better than B-Trees. It doesn't.
Someone made the counter-argument that B-Trees are faster due to the poor
cache behavior of skip-lists and backed it up with tests. I reproduced the
tests by implemting a skiplist and compared to a B-Tree implementation I had
already made. The scan through a buffer of a small-fanned B-Tree node is
fast both in terms of CPU and memory cache. Therefore I have great respect
for cache behavior in datastructures.

Mikkel