"Mauricio FernŠŌdez" <batsman.geo / yahoo.com> wrote in message
news:20030302083833.GA8115 / student.ei.uni-stuttgart.de...

> > it seems to me that the adoption of one-at-a-time hashes doesn't

What does "one at time" mean? One character at a time as opposed to Jenkins
4 characters at a time?
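
If it is the former: the usual one-at-a-time loop consumes a single byte per
round, roughly like the sketch below (my own wording of the well-known scheme,
not Ruby's actual code), whereas the Jenkins hash reads the key several bytes
at a time and mixes them in registers.

    #include <stddef.h>

    /* Sketch of a one-byte-per-round hash loop. */
    unsigned int
    one_at_a_time(const unsigned char *key, size_t len)
    {
        unsigned int h = 0;
        size_t i;
        for (i = 0; i < len; i++) {
            h += key[i];              /* one character per iteration */
            h += (h << 10);
            h ^= (h >> 6);
        }
        h += (h << 3);
        h ^= (h >> 11);
        h += (h << 15);
        return h;
    }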

>  * Jenkins should be even worse as the additive constant to its
>    complexity is higher
>  * however, a perf. increase of 1-3% can be achieved by setting the
>    number of bins in the hash to be a power of 2, so that we can later
>    use bitwise AND instead of the modulus operator [I assume our hash is
>    good enough not to increase the num. of collisions a lot]

There is some extra work going on in the Jenkins hash, but it happens in the
best possible location: inside the processor, and it at least attempts to
exploit parallelism. Modern processors are much faster than memory. If the hash
avoids an additional memory access every so often, it may pay off decently. I
would estimate that the benefits of any good hash will only really show up in
large hash tables. This is because a cache miss is expensive (many hundreds of
instruction cycles). An extra memory access in a small hash table is likely
to happen in memory that is already conveniently in cache.

If the number of buckets is large you get fewer collisions, but you also
increase the risk of cache misses.

This must, however, be weighed against the cache misses of reading and
comparing the actual keys if they are not stored inside the bucket itself.

If the full 32-bit (or whatever) hash is compared first (i.e. the full hash
value is stored in the bucket), the bucket count does not affect the number of
external keys that must be accessed - only the quality of the hash itself does.
This technique also makes it cheaper to perform a rehash operation when
expanding the bucket table.
(I sent a copy of my hashtable in private mail.) It only operates on 32-bit
keys (e.g. storing pointers to already-internalized strings), but it uses the
above-mentioned principle of storing the hash and could be extended to longer
keys without much effort.
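
To make the stored-hash idea concrete, an entry and a lookup might look roughly
like this (a sketch with made-up names, not the table I mailed):

    #include <string.h>

    struct entry {
        unsigned int  hash;   /* full 32-bit hash, kept for compare and rehash */
        const char   *key;    /* pointer to the external, possibly out-of-cache key */
        void         *value;
        struct entry *next;
    };

    /* Look up 'key' whose full hash 'h' the caller already computed.
       The external key is only touched when the stored hash matches,
       so the number of key accesses depends on hash quality, not on
       the bucket count. */
    static void *
    lookup(struct entry **bins, unsigned int nbins, unsigned int h,
           const char *key)
    {
        struct entry *e;
        for (e = bins[h & (nbins - 1)]; e != NULL; e = e->next)  /* nbins = 2^k */
            if (e->hash == h && strcmp(e->key, key) == 0)
                return e->value;
        return NULL;
    }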

Avoiding the modulus by a prime is good both because it skips the calculation
and because it makes it much easier to scale the bucket count on demand - thus
avoiding large initial bucket tables - which in turn makes collision resolution
cheaper, although more frequent.
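
In code the two halves of that argument come down to something like this
(again just a sketch, names made up):

    /* Index with a mask instead of a modulus; requires nbins to be 2^k. */
    static unsigned int
    bin_index(unsigned int hash, unsigned int nbins)
    {
        return hash & (nbins - 1);    /* picks the same bin as hash % nbins */
    }

    /* Growing on demand is then just a doubling of nbins followed by a
       rehash; starting small and doubling keeps the initial table small,
       and the stored full hash makes the rehash itself cheap. */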

For small hashtables a good but more expensive hash function may not pay off.
When the table is frequently accessed, everything, including the keys, will be
in cache, and the statistical spread over the buckets is probably unpredictable
anyway.

Mikkel