Summary
=============================================================
A brief analysis of Ruby versions of a few string hashing algorithms,
and my conclusions about their efficacy at preventing collisions.


Background
=============================================================
At work, we have some (young) code which is currently doing a lot of
string compares, which needs to be really fast. We're about to try
using hashes of the strings to represent each string instead, but
instead of a traditional hash table (which 'chains' collisions for the
same key) we're going to require that no two (different) strings may
share the same hash key.

As a result, I wanted to do a preliminary investigation into various
string hashing algorithms, to see how limiting that would be. (How
often would we need to rename a bunch of our strings just to get around
collisions?) Given the nature of hashing algorithms, it was important
that this be based on ~real world strings.


Methodology
=============================================================
I wrote a ruby script to scan all our (C++) source code and find ANY
words that were valid C++ identifiers. This also meant that I grabbed
all such words from inside comment blocks, as well as C++ keywords, but
that was OK for this back-of-the-envelope testing. I ended up with
almost 43,000 unique identifiers.

I then ran through the list of words and used 7 different string
hashing algorithms (which I ported from the C code I found on a web
page[1]) to compute different hash keys. I used a Set (per hashing
algorithm) to store these keys, and at the end compared the number of
keys in the Set with the number of identifiers fed in.


The Results
=============================================================
Three of the hashing algorithms ended up with no collisions whatsoever,
and one of these (SDBM) is really simple and fast (in C).

Here is the output of my test:
djb  :: 99.8601 percent coverage (60 collisions out of 42884)
elf  :: 99.5430 percent coverage (196 collisions out of 42884)
sdbm :: 100.0000 percent coverage (0 collisions out of 42884)
js   :: 99.9044 percent coverage (41 collisions out of 42884)
ap   :: 100.0000 percent coverage (0 collisions out of 42884)
bkdr :: 99.9953 percent coverage (2 collisions out of 42884)
rs   :: 100.0000 percent coverage (0 collisions out of 42884)


The Algorithms
=============================================================
module HashEm
  SIGNEDSHORT = 0x7FFFFFFF

  def self.rs( str, len=str.length )
    a,b = 63689,378551
    hash = 0
    len.times{ |i|
      hash = hash*a + str[i]
      a *= b
    }
    hash & SIGNEDSHORT
  end

  def self.js( str, len=str.length )
    hash = 1315423911
    len.times{ |i|
      hash ^= ( ( hash << 5 ) + str[i] + ( hash >> 2 ) )
    }
    hash & SIGNEDSHORT
  end

  def self.elf( str, len=str.length )
    hash = 0
    x = 0
    len.times{ |i|
      hash = (hash << 4) + str[i]
      if  (x = hash & 0xF0000000) != 0
        hash ^= (x >> 24)
        hash &= ~x
      end
    }
    hash & SIGNEDSHORT
  end

  def self.bkdr( str, len=str.length )
    seed = 131    # 31 131 1313 13131 131313 etc..
    hash = 0
    len.times{ |i|
      hash = ( hash*seed )+str[i]
    }
    hash & SIGNEDSHORT
  end

  def self.sdbm( str, len=str.length )
    hash = 0
    len.times{ |i|
      hash = str[i] + ( hash << 6 ) + ( hash << 16 ) - hash
    }
    hash & SIGNEDSHORT
  end

  def self.djb( str, len=str.length )
    hash = 5381
    len.times{ |i|
      hash = ((hash << 5) + hash) + str[i]
    }
    hash & SIGNEDSHORT
  end

  def self.ap( str, len=str.length )
    hash = 0
    len.times{ |i|
      if (i & 1) == 0
        hash ^= (hash << 7) ^ str[i] ^ (hash >> 3)
      else
        hash ^= ~( (hash << 11) ^ str[i] ^ (hash >> 5) )
      end
    }
    hash & SIGNEDSHORT
  end
end



[1] http://www.code-source.org/algorithm.php?id=62