Aleksi Niemel<aleksi.niemela / cinnober.com> wrote: > >matz says: > > >Hash checks equality of keys by `hash' and `eql?'. > > >Try to define both. > > > > >Agnot asks: > > Ok, thanks. But for me its a violation of "Principle of Least > > Surprise". > > Can someone give a rationale? > >It's pretty common for Hash to require these two things. With the method >hash you can ask "unique" code for each object, thus achieve O(n) searches. >OTOH, you seldom can generate quickly a meaningful hash code which doesn't >give duplicates; those are called collisions. I guess it's quite common to >first find a right list of key-value pairs through a hash code, and then >the >right pair inside the list through some equality check. > >IMHO it's also wise to define eql? instead of blindly use ==, as you might >want the hash comparison to be different from normal comparison. The above explanation makes more sense if you understand what a hash is. Probably most of this list does, but for those who don't here is an explanation. (If you understand hashes then there is no point reading on. If you never really knew what they do and how they work...) Suppose that I want to be able to keep track of things that I have seen and be able to find them again. How would I do it? Well the simplest approach is to just keep a list of what I have seen. Whenever I want to look something up I just scan my list and see if I have it. If I don't I add it to the list. This is simple but quickly becomes slow. So what do I do? Well the obvious solution is have some sort of structured way to hold the data which makes lookup fast. For instance I might file it away alphabetically. I might keep it in a sorted list and do a binary lookup. I might store the data in a tree. I might store the data in a tree and then do some smart stuff to make sure that the tree is fairly well balanced. etc. All of these solutions run into various trade-offs. In general most of the really obvious structured approaches involve either making lots of assumptions, or they involve making O(log(n)) comparisons. If you have a hundred thousand things I far prefer making 18 comparisons than 50,000 on average, but we can still do better on most datasets. Hashes are an example of, "Do something random and then depend on chance". In a hash you have buckets you throw your data in. Generally you try to have somewhat more buckets than data. (Too much more and you waste space. Too little and you get too much in each bucket. What most people do is every so often resize the hash and move things to their real buckets.) You have a hashing function that produces a number. Take that number mod the number of buckets, and you know which bucket to toss it in. Toss it. When you want to know if you have seen something before, you just figure out which bucket it would have been tossed in, and rummage around in that bucket. The laws of probability tell us that most of the time that bucket has little in it. (Actually it tells us that you have a Poisson distribution.) So usually you only have to do something like 1-2 comparisons. (Actually it is better than that. Keep track of the hash value you had, and compare hash values first. Most of the time that is a very fast comparison. So the full equality check is usually just a final sanity check.) But the catch is that if your hashing function has lots of collisions, then you slow down. Oops. But there are well-known hash functions that tend to work very well in practice. So if your hash function doesn't take too long to compute, this is a good strategy. I don't know what hash function Ruby uses. The one that Perl uses works something like this: def hash (string) hash_val = 0 for byte in string.each_byte do hash_val = 33*hash_val + byte end hash_val += hash_val/32 end (Except using integer operations done in C.) Cheers, Ben _________________________________________________________________ Get your FREE download of MSN Explorer at http://explorer.msn.com