Hi, all! I've opened a issue at rubyforge:
http://rubyforge.org/tracker/?func=detail&aid=18507&group_id=426&atid=1698
It is said the issue will be automatically post
to ruby-core ml, but I didn't see it here. So I post this
manually.

== Summary
Ruby String hash key overflow when converting to Fixnum.


== How to reproduce?
Run the ruby code below:

    class MyStr
      def initialize(str)
        @str = str
      end
      def hash
        @str.hash
      end
      def eql?(o)
        @str.eql?(o)
      end
    end

    s1 = "foo"
    s2 = "This"
    my_s1 = MyStr.new(s1)
    my_s2 = MyStr.new(s2)
    h = { s1 => "value of foo", s2 => "value of This" }

    puts "h[my_s1] = #{h[my_s1].inspect}"
    puts "h[my_s2] = #{h[my_s2].inspect}"


== Expected output:

    h[my_s1] = "value of foo"
    h[my_s2] = "value of This"


== Actual output:

    h[my_s1] = "value of foo"
    h[my_s2] = nil


== My environment:

* ruby 1.8.6 (2007-09-24 patchlevel 111) [i686-linux]
* Linux arch 2.6.24-ARCH #1 SMP PREEMPT Sun Feb 10 15:21:33 UTC 2008
i686 Intel(R) Core(TM)2 CPU T5600 @ 1.83GHz GenuineIntel
GNU/Linux

== Why this bug?

Ruby Hash is dealing with String specially. In hash.c:88 rb_any_hash,
we can see that if key is a String, its hash is calculated directly by
rb_str_hash:

    switch (TYPE(a)) {
      case T_FIXNUM:
      case T_SYMBOL:
      return (int)a;
      break;

      case T_STRING:
      return rb_str_hash(a);
      break;

But what's the case when we're calling hash in Ruby (not C)? See
string.c:898, rb_str_hash_m, int is converted to Fixnum:

    int key = rb_str_hash(str);
    return INT2FIX(key);

The calculated hash value of the word "This" is 2073740424, which
exceed the maximum value of a Ruby Fixnum. When converted it becomes
-73743224. You can see it in IRB:

    irb(main):001:0> "This".hash
    => -73743224

That's why h[my_s2] returns nil. I'm stored it using s2 which is a
string. Ruby calculate its hash directly in C (without converting to
Fixnum and then back to int) which produce 2073740424. But when I
later fetch it, the hash calculated is -73743224. So I get nothing.


== Discussion

Someone might argue that the key to store is in fact "not the same" to
which I'm using to fetch. Will this be the desired behavior of Ruby
Hash? I just assume as long as the key produce the same hash value and
defined eql? properly, I can use them as Ducks (as in Duck-Typing).

In fact, in one of my project, I'm constructing a lot of temp small
Strings just to look up some value in a Hash. Those Strings are all
sub-string of a large piece of text. So I made a Substring class
(which stores reference to the text and the sub-string interval) to
avoid copying all those small Strings. The problem is: I'm storing
with normal Ruby String, but fetching with my own customized
Substring, and I get nothing.

== Patch

There're multiple ways to fix this problem. A more "proper" way is to
use INT2NUM instead of INT2FIX to avoid overflow. We can get the
"correct" hash value of a String in Ruby (which may be a
Bignum). However, this might be ineffecient. Constructing a Bignum
need time and memory, and an overhead of calling rb_funcall may be
introduced in rb_any_hash (hash.c:106)

    if (!FIXNUM_P(hval)) {
        hval = rb_funcall(hval, '%', 1, INT2FIX(536870923));
    }

Another more efficient way is to perform truncation when calculating
String hash to make the result meet what we geet in Ruby world. Here's
the patch:


Index: hash.c
===================================================================
--- hash.c      (revision 15640)
+++ hash.c      (working copy)
@@ -97,7 +97,8 @@
        break;

       case T_STRING:
-       return rb_str_hash(a);
+       /* do a shift to make it like being converted to Fixnum and back */
+       return (rb_str_hash(a)<<1>>1);
        break;

       default: