On Sat, Dec 13, 2008 at 08:33:13PM +0900, Charles Oliver Nutter wrote:
> Very good point; symbols are not necessarily created in the file where  
> you use their literal form, and therefore need to have a single encoding  
> everywhere. I concur.

Unless :p<UTF-8> and :p<US-ASCII> could somehow be the "same" symbol (that
is, send() would find the same method).
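
A quick way to probe this is to build the symbol from strings tagged with different encodings and check object identity. This is only a sketch: same_symbol? is a hypothetical helper, and the result noted in the comment is what I would expect on more recent Ruby releases for ASCII-only names, not something guaranteed for every 1.9 build.

```ruby
# Hypothetical helper: build a symbol from the same bytes tagged with two
# different encodings and report whether both yield the identical Symbol.
def same_symbol?(name, enc_a, enc_b)
  sym_a = name.dup.force_encoding(enc_a).to_sym
  sym_b = name.dup.force_encoding(enc_b).to_sym
  sym_a.equal?(sym_b)
end

# Assumption: ASCII-only names collapse to one symbol, so send() would
# dispatch to the same method either way.
puts same_symbol?("p", "UTF-8", "US-ASCII")
puts 1.send("to_s".dup.force_encoding("UTF-8").to_sym)
```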

Aside: is there a page somewhere which documents in detail the semantics of
Ruby 1.9's Strings and encodings? For example, what are the semantics of
comparing strings with different encodings? Are they compared byte-by-byte,
or character-by-character as Unicode codepoints, or some other way? It
doesn't seem to make a difference here:

irb(main):001:0> a = "abc"
=> "abc"
irb(main):002:0> b = a.dup
=> "abc"
irb(main):003:0> a.encoding
=> #<Encoding:US-ASCII>
irb(main):004:0> b.force_encoding("UTF-8")
=> "abc"
irb(main):005:0> a == b
=> true
irb(main):006:0> b.force_encoding("BINARY")
=> "abc"
irb(main):007:0> a == b
=> true

But it does here:

irb(main):018:0> a = "aß"
=> "aß"
irb(main):019:0> b = a.dup
=> "aß"
irb(main):020:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):021:0> b.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):022:0> a == b
=> false

What if I give the "same" character but from a different encoding?

irb(main):001:0> a = "aß"
=> "aß"
irb(main):002:0> b = "a\xdf"
=> "a\xDF"
irb(main):003:0> b.force_encoding("ISO-8859-1")
=> "a"
irb(main):004:0> a == b
=> false

(I think that's right - both are codepoint 223)
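
The three experiments above can be collected into one script. This is a minimal sketch, not a statement of the documented rules: retag and same_characters? are hypothetical helper names, and the idea of transcoding both sides to a common encoding before comparing is my assumption about how to get a character-level comparison, since == itself does not appear to do that.

```ruby
# Tag the same bytes with a different encoding; no transcoding happens.
def retag(str, enc)
  str.dup.force_encoding(enc)
end

# Compare by character rather than by raw bytes: transcode both sides to a
# common encoding first. Raises if either side cannot be converted.
def same_characters?(x, y, common = Encoding::UTF_8)
  x.encode(common) == y.encode(common)
end

utf8   = "a\u00DF"                     # "a" plus U+00DF, stored as 3 bytes
latin1 = retag("a\xDF", "ISO-8859-1")  # same two characters, stored as 2 bytes
binary = retag(utf8, "BINARY")         # same bytes as utf8, different tag

puts retag("abc", "US-ASCII") == retag("abc", "UTF-8")  # ASCII-only content
puts utf8 == binary                  # identical bytes, different encodings
puts utf8 == latin1                  # identical characters, different bytes
puts same_characters?(utf8, latin1)  # compared after transcoding
```

If character-level equality is what is wanted, transcoding explicitly at least makes the intent visible instead of relying on whatever == does.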

Furthermore, what if I use a String as a key to a hash? It seems the
encoding *is* taken into consideration:

irb(main):025:0> a = "aß"
=> "aß"
irb(main):026:0> h = {a => 99}
=> {"aß"=>99}
irb(main):027:0> b = a.dup
=> "aß"
irb(main):028:0> h[b]
=> 99
irb(main):029:0> b.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):030:0> h[b]
=> nil

But they go onto the same hash chain:

irb(main):031:0> a.hash
=> 565426832
irb(main):032:0> b.hash
=> 565426832
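
The hash-key experiment can be reduced to a small reusable check. Again only a sketch: lookup_after_retag is a hypothetical helper, and I deliberately print the hash values rather than asserting anything about them, because whether two differently tagged strings share a hash value looks like an implementation detail that may vary between Ruby versions.

```ruby
# Hypothetical helper: store a value under `key`, then look it up with a
# copy of the key whose bytes are unchanged but whose encoding tag is `enc`.
def lookup_after_retag(key, enc)
  table = { key => 99 }
  probe = key.dup.force_encoding(enc)
  table[probe]
end

key = "a\u00DF"  # non-ASCII content, tagged UTF-8

p lookup_after_retag(key, "UTF-8")    # same tag as the stored key
p lookup_after_retag(key, "BINARY")   # same bytes, different tag
p lookup_after_retag("abc", "BINARY") # ASCII-only content, different tag

# Hash lookup relies on eql?, so this is the comparison that matters here.
p key.eql?(key.dup.force_encoding("BINARY"))

# Printed for inspection only; not assumed to be equal or different.
p key.hash, key.dup.force_encoding("BINARY").hash
```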

What does 'inspect' do when the string has a particular encoding? And what
does irb do when outputting a string whose encoding is different to that of
the terminal?
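
For the 'inspect' question, a small probe like the following shows what comes back for the same bytes under different tags. It is a sketch under assumptions: inspect_as is a hypothetical helper, and only the BINARY case is something I would expect to be stable, since the output for other encodings appears to depend on the default external and internal encodings of the running process.

```ruby
# Hypothetical helper: retag a copy of the string and return its inspect form.
def inspect_as(str, enc)
  str.dup.force_encoding(enc).inspect
end

sample = "a\u00DF"

# Bytes above 0x7F in a BINARY string come back as \xNN escapes.
puts inspect_as(sample, "BINARY")

# Depends on Encoding.default_external / Encoding.default_internal.
puts inspect_as(sample, "UTF-8")
puts Encoding.default_external
```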

Not understanding these rules makes me very uncomfortable.

ri documentation seems to be pretty silent on these points:

-------------------------------------------------------------- String#==
     str == obj   => true or false

     From Ruby 1.9.1
------------------------------------------------------------------------
     Equality---If _obj_ is not a +String+, returns +false+. Otherwise,
     returns +true+ if _str_ +<=>+ _obj_ returns zero.


------------------------------------------------------------- String#<=>
     str <=> other_str   => -1, 0, +1

     From Ruby 1.9.1
------------------------------------------------------------------------
     Comparison---Returns -1 if _other_str_ is less than, 0 if
     _other_str_ is equal to, and +1 if _other_str_ is greater than
     _str_. If the strings are of different lengths, and the strings are
     equal when compared up to the shortest length, then the longer
     string is considered greater than the shorter one. In older
     versions of Ruby, setting +$=+ allowed case-insensitive
     comparisons; this is now deprecated in favor of using
     +String#casecmp+.

     +<=>+ is the basis for the methods +<+, +<=+, +>+, +>=+, and
     +between?+, included from module +Comparable+. The method
     +String#==+ does not use +Comparable#==+.

        "abcdef" <=> "abcde"     #=> 1
        "abcdef" <=> "abcdef"    #=> 0
        "abcdef" <=> "abcdefg"   #=> -1
        "abcdef" <=> "ABCDEF"    #=> 1


As I say, if there's some more detailed documentation, please could you
point me in the right direction?

Thanks,

Brian.