On Sat, Dec 13, 2008 at 22:57, Michael Selig <michael.selig / fs.com.au> wrote:
> From my testing:
> - String equality comparisons seem to be simply done on a byte-by-byte
> basis, without regard to the encoding

Am I misinterpreting something here?

u = "café".encode("utf-8")
b = u.dup.force_encoding("binary")
i = u.dup.force_encoding("iso-8859-1")
u == b # => false
b == i # => false
u == i # => false
u.eql?(b) # => false
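
For what it's worth, the bytes are identical in all of these; as far as I can tell, it is the combination of a different encoding tag and non-ASCII content that makes them unequal. A small sketch of that, assuming Ruby 1.9 or later (the variable names are mine, and I use a \u escape so the result doesn't depend on the source file's encoding):

```ruby
u = "caf\u00E9".encode("utf-8")
b = u.dup.force_encoding("binary")

# Same bytes, different encoding tag, non-ASCII content.
same_bytes      = (u.bytes.to_a == b.bytes.to_a)
equal_non_ascii = (u == b)

# ASCII-only content, same bytes, different ASCII-compatible encodings.
a1 = "cafe".encode("utf-8")
a2 = a1.dup.force_encoding("iso-8859-1")
equal_ascii = (a1 == a2)

puts same_bytes, equal_non_ascii, equal_ascii
```

Here same_bytes and equal_ascii should come out true and equal_non_ascii false, which is why a pure byte-by-byte comparison doesn't seem to be the whole story.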

> - There is also a concept of "compatible encodings". Given 2 encodings e1 &
> e2, e1 is compatible with e2 if the representation of every character in e1
> is the same as in e2. This implies that e2 must be a "bigger" encoding than
> e1 - ie: e2 is a superset of e1. Typically we are mainly talking about
> US-ASCII here, which is compatible with most other character sets that are
> either all single-byte (eg: all the ISO-8859 sets) or are variable-length
> multi-byte (eg: UTF-8).
> - When operating on encodings e1 & e2, if e1 is compatible with e2, then
> Ruby treats both strings as being in encoding e2.

I only knew of ASCII-compatibility. Are there other cases? ISO-8859-1
and Windows-1252 (a superset) at least are not compatible:

i = "café".encode("iso-8859-1")
w = "café".encode("windows-1252")
i == w # => false
i + w # Encoding::CompatibilityError
w + i # Encoding::CompatibilityError
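
The same check can be made without rescuing an exception by asking Encoding.compatible? directly. A rough sketch, assuming Ruby 1.9 or later; `plain` is just a name I picked for an ASCII-only string:

```ruby
i     = "caf\u00E9".encode("iso-8859-1")
w     = "caf\u00E9".encode("windows-1252")
plain = "abc".encode("windows-1252")   # ASCII-only content

# Both strings contain non-ASCII bytes and carry different encodings.
both_non_ascii = Encoding.compatible?(i, w)

# One side is ASCII-only, so the two can be combined.
one_ascii_only = Encoding.compatible?(i, plain)
joined = i + plain

p both_non_ascii, one_ascii_only, joined.encoding
```

both_non_ascii should be nil, while the ASCII-only case yields the encoding of the non-ASCII operand (ISO-8859-1 here), and the concatenation goes through. So the only compatibility I can observe is the ASCII-only kind, not a superset relation between the two encodings.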


On Sat, Dec 13, 2008 at 12:01, Brian Candler <B.Candler / pobox.com> wrote:
(...)
> But they go onto the same hash chain:
>
> irb(main):031:0> a.hash
> => 565426832
> irb(main):032:0> b.hash
> => 565426832

This one's interesting. I guess avoiding collisions would be a Good
Thing, but we still must maintain ASCII compatibility, and we don't
always know the ascii_only state of a String. Computing it when
computing the hash of a String does not sound like a bad idea to me,
but if there are more complex encoding compatibility combinations,
then this whole idea starts to get pretty hard.
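
To make the idea concrete, here is a minimal sketch of an encoding-aware key done in plain Ruby. encoding_aware_key is a hypothetical helper of mine, not anything in the interpreter, and it only handles the ASCII-only case, not any richer notion of compatibility:

```ruby
# Hypothetical helper, not part of Ruby: builds a key that ignores the
# encoding for ASCII-only strings and includes it otherwise.
def encoding_aware_key(str)
  bytes = str.dup.force_encoding("binary")
  str.ascii_only? ? [bytes, nil] : [bytes, str.encoding]
end

u  = "caf\u00E9".encode("utf-8")
b  = u.dup.force_encoding("binary")
a1 = "cafe".encode("utf-8")
a2 = a1.dup.force_encoding("iso-8859-1")

# Same bytes, non-ASCII, different encodings: the keys differ.
non_ascii_keys_differ = encoding_aware_key(u) != encoding_aware_key(b)

# ASCII-only: the encoding is dropped, so the keys match.
ascii_keys_match = encoding_aware_key(a1) == encoding_aware_key(a2)

table = { encoding_aware_key(u) => :utf8, encoding_aware_key(b) => :binary }
puts non_ascii_keys_differ, ascii_keys_match, table.size
```

The cost is the ascii_only? scan on every key, which is the part that would have to be computed (or cached) at hashing time.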

Anyway, keeping the hash as it is now should have, I hope, very few
collisions in the Real World. Most applications will remain in
single-encoding land, and even multilingual ones should hardly need to
store the very same byte sequence in multiple encodings as keys in a
single Hash.

--
Daniel