On Sat, Dec 13, 2008 at 22:57, Michael Selig <michael.selig / fs.com.au> wrote:
> From my testing:
> - String equality comparisons seem to be simply done on a byte-by-byte
> basis, without regard to the encoding

Am I misinterpreting something here?

  u = "café".encode("utf-8")
  b = u.dup.force_encoding("binary")
  i = u.dup.force_encoding("iso-8859-1")
  u == b     # => false
  b == i     # => false
  u == i     # => false
  u.eql?(b)  # => false

> - There is also a concept of "compatible encodings". Given 2 encodings
> e1 & e2, e1 is compatible with e2 if the representation of every
> character in e1 is the same as in e2. This implies that e2 must be a
> "bigger" encoding than e1 - ie: e2 is a superset of e1. Typically we are
> mainly talking about US-ASCII here, which is compatible with most other
> character sets that are either all single-byte (eg: all the ISO-8859
> sets) or are variable-length multi-byte (eg: UTF-8).
> - When operating on encodings e1 & e2, if e1 is compatible with e2, then
> Ruby treats both strings as being in encoding e2.

I only knew of ASCII-compatibility. Are there other cases? ISO-8859-1
and Windows-1252 (a superset) at least are not compatible:

  i = "café".encode("iso-8859-1")
  w = "café".encode("windows-1252")
  i == w  # => false
  i + w   # Encoding::CompatibilityError
  w + i   # Encoding::CompatibilityError

On Sat, Dec 13, 2008 at 12:01, Brian Candler <B.Candler / pobox.com> wrote:
(...)
> But they go onto the same hash chain:
>
> irb(main):031:0> a.hash
> => 565426832
> irb(main):032:0> b.hash
> => 565426832

This one's interesting. I guess avoiding collisions would be a Good
Thing, but we still must maintain ASCII compatibility, and we don't
always know the ascii_only state of a String. Computing it when
computing the hash of a String does not sound like a bad idea to me,
but if there are more complex encoding compatibility combinations, then
this whole idea starts to get pretty hard.
Anyway, keeping the hash as it is now should have, I hope, very few
collisions in the Real World. Most applications will remain in
single-encoding land, and even multilingual ones should hardly need to
store the very same byte sequence in multiple encodings as keys in a
single Hash.

-- Daniel
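
For anyone who wants to reproduce the comparisons discussed above, here is a minimal sketch. The helper name `compare` is hypothetical and not from this thread; it assumes Ruby 1.9 or later. It only reports what your interpreter does, and the result for `hash` in particular may differ between Ruby versions, so no expected values are shown for it.

```ruby
# Hypothetical helper (not from the thread): summarizes how two strings
# relate in terms of bytes, equality, hashing and encoding compatibility.
def compare(a, b)
  {
    :same_bytes => a.bytes.to_a == b.bytes.to_a,
    :equal      => a == b,
    :eql        => a.eql?(b),
    :same_hash  => a.hash == b.hash,
    # Encoding.compatible? returns an Encoding, or nil if none fits.
    :compatible => Encoding.compatible?(a, b)
  }
end

# Non-ASCII content; the \u escape makes the literal UTF-8.
u = "caf\u00E9"
b = u.dup.force_encoding("binary")      # same bytes, different encoding tag
i = "caf\u00E9".encode("iso-8859-1")
w = "caf\u00E9".encode("windows-1252")  # same bytes as i

# ASCII-only content in two ASCII-compatible encodings.
a1 = "abc".encode("utf-8")
a2 = "abc".encode("iso-8859-1")

p compare(u, b)
p compare(i, w)
p compare(a1, a2)
```

On the interpreters I would expect this to run on, the first two pairs have identical bytes yet compare unequal, while the ASCII-only pair compares equal and therefore has to share a hash value. Whether the non-ASCII pairs also share a hash value is exactly the implementation detail under discussion here.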