On Sun, 14 Dec 2008 01:01:44 +1100, Brian Candler <B.Candler / pobox.com> wrote:

> For example, what are the semantics of comparing strings with
> different encodings? Are they compared byte-by-byte, or
> character-by-character as unicode codepoints, or some other way?

Yes, I agree this needs to be documented a lot better than it is at the
moment. I also think that some of the behaviour is a little "unexpected" :)
though this is only in unusual cases.

From my testing:

- String operations are done using the bytes in the strings - they are not
  converted to codepoints internally.
- String equality comparisons seem to be simply done on a byte-by-byte
  basis, without regard to the encoding.
- *However* other operations are not simply byte-by-byte. They are done
  character-by-character, but without converting to codepoints - eg: a
  3-byte character is kept as 3 bytes. This means that on a variable-length
  encoding, simple operations like indexing can be inefficient, as Ruby may
  have to scan through the string from the start. However Ruby does try to
  optimize this where possible.
- There is also a concept of "compatible encodings". Given 2 encodings e1 &
  e2, e1 is compatible with e2 if the representation of every character in
  e1 is the same as in e2. This implies that e2 must be a "bigger" encoding
  than e1 - ie: e2 is a superset of e1. Typically we are mainly talking
  about US-ASCII here, which is compatible with most other character sets
  that are either all single-byte (eg: all the ISO-8859 sets) or are
  variable-length multi-byte (eg: UTF-8).
- When operating on encodings e1 & e2, if e1 is compatible with e2, then
  Ruby treats both strings as being in encoding e2.
- String#> and String#< are a bit weird. Normally they are just done on a
  byte-by-byte basis, UNLESS the two strings have the same content but
  incompatible encodings, in which case they always seem to return false.
  (I have to check this - it may be more complicated than this.)
- When operating on incompatible encodings, *normally* non-comparison
  operations (including regexp matches) raise an "Encoding Compatibility
  Error" (Encoding::CompatibilityError).
- However there appears to be an exception to this: if operating on 2
  incompatible encodings AND US-ASCII is compatible with both, AND both
  strings contain only US-ASCII characters, then the operation appears to
  proceed, treating both as US-ASCII. For example "abc" as ISO-8859-1 and
  "abc" as UTF-8. I guess this is Ruby being "forgiving". (Personally I am
  not sure if this is good or bad.) The encoding of the result (for example
  of a string concatenation) seems to be one of the 2 original encodings -
  I haven't figured out the logic to this yet :)

James - feel free to use any of the above to add to your excellent M17N
summary.

Cheers
Mike
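
A minimal sketch of the behaviour listed above, assuming Ruby 1.9 or later.
concat_outcome is a made-up helper name for illustration, not part of Ruby.
The expected value of the last concatenation is deliberately left out, since
the rule for picking the result encoding is not established above.

```ruby
# Sketch only: concat_outcome is a hypothetical helper, not a Ruby API.
# It concatenates two strings and reports the result's encoding,
# or the exception class if Ruby refuses the operation.
def concat_outcome(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

euro   = "a\u20acb"                        # UTF-8; the euro sign takes 3 bytes
ascii  = "abc".encode("US-ASCII")
utf8   = "caf\u00e9"                       # UTF-8 with a non-ASCII character
latin1 = "caf\u00e9".encode("ISO-8859-1")  # same text, different bytes
abc_l1 = "abc".encode("ISO-8859-1")
abc_u8 = "abc".encode("UTF-8")

# Indexing works per character, but the character keeps its 3 bytes.
p euro.length       # 3
p euro.bytesize     # 5
p euro[1].bytesize  # 3

# US-ASCII is compatible with UTF-8, so the result is treated as UTF-8.
p concat_outcome(ascii, utf8)    # #<Encoding:UTF-8>

# Non-ASCII content in two incompatible encodings: the operation raises.
p concat_outcome(utf8, latin1)   # Encoding::CompatibilityError

# ASCII-only content tagged with incompatible encodings: allowed.
p concat_outcome(abc_l1, abc_u8) # one of the two original encodings
p abc_l1 == abc_u8               # true
```

Building the test strings with \u escapes and String#encode keeps the sketch
independent of the source file's own encoding.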