On Sun, 14 Dec 2008 01:01:44 +1100, Brian Candler <B.Candler / pobox.com>  
wrote:


> For example, what are the semantics of comparing strings with different
> encodings? Are they compared byte-by-byte, or character-by-character as
> unicode codepoints, or some other way?

Yes, I agree this needs to be documented a lot better than it is at the
moment.
I also think that some of the behaviour is a little "unexpected" :) though  
this is only in unusual cases.

From my testing:
- String operations are done using the bytes in the strings - they are not  
converted to codepoints internally
- String equality comparisons seem to be simply done on a byte-by-byte  
basis, without regard to the encoding
- *However* other operations are not simply byte-by-byte. They are done
character-by-character, but without converting to codepoints - eg: a
3-byte character is kept as 3 bytes. This means that when operating on a
variable-length encoding, simple operations like indexing can be
inefficient, as Ruby may have to scan through the string from the start.
However Ruby does try to optimize this where possible.
- There is also a concept of "compatible encodings". Given 2 encodings e1  
& e2, e1 is compatible with e2 if the representation of every character in  
e1 is the same as in e2. This implies that e2 must be a "bigger" encoding  
than e1 - ie: e2 is a superset of e1. Typically we are mainly talking  
about US-ASCII here, which is compatible with most other character sets  
that are either all single-byte (eg: all the ISO-8859 sets) or are  
variable-length multi-byte (eg: UTF-8).
- When operating on encodings e1 & e2, if e1 is compatible with e2, then  
Ruby treats both strings as being in encoding e2.
- String#> and String#< are a bit weird. Normally they are just done on a
byte-by-byte basis, UNLESS the strings are the same and are incompatible  
encodings, then they always seem to return FALSE. (I have to check this -  
it may be more complicated than this).
- When operating on incompatible encodings, *normally* non-comparison
operations (including regexp matches) raise an "Encoding Compatibility
Error" (Encoding::CompatibilityError).
- However there appears to be an exception to this: if operating on 2  
incompatible encodings AND US-ASCII is compatible with both, AND both  
strings are US-ASCII strings, then the operation appears to proceed,  
treating both as US-ASCII. For example "abc" as ISO-8859-1 and "abc" as
UTF-8. I guess this is Ruby being "forgiving". (Personally I am not sure  
if this is good or bad). The encoding of the result (for example of a  
string concatenation) seems to be one of the 2 original encodings - I  
haven't figured out the logic to this yet :)
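
A small script along these lines can be used to check several of the points above on whatever Ruby you have installed. It is only a sketch: the variable names are made up for illustration, it assumes the String and Encoding methods found in Ruby 1.9 and later, and some results (in particular which encoding a concatenation ends up with) may vary between versions, so it prints them rather than assuming them.

```ruby
# ASCII-only strings tagged with two different, mutually incompatible encodings.
latin_abc = "abc".encode("ISO-8859-1")
utf8_abc  = "abc".encode("UTF-8")

# Non-ASCII content: U+00E9 is one byte in ISO-8859-1 and two bytes in UTF-8.
utf8_e  = "\u00E9"
latin_e = utf8_e.encode("ISO-8859-1")

# Character-wise operations: length counts characters, bytesize counts bytes.
puts "UTF-8 e-acute: length=#{utf8_e.length}, bytesize=#{utf8_e.bytesize}"

# Encoding.compatible? returns the encoding Ruby would use, or nil if none.
both_ascii = Encoding.compatible?(latin_abc, utf8_abc)
non_ascii  = Encoding.compatible?(latin_e, utf8_e)
puts "ASCII-only pair: #{both_ascii.inspect}"
puts "non-ASCII pair:  #{non_ascii.inspect}"

# ASCII-only strings compare equal and concatenate despite differing encodings.
ascii_equal = (latin_abc == utf8_abc)
joined      = latin_abc + utf8_abc
puts "equal? #{ascii_equal}, joined encoding: #{joined.encoding}"

# A US-ASCII string combined with a non-ASCII UTF-8 string is treated as UTF-8.
widened = "abc".encode("US-ASCII") + utf8_e
puts "US-ASCII + UTF-8 gives: #{widened.encoding}"

# Non-ASCII strings in incompatible encodings raise on concatenation.
error_class = begin
  latin_e + utf8_e
  nil
rescue Encoding::CompatibilityError => e
  e.class
end
puts "incompatible concat raised: #{error_class.inspect}"
```

Swapping the operands in the `joined` line is a quick way to see whether the result encoding follows the left-hand or the right-hand string.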

James - feel free to use any of the above to add to your excellent M17N  
summary.

Cheers
Mike