On Sun, Dec 14, 2008 at 09:57:55AM +0900, Michael Selig wrote:
> On Sun, 14 Dec 2008 01:01:44 +1100, Brian Candler <B.Candler / pobox.com>  
> wrote:
>
>
>> For example, what are the semantics of
>> comparing strings with different encodings? Are they compared  
>> byte-by-byte,
>> or character-by-character as unicode codepoints, or some other way?
>
> Yes, I agree this needs to be documented a lot better than it is at the 
> moment.
> I also think that some of the behaviour is a little "unexpected" :) 
> though this is only in unusual cases.

Thank you for your detailed explanation.

The other thing that concerns me most is how much more 'magic' behaviour
there is that I need to know about, and what I need to do to turn it off when
I am dealing with binary data. For me, DTRT means Leave My Binary Data Alone
:-)

e.g. I read that File.open now has an :encoding=>... option. This is not
documented in ri, apart from showing there is an [opt] parameter.

Does the :encoding default to binary, or to the encoding of the source file,
or the terminal from which it is run, or to something else? Does it try to
be clever, e.g. taste the Unicode BOM?? For me that would be the "wrong"
thing. For me, it seems to be doing the wrong thing by default:

irb(main):004:0> File.open("/bin/sh").gets.encoding
=> #<Encoding:UTF-8>

Ditto for sockets.

Ditto for Net::HTTP - e.g. does it try to use the Content-Type ... charset
header? If not now, might it do so in future??
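For the File.open case, here is a minimal sketch of three ways to ask for
untagged bytes explicitly rather than relying on the default. It assumes a
Ruby with String#b (2.0 or later); the helper name read_binary_three_ways and
the sample bytes are made up for illustration, and nothing here says what
sockets or Net::HTTP do.

```ruby
require "tempfile"

# Hypothetical helper: read the same file three ways, each requesting
# binary (ASCII-8BIT) data instead of the default external encoding.
def read_binary_three_ways(path)
  {
    mode:    File.open(path, "rb") { |f| f.read },                   # "b" mode flag
    option:  File.open(path, "r", encoding: "BINARY") { |f| f.read }, # :encoding option
    binread: File.binread(path)                                      # always binary
  }
end

# Sample bytes chosen to be invalid as UTF-8 (assumption for the demo).
sample = "\xFF\xFEa\xC3\x9F".b

results = Tempfile.create("binary-sample") do |tmp|
  tmp.binmode
  tmp.write(sample)
  tmp.flush
  read_binary_three_ways(tmp.path)
end

results.each do |how, data|
  puts "#{how}: #{data.encoding} #{data.bytes.inspect}"
end
```

Each of the three reads should come back tagged ASCII-8BIT with the bytes
unchanged; a plain File.open(path).read is instead tagged with whatever
Encoding.default_external happens to be.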

At the moment I am worried about the semantics of the most basic low-level
operations. For example, if I read some bytes from file A, and compare them to
a string literal B, I want to be sure they will compare equal if they are
the same sequence of bytes. By the sound of it, this means I have to declare
:encoding=>"BINARY" everywhere, or at least be confident that everything I
do has this as a default (and will remain so going forward); if I forget
one, I may introduce subtle bugs into my program.
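To make the worry concrete, here is a small sketch of how String#== treats
two strings holding identical bytes but carrying different encoding tags. It
assumes Ruby 2.0 or later for String#b; bytes_equal? is a hypothetical helper,
not something the standard library provides.

```ruby
# Hypothetical helper: compare two strings purely as byte sequences,
# ignoring whatever encoding each one is tagged with.
def bytes_equal?(a, b)
  a.b == b.b
end

utf8   = "a\u00DF"  # tagged UTF-8, bytes 0x61 0xC3 0x9F
binary = utf8.b     # same bytes, tagged ASCII-8BIT

same_bytes     = utf8.bytes == binary.bytes  # identical byte arrays
equal_as_is    = utf8 == binary              # encodings differ, non-ASCII content
equal_as_bytes = bytes_equal?(utf8, binary)  # both sides retagged as binary
ascii_equal    = "abc" == "abc".b            # ASCII-only content on both sides

puts "same bytes:       #{same_bytes}"
puts "== as tagged:     #{equal_as_is}"
puts "== as binary:     #{equal_as_bytes}"
puts "ASCII-only equal: #{ascii_equal}"
```

So a forgotten encoding goes unnoticed while the data is pure ASCII and only
shows up as an unequal comparison once a byte of 0x80 or above is involved.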

Even BINARY seems to be a second-class alias:

irb(main):010:0> a.force_encoding("BINARY")
=> "a\xC3\x9F"
irb(main):011:0> a.encoding
=> #<Encoding:ASCII-8BIT>
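A short sketch of the aliasing itself: "BINARY" and ASCII-8BIT resolve to the
same Encoding object, so the name passed to force_encoding is not the name
reported back. The variable names are illustrative only, and the exact
inspect output may differ between Ruby versions.

```ruby
a = "a\u00DF".dup
a.force_encoding("BINARY")

same_object = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)  # one object, two constants
found       = Encoding.find("BINARY")                        # lookup by alias name
names       = Encoding::ASCII_8BIT.names                     # all names for this encoding

puts "same object: #{same_object}"
puts "find:        #{found.inspect}"
puts "names:       #{names.inspect}"
puts "a.encoding:  #{a.encoding.inspect}"
```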

Ruby is telling me that all data is text, whereas I believe the opposite is
true :-)

Regards,

Brian.