Tom Link wrote:
>> but now I know my program will puke
>> on any text file with multibyte characters.
> 
> Not necessarily.
> 
> Here is a useful summary of encodings in 1.9:
> http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
> 
> Basically, you have script encoding, internal encoding, and external
> encoding. In your case, you should probably read the files as ASCII8BIT
> or binary, I guess.

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that 
all external data is text, unless explicitly told otherwise.

So if you deal with data which is not text (as I do all the time), you 
need to put

  File.open("....", :encoding => "BINARY")

everywhere. And even then, if you ask the open File object what its 
encoding is, it will say ASCII-8BIT, even though you explicitly told it 
that it's BINARY.
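
To see this for yourself, here is a minimal sketch. The temp file name and the sample bytes are made up for illustration, and I'm comparing Encoding objects rather than names, since how the encoding is printed may differ between Ruby versions:

```ruby
require "tempfile"

# Hypothetical sample file holding bytes that are not valid UTF-8 text.
tmp = Tempfile.new("binary-demo")
tmp.binmode
tmp.write([0xFF, 0xFE, 0x00, 0x01].pack("C*"))
tmp.close

reported = nil
data = nil
File.open(tmp.path, :encoding => "BINARY") do |f|
  reported = f.external_encoding  # what the File object says about itself
  data = f.read                   # the string read carries the same encoding
end

puts reported.inspect
puts data.encoding.inspect
```

Both values come back as the ASCII-8BIT encoding object, not as anything separately named BINARY.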

This is because "BINARY" is just a synonym for "ASCII-8BIT" in ruby. Of 
course, there is plenty of data out there which is not encoded using the 
American Standard Code for Information Interchange. MIME distinguishes 
clearly between 8BIT (text with high bit set) and BINARY (non-text). In 
terms of Ruby's processing it makes no difference, but it's annoying for 
Ruby to tell me that my data is text, when it is not.
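
A quick way to check the alias, as a sketch rather than anything authoritative: look up both names and see whether you get the very same Encoding object back.

```ruby
# Look the encoding up by its "BINARY" name.
binary = Encoding.find("BINARY")

# equal? tests object identity, so this is true only if both names
# resolve to one and the same Encoding object.
same_object = binary.equal?(Encoding::ASCII_8BIT)

# The Encoding::BINARY constant points at that object as well.
same_constant = Encoding::BINARY.equal?(Encoding::ASCII_8BIT)

puts same_object
puts same_constant
```

So there is only one encoding here with two names, and no way to tag a string as "binary, not text" as distinct from "8-bit text".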

Note: in more recent 1.9 releases, I believe that

  File.open("....", "rb")

does two things:
1. Disables line-ending translation under Windows
2. Sets the encoding to ASCII-8BIT

So this may be sufficient for your needs, and it has the advantage that 
the same code will run under ruby <1.9.
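
A minimal sketch of the "rb" approach, assuming a made-up temp file and sample bytes. The respond_to? guard is my own addition so that the same code also loads on interpreters older than 1.9, where external_encoding does not exist:

```ruby
require "tempfile"

# Hypothetical binary payload, including a CR LF pair that text mode
# on Windows would otherwise translate.
payload = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A]

tmp = Tempfile.new("rb-demo")
tmp.binmode
tmp.write(payload.pack("C*"))
tmp.close

rb_encoding = nil
rb_bytes = nil
File.open(tmp.path, "rb") do |f|
  # Only ask for the encoding where the method exists (1.9 and later).
  rb_encoding = f.external_encoding if f.respond_to?(:external_encoding)
  rb_bytes = f.read.unpack("C*")
end

puts rb_encoding.inspect
puts rb_bytes.inspect
```

The bytes come back untouched, and on 1.9 and later the file reports ASCII-8BIT without any :encoding option.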
-- 
Posted via http://www.ruby-forum.com/.