Hi,

Michael Selig wrote:
> On Mon, 08 Sep 2008 19:45:36 +1000, Yukihiro Matsumoto 
> <matz / ruby-lang.org> wrote:
> 
>> |1) Maybe I am blind, but I cannot find something like String#each_code
>> |to return an Enumerator of the Unicode codepoints as fixnums. Is there
>> |such a beast? If not, I think there should be, considering that there is
>> |a String#each_byte. (Yes, you can use String#each_char and then
>> |String#ord on each). Also I think there should be an equivalent of
>> |String#setbyte & getbyte for unicode codepoints (String#setcode &
>> |getcode?).
>>
>> each_code is ambiguous for me.  codepoint?
> Is "each_codepoint" too long?

When would you use each_code?
If you want to use it to iterate over CHARACTERS, it may go wrong.

As you know, Unicode has combining sequences, where one character is made up
of more than one codepoint.  In other words, a character may consist of
several codepointS.
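
As a small sketch of the problem, using only String#each_char and
String#ord as mentioned above (codes_of is a name made up for this
illustration, not an existing method):

```ruby
# Hypothetical helper: collect the code of every character via each_char + ord.
def codes_of(str)
  str.each_char.map { |c| c.ord }
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
combined = "e\u0301"
# The precomposed form U+00E9, which looks the same on screen.
precomposed = "\u00E9"

p codes_of(combined)     # two codes for one visible character
p codes_of(precomposed)  # a single code
```

Both strings look the same to a reader, but iterating by codepoint gives
a different number of items for each.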

Moreover, in encodings other than Unicode, the codepoint is not an important
concept.  In EUC-JP or Shift_JIS, it is only an identifier of a character:
"\xA2\xA4" is codepoint 0xA2A4 ... is that useful?

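A rough sketch of what I mean, assuming String#ord gives the code in the
string's own encoding (first_code is a made-up name for illustration):

```ruby
# Hypothetical helper: code of the first character, in the string's own encoding.
def first_code(str)
  str.ord
end

# The same two bytes as above, tagged as EUC-JP.
euc = "\xA2\xA4".dup.force_encoding("EUC-JP")
# The same character converted to UTF-8.
utf = euc.encode("UTF-8")

printf("EUC-JP code: 0x%X\n", first_code(euc))
printf("UTF-8 code:  U+%04X\n", first_code(utf))
```

The same character gets a different number in each encoding, so the
number by itself says little unless you also know the encoding.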
Another reason is that GB18030 has characters consisting of 4 bytes.
Their codes may be 32 bits wide, but Fixnum is 31 bits in a 32-bit environment.
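
To put numbers on it, here is a sketch that assumes the code of a
multibyte character is simply its bytes concatenated (bytes_as_code and
FIXNUM_MAX_32 are names invented for this example):

```ruby
# Hypothetical helper: concatenate the bytes of a string into one integer.
def bytes_as_code(char)
  char.bytes.inject(0) { |acc, b| (acc << 8) | b }
end

# Largest value of a 31-bit signed Fixnum (assumed 32-bit environment).
FIXNUM_MAX_32 = 2**30 - 1

# A 4-byte GB18030 sequence.
gb = "\x81\x30\x81\x30".dup.force_encoding("GB18030")
code = bytes_as_code(gb)

printf("code:       0x%X\n", code)
printf("Fixnum max: 0x%X\n", FIXNUM_MAX_32)
puts "does not fit in a 31-bit Fixnum" if code > FIXNUM_MAX_32
```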

So we don't want to debut codepoints on the main stage.


>> |3) Suggestion: when opening a file with mode "b" (binary), I think the
>> |encoding should be automatically set to 8-bit ASCII, overriding the
>> |default locale. I think this should happen on Linux & Unix as well. That
>> |way "IO#readchar" and others will only try to do byte-by-byte processing
>> |(I hope!).
>>
>> For historical reason, "b" does not mean "binary" encoding, but
>> suppression of newline processing (\r\n -> \n).  It has nothing to do
>> with 8bit-ascii.
> Before the days of Unicode you are right. But the options "t" and "b" to 
> the C function fopen() *do* stand for text and binary in Windows. When 
> Windows started to support Unicode however, "binary mode" also means 
> that when reading multi-byte character files, conversion to "wide-chars" 
> is suppressed. See http://msdn.microsoft.com/en-us/library/c4cy2b8e.aspx 
> . In Ruby 1.9, 8-bit ascii is the closest thing we have to a byte string 
> - in fact BINARY is an alias for ASCII-8BIT in Ruby's standard encodings.

In Ruby 1.9, "t" is the flag for universal newline.
So "b" in Ruby 1.9 is defined in that context.

This is different from its meaning on Windows in a Unicode context.

Anyway, you can use
   open("foo.txt", "rb:ASCII-8BIT"){|f|print f.read}
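
A slightly fuller sketch of the same idea, writing a temporary file first
so that it runs on its own (read_as_bytes is a made-up helper name, and
the file contents are only sample data):

```ruby
require "tempfile"

# Hypothetical helper: read a whole file as a byte string, with no newline
# conversion ("b") and the ASCII-8BIT encoding stated explicitly.
def read_as_bytes(path)
  File.open(path, "rb:ASCII-8BIT") { |f| f.read }
end

Tempfile.create("foo") do |tmp|
  tmp.binmode
  tmp.write("caf\xC3\xA9\r\n")  # sample data: UTF-8 bytes plus CR LF
  tmp.flush

  data = read_as_bytes(tmp.path)
  p data.encoding   # the encoding tag of the returned string
  p data.bytesize   # number of bytes read
  p data.length     # number of "characters", counted byte by byte
end
```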

-- 
NARUSE, Yui  <naruse / airemix.jp>