On Tue, 09 Sep 2008 03:43:54 +1000, NARUSE, Yui <naruse / airemix.jp> wrote:

> Hi,
>
> Michael Selig wrote:
>> On Mon, 08 Sep 2008 19:45:36 +1000, Yukihiro Matsumoto  
>> <matz / ruby-lang.org> wrote:
>>
>>> |1) Maybe I am blind, but I cannot find something like  
>>> String#each_code to
>>> |return an Enumerator of the Unicode codepoints as fixnums. Is there  
>>> such a
>>> |beast? If not, I think there should be, considering that there is a
>>> |String#each_byte. (Yes, you can use String#each_char and then  
>>> String#ord
>>> |on each). Also I think there should be an equivalent of  
>>> String#setbyte &
>>> |getbyte for unicode codepoints (String#setcode & getcode?).
>>>
>>> each_code is ambiguous for me.  codepoint?
>> Is "each_codepoint" too long?
>
> When would you use each_code?
> If you want to use it to iterate over CHARACTERS, things may go wrong.
>
> You know, there are combining characters in Unicode which consist of more
> than one codepoint.  In other words, a character may consist of codepointS.
>
> Moreover, in encodings other than Unicode, the codepoint is not an important
> concept.  In EUC-JP or Shift_JIS, codepoints are merely identifiers of
> characters: "\xA2\xA4" is codepoint 0xA2A4 ... are they useful?
Having a way of easily iterating through the codepoints (or whatever you  
want to call them when not applied to Unicode) as numbers IS useful,  
especially when processing variable-length character encodings. A way of  
manipulating them as numbers "in place", without having to unpack them to  
an array first, is also useful to me.
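For illustration, here is a minimal sketch of the kind of iterator I mean, built on the existing String#each_char and String#ord (the method name each_codepoint is hypothetical, not an existing 1.9 API):

```ruby
# Hypothetical helper: walk a string's codepoints as Integers,
# built on the existing String#each_char and String#ord.
def each_codepoint(str)
  str.each_char { |c| yield c.ord }
end

utf8 = []
each_codepoint("ab") { |n| utf8 << n }   # utf8 == [97, 98]

# It also works outside Unicode: for EUC-JP, ord returns the
# byte-composed identifier, e.g. "\xA2\xA4" => 0xA2A4
eucjp = []
each_codepoint("\xA2\xA4".b.force_encoding("EUC-JP")) { |n| eucjp << n }
# eucjp == [0xA2A4]
```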

> Another reason is that GB18030 has characters consisting of 4 bytes.
> They may be 32 bits wide, but Fixnum is only 31 bits in a 32-bit environment.
>
> So we don't want to debut codepoints on the main stage.
I am not an expert on this encoding, but all I was suggesting was returning  
the same value that "String#ord" does now for a single character. Maybe  
String#ord is wrong for GB18030? Please look at the Ruby 1.9 source file  
enc/gb18030.c, function gb18030_mbc_to_code(). Its last 2 lines are:
	n &= 0x7FFFFFFF;
	return n;
So this function only returns 31 bits.
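To illustrate the effect of that mask (the input value below is illustrative only, not a verified GB18030 sequence):

```ruby
# The C code above masks the combined byte value to 31 bits before
# returning it, so the result always fits in a 32-bit-build Fixnum.
# 0x90308130 is an illustrative 32-bit value, not a checked sequence.
n = 0x90308130
n &= 0x7FFFFFFF
# n == 0x10308130 -- the 32nd bit has been dropped
```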


>>> |3) Suggestion: when opening a file with mode "b" (binary), I think the
>>> |encoding should be automatically set to 8-bit ASCII, overriding the
>>> |default locale. I think this should happen on Linux & Unix as well.  
>>> That
>>> |way "IO#readchar" and others will only try to do byte-by-byte  
>>> processing
>>> |(I hope!).
>>>
>>> For historical reason, "b" does not mean "binary" encoding, but
>>> suppression of newline processing (\r\n -> \n).  It has nothing to do
>>> with 8bit-ascii.
>> Before the days of Unicode, you were right. But the options "t" and "b"  
>> to the C function fopen() *do* stand for text and binary in Windows.  
>> When Windows started to support Unicode, however, "binary mode" also  
>> came to mean that when reading multi-byte character files, conversion  
>> to "wide chars" is suppressed. See  
>> http://msdn.microsoft.com/en-us/library/c4cy2b8e.aspx . In Ruby 1.9,  
>> 8-bit ASCII is the closest thing we have to a byte string - in fact  
>> BINARY is an alias for ASCII-8BIT in Ruby's standard encodings.
>
> In Ruby 1.9, "t" is the flag for universal newline handling,
> so "b" in Ruby 1.9 is defined in that context.
>
> This is different from the Windows Unicode context.
>
> Anyway you can use
>    open("foo.txt", "rb:ASCII-8BIT"){|f|print f.read}
>
Yes, I know you can use that. I was just questioning whether it makes  
sense for Ruby to stick with "b" meaning only the newline & end of file  
handling, without also changing the character handling. I felt it makes  
much more sense for IO#readchar (et al.) to return bytes when the file is  
opened with "b". That behaviour might also be more backward compatible  
with 1.8.
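For what it's worth, the explicit-encoding workaround does give byte-wise reads today; a quick check (the file name is just an example):

```ruby
# With an explicit ASCII-8BIT encoding, IO#readchar returns single
# bytes, even for what UTF-8 would decode as one multi-byte character.
# ("demo.txt" is an example file name.)
File.binwrite("demo.txt", "\xC3\xA9ok")   # "e-acute" in UTF-8, then "ok"
ch = open("demo.txt", "rb:ASCII-8BIT") { |f| f.readchar }
# ch == "\xC3" and ch.bytesize == 1: one byte, not one decoded character
File.delete("demo.txt")
```

My suggestion was simply that plain "rb" behave the same way, without the encoding suffix.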

Mike.