Michael Selig wrote:
>> When would you use each_code?
>> If you want to use it to iterate over CHARACTERS, it may go wrong.
>>
>> You know, there are combined characters in Unicode which consist of one or
>> more codepoints.  In other words, a character may consist of codepointS.
>>
>> Moreover, in encodings other than Unicode, the codepoint is not an important
>> concept.  In EUC-JP or Shift_JIS, codepoints are only identifiers of characters:
>> "\xA2\xA4" is codepoint 0xA2A4 ... is that useful?
> Having a way of easily iterating through the codepoints (or whatever you 
> want to call them when not applied to Unicode) as numbers IS useful, 
> especially when processing variable-length character encodings. Also a 
> way of manipulating them as numbers "in-place" without having to unpack 
> them to an array first is useful to me.

Why aren't each_char and String#ord enough?
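
For example, a minimal sketch of what I mean, assuming a UTF-8 string
(the sample strings and the variable names are only for illustration):

```ruby
# -*- coding: utf-8 -*-
s = "a\u3042"                      # "a" followed by U+3042, as UTF-8

codes = []
s.each_char { |c| codes << c.ord } # one integer per character
# codes == [0x61, 0x3042]

# A base letter plus a combining accent is still two codepoints,
# so each_char yields two elements here.
combined = "e\u0301"
parts = combined.each_char.to_a
# parts.length == 2
```

This gives the numbers without unpacking the whole string into an array
first, and the second half shows why iterating codepoints is not the same
as iterating what a reader would call characters.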

>> Another reason is that GB18030 has characters consisting of 4 bytes.
>> They may be 32 bits wide, but Fixnum is 31 bits in a 32-bit environment.
>>
>> So we don't want to debut codepoints on the main stage.
> I am not an expert on this encoding, but all I was suggesting is 
> returning the same value as "String#ord" does now for a single 
> character. Maybe String#ord is wrong for GB18030? Please look at the 
> Ruby 1.9 source file enc/gb18030.c, function gb18030_mbc_to_code(). The 
> last 2 lines say:
>     n &= 0x7FFFFFFF;
>     return n;
> So this function only returns 31 bits.

This can't be taken as proof that only 31 bits are used.
This code cuts off the 32nd bit so that the value fits in 31 bits.
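
A rough sketch of the arithmetic, assuming the four bytes are packed
big-endian into one integer before the mask is applied (that packing and
the sample bytes are my assumptions for illustration, not copied from
enc/gb18030.c):

```ruby
# Sample bytes with the high bit set in the first byte; chosen only to
# show the effect of the mask.
bytes = [0x81, 0x30, 0x81, 0x30]

# Pack the bytes into a single integer, most significant byte first.
n = bytes.inject(0) { |acc, b| (acc << 8) | b }   # 0x81308130, needs 32 bits

# Same mask as in the quoted source: clear the 32nd bit.
masked = n & 0x7FFFFFFF                           # 0x01308130

puts format('%08X -> %08X', n, masked)
```

The packed value needs 32 bits; only after the mask does it fit in 31.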

-- 
NARUSE, Yui  <naruse / airemix.jp>