Michael Selig wrote:
>> When would you use each_code?
>> If you want to use it to iterate over CHARACTERS, it may go wrong.
>>
>> You know, there are combined characters in Unicode which have one or
>> more codepoints. In other words, a character may consist of several
>> codepoints.
>>
>> Moreover, outside Unicode, the codepoint is not an important concept.
>> In EUC-JP or Shift_JIS, codepoints are only identifiers of characters:
>> "\xA2\xA4" is codepoint 0xA2A4 ... are they useful?
> Having a way of easily iterating through the codepoints (or whatever you
> want to call them when not applied to Unicode) as numbers IS useful,
> especially when processing variable-length character encodings. Also a
> way of manipulating them as numbers "in-place" without having to unpack
> them to an array first is useful to me.

Why aren't each_char and String#ord enough?

>> Another reason is that GB18030 has characters consisting of 4 bytes.
>> They may be 32 bits wide, but Fixnum is 31 bits in a 32-bit environment.
>>
>> So we don't want to debut codepoints on the main stage.
> I am not an expert on this encoding, but all I was suggesting is
> returning the same value as "String#ord" does now for a single
> character. Maybe String#ord is wrong for GB18030? Please look at the
> Ruby 1.9 source file enc/gb18030.c, function gb18030_mbc_to_code(). The
> last 2 lines say:
>     n &= 0x7FFFFFFF;
>     return n;
> So this function only returns 31 bits.

That can't be the reason that only 31 bits are used. That code cuts off
the 32nd bit in order to fit the value into 31 bits.

-- 
NARUSE, Yui <naruse / airemix.jp>
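
To make the two points above concrete, here is a minimal Ruby sketch. It assumes Ruby 1.9 or later; the helper name codes_of and the sample 4-byte value 0x8431A530 are illustrative choices, not anything from the Ruby source. It walks a string with each_char and String#ord, and it repeats the 31-bit mask arithmetic quoted from gb18030_mbc_to_code().

```ruby
# Hypothetical helper: collect String#ord for every character of str.
def codes_of(str)
  str.each_char.map { |c| c.ord }
end

# "e" followed by COMBINING ACUTE ACCENT, then "a".
# A reader sees two characters, but each_char yields three.
utf8_codes = codes_of("e\u0301a")

# The EUC-JP bytes from the thread. Outside Unicode, ord is just the
# bytes of the character read as one number.
euc = [0xA2, 0xA4].pack("C*").force_encoding("EUC-JP")
euc_codes = codes_of(euc)

# The mask from enc/gb18030.c applied to a sample 4-byte value.
# The lead byte of a GB18030 character is 0x81 or above, so bit 32 is
# set and the mask clears it.
masked = 0x8431A530 & 0x7FFFFFFF

puts utf8_codes.map { |n| "0x%X" % n }.join(" ")
puts euc_codes.map { |n| "0x%X" % n }.join(" ")
puts "0x%08X" % masked
```

The first line of output has three numbers for what looks like two characters, which is the combined-character problem raised earlier in the thread.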