At 23:08 08/09/10, NARUSE, Yui wrote:
>Michael Selig wrote:

>> Having a way of easily iterating through the codepoints (or whatever you want to call them when not applied to Unicode) as numbers IS useful, especially when processing variable-length character encodings. Also a way of manipulating them as numbers "in-place" without having to unpack them to an array first is useful to me.

>Why aren't each_char and String#ord enough?

In my experience, in many of the cases where each_char is really
needed (rather than being used because somebody doesn't know
about a convenient higher-level method for the job, such as
#gsub), people will call #ord anyway, because they need the
integer value, or they will prefer the integer value simply
for efficiency reasons.
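
To make the comparison concrete, here is a minimal sketch of the two
styles. It assumes a Ruby that provides String#each_codepoint (current
releases do); the sample string is made up for illustration and is not
taken from this thread.

```ruby
# Sketch only: two ways to get one integer per character.
s = "h\u00E9llo"   # hypothetical sample: "h", U+00E9, "l", "l", "o"

# Style 1: each_char creates a one-character String per step,
# and #ord then turns each of those strings into an integer.
via_ord = s.each_char.map { |c| c.ord }

# Style 2: each_codepoint yields the integers directly,
# without the intermediate one-character strings.
via_codepoints = s.each_codepoint.to_a

p via_ord         # => [104, 233, 108, 108, 111]
p via_codepoints  # => [104, 233, 108, 108, 111]
```

Both give the same integers; the difference is only in how many
temporary strings get created along the way.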

I think we had a similar discussion on ruby-dev with the people
who are doing image processing with Ruby, and we ended up with
#setbyte or some such not because it was *absolutely* necessary,
but because it was way more efficient than having to create
millions of one-byte strings. By analogy, we might end up
with #set_code[point], although my feeling is that it would be
a bit less useful than #setbyte, because strings are much
less rigid than images.
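
As a rough illustration of the #setbyte point, the sketch below inverts
a tiny 8-bit grayscale buffer in place. The four "pixel" values are
hypothetical, and String#b assumes a reasonably recent Ruby; the only
methods that matter here are String#getbyte and String#setbyte.

```ruby
# Sketch only: a binary string standing in for a 4-pixel grayscale image.
pixels = "\x00\x40\x80\xFF".b   # .b gives a binary (ASCII-8BIT) copy

# Invert every pixel in place. getbyte/setbyte work on integers,
# so no one-byte String objects are created inside the loop.
pixels.bytesize.times do |i|
  pixels.setbyte(i, 255 - pixels.getbyte(i))
end

p pixels.bytes  # => [255, 191, 127, 0]
```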


>>> Another reason is that GB18030 has characters consisting of 4 bytes.
>>> They may be 32 bits wide, but Fixnum is 31 bits in a 32-bit environment.
>>>
>>> So we don't want to debut codepoints on the main stage.
>> I am not an expert on this encoding, but all I was suggesting is returning the same value as "String#ord" does now for a single character. Maybe String#ord is wrong for GB18030? Please look at the Ruby 1.9 source file enc/gb18030.c, function gb18030_mbc_to_code(). The last 2 lines say:
>>     n &= 0x7FFFFFFF;
>>     return n;
>> So this function only returns 31 bits.
>
>This can't be the reason that only 31 bits are used.
>This code is cutting off the 32nd bit to fit into 31 bits.

Sorry, I don't understand. Either it's fine to cut off
the top bit, both for #ord and for a new codepoint iterator
(or whatever we are going to call it), or it's not okay. (*)
In fact, if that new method and #ord returned different
values for the same character, that would be really strange.

(*) My understanding is that as long as the user knows that's
    happening, it's fine, because there is no possibility
    to conflate two different codepoints.
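
The reasoning behind (*) can be sketched with plain integer arithmetic.
This assumes the GB18030 rule that a four-byte sequence always starts
with a byte in 0x81..0xFE, so the top bit of the packed 32-bit value is
always set. The helper pack4 and the two byte sequences are hypothetical
and only mimic the masking step quoted above; this is not the code in
enc/gb18030.c.

```ruby
# Sketch only: pack four bytes, big-endian, into one integer.
def pack4(bytes)
  bytes.inject(0) { |n, b| (n << 8) | b }
end

mask = 0x7FFFFFFF

a = pack4([0x81, 0x30, 0x81, 0x30])   # 0x81308130
b = pack4([0x81, 0x30, 0x81, 0x31])   # 0x81308131

masked_a = a & mask                    # top bit cut off: 0x01308130
masked_b = b & mask                    # top bit cut off: 0x01308131

# Distinct inputs stay distinct after masking.
p masked_a != masked_b                 # => true

# Because the cut-off bit was always 1, the original is recoverable.
p (masked_a | 0x80000000) == a         # => true

# Masked four-byte values stay above anything two bytes can encode.
p masked_a > 0xFFFF                    # => true
```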

Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp