On Sep 8, 2008, at 10:43 AM, NARUSE, Yui wrote:

>>> each_code is ambiguous for me.  codepoint?
>> Is "each_codepoint" too long?
>
> When do you use each_code?
> If you want to use it to iterate over CHARACTERS, it may go wrong.

In Unicode, each_character and each_codepoint would mean the same
thing. For serious low-level text processing in Unicode, you really
must have codepoint-by-codepoint access. By "serious" I mean
things like full-text indexers, markup parsers, and typesetting
software. It might be useful to look at
http://www.w3.org/TR/charmod/ - individual "characters" do not
correspond to units of display, nor to units of sound, nor to units of
input, nor to units of collation, nor to units of storage. Developers
simply have to live with these facts, and once again, in Unicode, the
only way to stay sane is to process a codepoint at a time. In my work
on efficient and (unlike REXML) correct XML parsing, I had to do
everything via String#unpack, which is very memory-inefficient, when
all I wanted was next-Unicode-codepoint.
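
To make the contrast concrete, here is a minimal sketch, not taken from
any real parser. It assumes the input is valid UTF-8 and that
String#each_codepoint (the method under discussion in this thread)
yields one Integer per codepoint. The helper names
count_markers_unpack and count_markers_streaming are hypothetical.

```ruby
LT = 0x3C # codepoint of "<", a typical syntax marker for a markup parser

# The String#unpack approach: "U*" decodes the whole string into an
# Array of Integers before any scanning can start.
def count_markers_unpack(text)
  text.unpack("U*").count { |cp| cp == LT }
end

# The next-codepoint approach: each codepoint is handed to the block
# in turn, with no intermediate Array of the whole string.
# Assumes String#each_codepoint is available.
def count_markers_streaming(text)
  count = 0
  text.each_codepoint { |cp| count += 1 if cp == LT }
  count
end

doc = "<a>\u00E9</a>"
count_markers_unpack(doc)     # => 2
count_markers_streaming(doc)  # => 2
```

Both give the same answer; the difference is only in how much has to be
held in memory while scanning a large document.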

> You know, there are combined characters in Unicode which consist of
> one or more codepoints. In other words, a character may consist of
> codepointS.

Actually, that's not correct.  There are some characters in Unicode  
which are supposed to be combined visually when displayed.   
Fortunately, these characters are not typically used as syntax markers  
that the authors of parsers and scanners care about very much.  And  
furthermore, if you are processing for the purposes of full-text  
indexing or visual display, you still need the codepoint-by-codepoint  
access to produce correct and useful results, at least in my experience.
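
As a small illustration of that point, here is a sketch assuming valid
UTF-8 input and an available String#each_codepoint; the helper
delimiter_positions is hypothetical. A combining sequence such as "e"
followed by U+0301 COMBINING ACUTE ACCENT is two codepoints, but neither
of them is a delimiter, so a codepoint-level scan for markup is
unaffected by it.

```ruby
# Hypothetical helper: returns the codepoint index of every "<" or ">".
def delimiter_positions(text)
  positions = []
  text.each_codepoint.with_index do |cp, index|
    positions << index if cp == 0x3C || cp == 0x3E
  end
  positions
end

combined = "e\u0301"           # displayed as one accented letter
combined.each_codepoint.to_a   # => [101, 769], i.e. U+0065 and U+0301

doc = "<p>e\u0301</p>"
delimiter_positions(doc)       # => [0, 2, 5, 8]
```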

> Moreover, in encodings other than Unicode, the codepoint is not an
> important component.
> In EUC-JP or Shift_JIS, they are only an identifier of characters:
> "\xA2\xA4" is codepoint 0xA2A4 ... are they useful?
>
> Another reason is that GB18030 has characters consisting of 4 bytes.
> They may be 32 bits wide, but Fixnum is 31 bits in a 32-bit environment.
>
> So we don't want to debut codepoints on the main stage.

Well, I would rephrase that: "You only want to debut codepoints on the  
main stage if you want to make life easy for people doing serious text  
processing in Unicode."  For me, a language that doesn't support  
serious Unicode text processing is unsatisfactory, but maybe I'm in a  
minority.

  -Tim