On Sep 10, 2008, at 7:20 AM, NARUSE, Yui wrote:

>> Yes, there are lots of others.  For example, a full-text indexing
>> system dealing with a word like Québec, which needs to index it the
>> same whether the é appears as one codepoint or two.
>
> I don't know the details of full-text indexing systems,
> so can you explain why normalization can work on one or two
> codepoints but not on one-character strings?

If you're reading a Unicode text, then for all the possible
combinations of e and E and é and É and e + trailing-accent and
E + trailing-accent, you need to index this character in such a way
that it will be found by any of the other combinations.  You can't
possibly do this without access to the codepoints.  I don't think
it's possible for any language to handle all the low-level
normalization and magic, especially since some of it is language-
and locale-specific.  When I'm doing low-level text processing, if
the text is Unicode I need efficient access to the codepoints.  It
doesn't seem like much to ask.
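
(For concreteness, here's a minimal sketch of the kind of index-key
folding I mean.  It assumes a String#unicode_normalize method, which
Ruby doesn't have today, so treat that as a stand-in for whatever
normalization library you use; index_key is a made-up name, and a
real indexer would add the language- and locale-specific rules on
top.)

  # Fold a word to a single index key at the codepoint level.
  def index_key(word)
    # Decompose to NFD, strip the combining marks (U+0300..U+036F),
    # and downcase, so both spellings of "Québec", and "QUEBEC" too,
    # all fold to the same key.
    word.unicode_normalize(:nfd)    # assumed normalization API
        .each_char
        .reject { |c| (0x0300..0x036F).cover?(c.ord) }
        .join
        .downcase
  end

  index_key("Qu\u00E9bec")   # => "quebec"  (one codepoint for é)
  index_key("Que\u0301bec")  # => "quebec"  (e + combining acute)

The stripping step is the point: it's a test on codepoint values,
and there's no way to write it without getting at them.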

>> Actually, for many programmers working in Unicode, what they need
>> isn't String#each_codepoint but IO#each_codepoint, because with
>> variable-length encodings it would be very nice if the library took
>> care of the necessary buffer juggling.
>
> You can get a character whether the IO's encoding is a fixed-length
> encoding or a variable-length encoding.  Isn't this enough?

How?  IO#each_char produces strings, which is perfectly reasonable.

 -Tim
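
P.S.  For concreteness, here's a rough sketch of the sort of
IO#each_codepoint I'm imagining, layered on the existing
IO#each_char so the IO layer does the buffer juggling; it's a sketch
of the proposal, not an existing method, and the file name below is
made up.

  class IO
    # Yield each codepoint of the stream as an Integer.  each_char
    # already yields one complete character at a time, so the buffer
    # boundaries of a variable-length encoding like UTF-8 never show
    # through.
    def each_codepoint
      return enum_for(:each_codepoint) unless block_given?
      each_char { |c| yield c.ord }
    end
  end

  File.open("quebec.txt", "r:UTF-8") do |f|
    f.each_codepoint { |cp| printf("U+%04X\n", cp) }
  end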