On Sep 10, 2008, at 12:55 AM, Tanaka Akira wrote:

> "" is a character, even if it is represented as two
> codepoints.
>
> So ruby should treat it as a character.

Unfortunately it's generally impossible to agree on a definition of  
"character".  There is no notion of character that is compatible  
across all the different systems for encoding text in all the world's 
languages; unfortunate but true.

Also, remember, there is no system of character encoding in which  
"characters" correspond exactly to units of input or storage or  
display.  At least Unicode is honest about this.
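
To make the mismatch concrete, here is a minimal sketch in Ruby. It assumes a much newer Ruby than the one under discussion (String#grapheme_clusters only arrived in 2.5), and the sample string is an illustration, not one taken from this thread.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible letter.
s = "e\u0301"

bytes      = s.bytesize                  # storage units in UTF-8
codepoints = s.codepoints.length         # Unicode codepoints
graphemes  = s.grapheme_clusters.length  # user-perceived characters (Ruby 2.5+)

puts "bytes=#{bytes} codepoints=#{codepoints} graphemes=#{graphemes}"
# prints "bytes=3 codepoints=2 graphemes=1"
```

Three different counts for one thing a reader would call a single letter.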

> I know current ruby doesn't do that.  But it is desirable.
>
> NFC (Normalization Form C) can be a solution for "".  But
> there are characters which don't have single codepoint (as
> some characters defined in JIS X 0213, for example).

Unfortunately NFC isn't a solution because it isn't widely respected, 
so a developer has to deal with nonstandard normalizations. :(
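
A small sketch of both points, assuming a Ruby with String#unicode_normalize (2.2 or later, so not the Ruby being discussed here). The strings are illustrative: the first pair shows why unnormalized input bites, and the second is a kana sequence of the JIS X 0213 kind mentioned above, for which Unicode has no single precomposed codepoint.

```ruby
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT
composed   = "\u00E9"    # precomposed LATIN SMALL LETTER E WITH ACUTE

# Byte-wise comparison treats the two spellings as different strings.
same_before = (decomposed == composed)

# After NFC, the decomposed spelling collapses to the precomposed one.
same_after = (decomposed.unicode_normalize(:nfc) == composed)

# HIRAGANA KA + COMBINING SEMI-VOICED SOUND MARK: Unicode defines no
# precomposed form, so NFC has to leave it as two codepoints.
ka_handakuten = "\u304B\u309A"
still_two = ka_handakuten.unicode_normalize(:nfc).codepoints.length

puts [same_before, same_after, still_two].inspect
# prints "[false, true, 2]"
```

So even where NFC is applied, code that equates "one codepoint" with "one character" still breaks on the second string.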

> I think codepoint is implementation details.  Although it
> may be useful for unicode experts, non-experts will be
> confused with the difference of characters and codepoints.
> I think it should not be provided by default.

I agree.  The programmer who wants to search a list for filenames like  
%r{/([^/]*\.atom)$}, or the zoo catalog for "" should never have  
to think carefully about what a "character" is.  But someone who's  
writing an XML/HTML processor or a typesetter or a full-text search  
system has to think about these things all the time.  And when the  
text is traveling over the internet, the probability is quite high  
that it's in Unicode.
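
For the first kind of programmer, something like the following should just work. This is only a sketch: the paths are hypothetical, invented for the example, and it assumes UTF-8 strings in a current Ruby.

```ruby
# Hypothetical paths, invented for this example.
paths = ["/feeds/ongoing.atom", "/feeds/index.html", "/feeds/caf\u00E9.atom"]

# The pattern from the paragraph above: the last path segment ending in ".atom".
atom_file = %r{/([^/]*\.atom)$}

# String#[] with a regexp and a group number returns the capture, or nil.
atom_names = paths.map { |path| path[atom_file, 1] }.compact
# => ["ongoing.atom", "café.atom"]
```

Nothing in that code had to decide what a "character" is, which is the point.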

I don't want to weaken Ruby's support for legacy encodings.  I just  
want to make it straightforward and efficient to process Unicode.  And  
for that, you need straightforward and efficient access to  
codepoints.  BTW, one reason why Unicode is popular on the Internet is  
that its notion of a character (a 21-bit number, with some associated 
metadata and a suggested graphic rendition) is not perfect, but it's  
clean and clear and makes it possible for programmers to produce  
decent results with multilingual texts.
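
As a rough illustration of what codepoint access can look like, here is a sketch using String#codepoints, String#each_codepoint and Array#pack as they exist in current Ruby. That is an assumption about a later Ruby than the one in this discussion, and the sample string is made up.

```ruby
# ASCII letter, accented letter, and an emoji outside the BMP.
s = "a\u00E9\u{1F600}"

cps = s.codepoints   # => [97, 233, 128512]

# Every Unicode codepoint is at most U+10FFFF, so it fits in 21 bits.
fits_21_bits = cps.all? { |cp| cp < (1 << 21) }

# Rebuild a UTF-8 string from the bare numbers.
round_trip = cps.pack("U*")

# Conventional U+XXXX notation for each codepoint.
hex = s.each_codepoint.map { |cp| format("U+%04X", cp) }
# => ["U+0061", "U+00E9", "U+1F600"]
```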

  -Tim