At 18:07 08/09/10, Manfred Stienstra wrote:
>
>On Sep 10, 2008, at 9:55 AM, Tanaka Akira wrote:
>
>>> Yes, there are lots of others.  For example, a full-text indexing
>>> system dealing with a word like Québec, which needs to index it the
>>> same whether the é appears as one codepoint or two.
>>
>> é is a character, even if it is represented as two
>> codepoints.

For some users maybe yes. For some programmers, maybe no.
Both U+0065, LATIN SMALL LETTER E, and
U+0301, COMBINING ACUTE ACCENT, are characters in their
own right, not just codepoints.
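
To make the two views concrete, here is a small sketch. It assumes a
Ruby that has String#unicode_normalize (2.2 or later, so newer than
the Ruby under discussion here), and the variable names are only for
illustration:

```ruby
# Two spellings of the same user-perceived letter.
composed   = "\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

composed_count   = composed.each_char.to_a.length     # one codepoint
decomposed_count = decomposed.each_char.to_a.length   # two codepoints

# As raw strings the two differ; after normalizing both to NFC they
# compare equal, which is what a full-text index would want.
same_after_nfc =
  composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
```

Normalizing before indexing is one possible approach for the Québec
case above, not the only one.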

>> So ruby should treat it as a character.

I don't think #each_character should do that, although
its name may suggest so. What may happen (maybe for Ruby 2.0)
is that we have a parameter to #each_character which, if present,
leads to lumping the above two characters together.
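
A rough sketch of what such a parameter could look like. The helper
each_character and its combine: keyword are hypothetical, not an
existing Ruby API, and the lumping relies on
String#each_grapheme_cluster, which only exists in Ruby 2.5 and later:

```ruby
# Hypothetical helper, not part of Ruby: iterate codepoint by codepoint
# by default, or by grapheme cluster when combine is true.
def each_character(str, combine: false, &block)
  enum = combine ? str.each_grapheme_cluster : str.each_char
  block ? enum.each(&block) : enum
end

word = "Que\u0301bec"   # "Québec" spelled with e + U+0301

separate = each_character(word).to_a                  # U+0301 is its own element
lumped   = each_character(word, combine: true).to_a   # "e\u0301" stays together
```

Keeping the codepoint-wise behaviour as the default leaves existing
callers untouched, and only those who ask for it pay for the more
expensive segmentation.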


>Yes, but that gets really complicated really fast [1]. And that's not  
>even considering locale dependent features.

Yes. And from Japan, you don't have to go very far, only to Korea,
to meet people with totally different ideas of what a character is.
Some people want to process Hangul (even if it's coming in one
syllable per character) as a sequence of Jamo, others want to
process syllables (even if they are encoded as Jamos), and so on.
And in India and South East Asia, the situation is even more
complicated. While the Japanese writing system is in many ways
the most complicated in the world, this is a complication that
it (mostly) doesn't have.
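
The Hangul case can be sketched as follows, again assuming a Ruby
with String#unicode_normalize (2.2 or later). It only shows that the
same text can be held as one precomposed syllable or as a sequence of
conjoining Jamo; which of the two a program should iterate over is
exactly the question that has no single answer:

```ruby
syllable = "\uD55C"                          # U+D55C HANGUL SYLLABLE HAN
jamo     = syllable.unicode_normalize(:nfd)  # decomposed into conjoining Jamo

syllable_codepoints = syllable.codepoints.length
jamo_codepoints     = jamo.codepoints.map { |cp| format("U+%04X", cp) }

# Recomposing with NFC gives back the single precomposed syllable.
round_trip = jamo.unicode_normalize(:nfc) == syllable
```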

Regards,    Martin.


>[1] http://unicode.org/reports/tr29/tr29-13.html


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp