At 01:42 08/09/17, Tanaka Akira wrote:
>In article <6.0.0.20.2.20080916184943.08a281f0 / localhost>,
>  Martin Duerst <duerst / it.aoyama.ac.jp> writes:
>
>>>> So ruby should treat it as a character.
>>
>> I don't think #each_character should do that, although
>> its name may suggest so. What may happen (maybe for Ruby 2.0)
>> is that we have a parameter to #each_character which, if present,
>> leads to lumping the above two characters together.
>
>Unless doing that, a single character in JIS X 0213 is mapped
>to two characters in Unicode.
>
>It is not the desired result.

It depends very much on what kind of processing is going on.
Just arguing from a particular encoding doesn't really work.

As an example, windows-1258, used for Vietnamese, represents
some 'characters' in decomposed form, as a base letter plus a
combining tone mark, even where Unicode has them precomposed.
Somebody might want to argue that in order to allow windows-1258
processing on Unicode, Ruby has to expose these decompositions.
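
To make this concrete, here is a small sketch in Ruby (the byte
values are my reading of the windows-1258 code page, where 0xEA
is "ê" and 0xEC is a combining acute accent; treat the details
as illustrative, not authoritative):

  s = "\xEA\xEC".force_encoding("Windows-1258")  # decomposed Vietnamese ế
  s.each_char { |c| p c }               # two characters: base letter, tone mark
  p s.encode("UTF-8").codepoints.to_a   # still two codepoints after transcoding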

Arguing from a single encoding may be useful for pointing out
problems. But it isn't a good basis for creating new solutions
from just a few codepoints in a single encoding, in particular if
such a solution is intended to completely replace one that may not
be perfect, but is currently very widely used and well understood.

The work on the W3C character model (see in particular
http://www.w3.org/TR/charmod/#sec-stringIndexing, but
be warned, there is some very dense language) showed clearly
that APIs accessing single characters are not too frequent,
and should be even less frequent. As an example, an API
for case changes should work on strings (with single-character
strings as one usage example), not on single characters
(which is what the traditional C APIs, e.g. toupper(), do).
Ruby already does this right, or gives the user a chance
to do it right.
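
In Ruby terms, just to show the shape of the API (a sketch):

  p "hello".upcase   # string in, string out
  p "a".upcase       # a single character is simply a one-character string

Working on strings also leaves room for 1:N case mappings, such as
U+00DF ("ß") uppercasing to "SS", which a character-to-character
API can never accommodate.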

Considering this advice, in the remaining cases where an
application programmer really should use String#each_char
or something similar for character-by-character access,
my guess is that low-level operations (codepoint by codepoint)
are easily as frequent as higher-level operations
(some variant of grapheme clusters).

Also, it's always possible to implement higher levels on top
of the low level with moderate effort, but doing the reverse
will produce a big mess.
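
As a sketch of that direction (the helper name is mine, and this
is nowhere near a full grapheme cluster algorithm; it merely lumps
combining marks onto the preceding character):

  def each_lumped(str)
    cluster = ""
    str.each_char do |c|
      if cluster.empty? || c =~ /\p{M}/  # combining mark: extend the cluster
        cluster << c
      else
        yield cluster                    # flush the finished cluster
        cluster = c
      end
    end
    yield cluster unless cluster.empty?
  end

  each_lumped("e\u0301s") { |g| p g }  # "e" + U+0301 lumped together, then "s"

Anything fancier (tailored grapheme clusters and so on) can be
layered on in the same way.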

On top of that comes the previously mentioned point that
for different usages and different people, different concepts
of "grapheme cluster" may be necessary.

Also, Japanese information processing has lived for several
decades with the fact that in Half-width Kana, the base
letter and the modifier are separate characters
(e.g. ﾊﾟ or ｾﾟ, the latter being the half-width equivalent
of the specific character in question (see below)).
So dealing with such a case can't be too hard, even if
it may not be optimal for all use cases.
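
In Unicode escape notation, a quick sketch (U+FF7E and U+FF9F are
the half-width SE and the half-width semi-voiced sound mark):

  half = "\uFF7E\uFF9F"        # ｾ + ﾟ, two separate characters
  half.each_char { |c| p c }   # yields the base letter and the mark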


At 19:12 08/09/17, Tanaka Akira wrote:
>In article <a5d587fb0809170303x71ebde31r8adae082b82af182 / mail.gmail.com>,
>  "Michal Suchanek" <hramrach / centrum.cz> writes:
>
>> Can I ask what character(s) that would be?
>
>KATAKANA LETTER SE WITH SEMI-VOICED SOUND MARK for example.
>
>This character is placed at JIS X 0213 plane 1 row 5 column 92.
>http://www.itscj.ipsj.or.jp/ISO-IR/228.pdf
>
>In unicode, it is represented by
>KATAKANA LETTER SE with
>COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK.
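
In Ruby escape notation, the Unicode form looks like this (a small
sketch; the JIS X 0213 code is the one given in the reference above):

  se_p = "\u30BB\u309A"  # KATAKANA LETTER SE + combining semi-voiced mark
  p se_p.length          # => 2 codepoints in Unicode, one JIS X 0213 character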

Just for information, the main use for this character/combination,
as far as I know, is for writing the Ainu language, a minority
language in the north of Japan. As of 1996, a total of 15
remaining speakers were reported
(see http://www.ethnologue.com/show_language.asp?code=ain),
but I don't know whether that's true (and rather hope it isn't).

The average Japanese person's first thought on seeing such a
character would probably be that it must be a misprint.
By this I don't mean to say that support for characters like
this isn't important; quite the contrary.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp