At 09:42 08/09/18, Austin Ziegler wrote:
>On Wed, Sep 17, 2008 at 10:09 AM, Matthias Wächter
><matthias / waechter.wiz.at> wrote:
>> Is there a complete characterization of this whole problem? It seems
>> to be the main reason for sticking to non-UTF-8 character sets in
>> Ruby these days, and concluding from what I have read about it, a
>> solution could be the addition of missing characters/codepoints to
>> Unicode. Why does no-one consider going that way, but instead builds
>> a complicated stack of functions for conversions on top level?
>
>While there is a private use plane, it's not generally interoperable
>to use the private use plane in Unicode.

Very much agreed. Private use areas (a small area in the BMP
(Basic Multilingual Plane) and planes 15 and 16) are a free-for-all,
which means you are never really sure what you will get there.

(for those who want some concise background reading, I
recommend http://www.w3.org/TR/charmod/#sec-PrivateUse)
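
A quick Ruby sketch to make this concrete; the ranges below are the
ones the Unicode standard sets aside for private use, and the helper
name is just for illustration:

  # Private use: one block in the BMP, plus all of planes 15 and 16.
  PRIVATE_USE = [0xE000..0xF8FF,       # BMP private use area
                 0xF0000..0xFFFFD,     # plane 15
                 0x100000..0x10FFFD]   # plane 16

  def private_use?(codepoint)
    PRIVATE_USE.any? { |r| r.include?(codepoint) }
  end

  private_use?(0xE000)  # => true  (meaning is whatever the parties agree on)
  private_use?(0x4E00)  # => false (an ordinary CJK ideograph)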

>Adding glyphs to Unicode is a
>lengthy process that requires going through a standards body. The
>Unicode standard is updated every few years, but the Unicode
>consortium is much more likely to listen to the Japanese standards
>bodies than Ruby programmers.

Well, yes, first because the relevant Japanese standards body
is a member of ISO/IEC JTC1/SC2/WG2, the group responsible for
ISO 10646, which is in sync with Unicode. And second because
Ruby programmers as a group don't have any particular character
encoding needs.


>The reality is that Unicode *doesn't* completely represent all Asian
>languages well

True. There are still many (minor) scripts that are not yet encoded,
and most of them are used in Asia, in the same way as most of the
scripts already encoded are used in Asia.
(For more details, please see http://unicode.org/roadmaps/
and the links from there to the roadmaps for various parts
of Unicode.)

>(see the discussions around Han unification for a brief
>primer on the issues involved).

Complaints about Han unification are mostly unjustified. The discussion
e.g. around Internationalized Domain Names has shown that unification
has significant advantages. You get into problems when e.g. a Latin
'A', a Cyrillic 'A', and a Greek 'A' are encoded separately (as they
currently are, not least because they are encoded separately in
some important East Asian standards).
I do not want to imagine the mess we would have if there were separate
codes for Chinese/Japanese/Korean (and maybe Vietnamese, Taiwanese,...)
"variants" of Han characters such as '一' (one), '二' (two), '三' (three),
and so on.
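
To see the problem in miniature (using Ruby 1.9's \u string escapes),
here are three capital letters that usually render identically but
never compare equal:

  latin    = "\u0041"  # LATIN CAPITAL LETTER A
  greek    = "\u0391"  # GREEK CAPITAL LETTER ALPHA
  cyrillic = "\u0410"  # CYRILLIC CAPITAL LETTER A

  latin == greek      # => false
  latin == cyrillic   # => false
  # All three typically look the same on screen, which is exactly
  # the spoofing hazard that came up in the IDN discussions.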


>The problem is exacerbated in the
>academic arena where people want to be able to represent ancient
>characters accurately, but it's not limited to that.

Yes, and if you look at academic use, the same can be said for
the Western world. As a simple example, Unicode doesn't contain
codepoints for all the many ligatures used in the Gutenberg Bible.
The only difference may be that researchers in the West are
more ready to use an additional layer (e.g. some XML markup)
for this, whereas in Asia the sheer number of existing
characters makes it very easy for people to think that just
adding more characters is the solution to these problems.


>Just because you
>and I can represent our words in under one hundred characters doesn't
>mean that it's appropriate to do the same with others' languages.

Of course not. And Unicode definitely hasn't done that, quite to
the contrary.


Korean got more than 11,000 precomposed syllable characters (11,172,
to be exact), of which by all accounts fewer than 3,000 are actually
used; the only purpose of the rest is to complete a nice-looking
three-dimensional table.
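
That three-dimensional table is quite literal, by the way: the Hangul
syllables are generated algorithmically from 19 leading consonants x
21 vowels x 28 trailing consonants (including "none"), starting at
U+AC00. A sketch of the composition formula from the Unicode standard
(the method name is my own):

  def hangul_syllable(l, v, t = 0)  # l: 0..18, v: 0..20, t: 0..27
    0xAC00 + (l * 21 + v) * 28 + t
  end

  hangul_syllable(0, 0).to_s(16)  # => "ac00" (HANGUL SYLLABLE GA)
  19 * 21 * 28                    # => 11172 syllables in all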

Han characters currently number around 70,000, the majority of which
appear mainly in dictionaries, many with entries of the form (freely
translated): "A: variant/misprint of B, see B."

Mind you, there are still a lot of Han characters (a core of about
21,000) that are really useful because they are supported on everyday
computer systems in China, Japan, Korea, and so on. And a smaller subset
of these (around 2,000-3,000 for Japanese, fewer for Korean, more for
Chinese) is what people actually use day in, day out.
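
(To put a rough number on that core: the original CJK Unified
Ideographs block alone spans U+4E00..U+9FFF, which a one-liner in
Ruby confirms is close to 21,000 codepoints:

  (0x4E00..0x9FFF).count  # => 20992

The everyday 2,000-3,000 are a small slice of that.)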


>It's getting better, but it's still not perfect.

Very much so indeed.

Regards,   Martin.



#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp