On Sat, 10 Aug 2002, Bret Jolly wrote:

> Unicode is no longer something that can be squeezed into two
> bytes, even for practical purposes.  There are over 40 000 CJK
> characters outside the "BMP", that require surrogates in UTF-16.

I agree. Where we may disagree is on the consequences: surrogates work
very well in UTF-16, and require minimal or no extra processing in many
cases.

> For example, the scandalous situation where many Chinese and
> Japanese cannot write their names in unicode will have to be fixed
> eventually...

How many people is this, really? I note that Japanese people have been
putting up with this "scandalous" situation for years now, and will
continue to do so for a long time, as the Shift_JIS and EUC-JP encodings
of JIS X 0208 and JIS X 0212 show no signs of declining in use, and both
repertoires are fully present in the Unicode BMP.

> But UTF-16 was a mistake from the beginning.  It is no longer fixed-
> width, and it is sure to grow much less fixed-width in practice....

UTF-32 is not fixed width either, in any sense that matters. Nothing
can be fixed width in Unicode, because combining characters mean that
a single user-perceived character may span several code points.
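A quick illustration in Python (the particular accented letter is just
an example; any base-plus-combining sequence behaves the same way):

```python
import unicodedata

# "e with acute" as one precomposed code point vs. a base letter plus
# a combining accent -- two code points, one user-perceived character:
composed = "\u00E9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT

# Even in UTF-32, where every code point is one 4-byte unit, the
# decomposed form takes two units for one visible character.
assert len(composed) == 1 and len(decomposed) == 2
assert unicodedata.normalize("NFC", decomposed) == composed
```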

The only "extra" problem that UTF-16 presents over UTF-32 is dealing
with surrogates, and that is a very easy problem to deal with.
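To show how easy it is, here is a sketch of the surrogate-pair
arithmetic in Python (the helper names are mine, not from any
particular library):

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)     # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)    # low (trail) surrogate
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+20BB7 is a CJK character outside the BMP; in UTF-16 it is the
# pair D842 DFB7.
hi, lo = to_surrogates(0x20BB7)
assert (hi, lo) == (0xD842, 0xDFB7)
assert from_surrogates(hi, lo) == 0x20BB7
```

Two shifts and two additions each way; not much of a burden.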

> Yet it is just long enough to introduce an
> endianness nightmare.  The UTF-16 folks try to fix this with a kluge,
> the byte-order mark, but the kluge is an abomination.  It is non-local,
> and hence screws string processing.  It breaks unix's critical shebang
> hack.

Actually, that would not be hard to fix; it's pretty trivial to
modify the kernel to skip a BOM before checking for the shebang.
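Roughly this kind of check, sketched here in Python rather than kernel
C (the function name and return convention are mine):

```python
def detect_bom(data: bytes):
    """Sniff a UTF-16 byte-order mark at the start of a file.

    Returns (encoding, bytes_to_skip); a shebang scanner would skip
    that many bytes before looking for '#!'.
    """
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    return None, 0
```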

> ...and maybe unicode will never be right for
> everybody, so I think Ruby should support other character sets as well,
> including some which are not compatible with unicode.

I certainly agree with that!

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC