On Tue, 13 Aug 2002, Bret Jolly wrote:

>    UTF-8 parsers must ignore "broken" characters because, as I pointed
> out in a previous message, "broken" characters are never valid UTF-8,
> due to the UTF-8 design.  The standard now only allows parsing of
> valid characters (the loopholes that existed in Unicode version 3.0
> were eliminated by updates in versions 3.1 and 3.2). The Unicode
> standard expressly forbids the interpretation of illegal UTF-8
> sequences.

Ah. So does this mean that if I break a String in two in the
middle of a UTF-8 sequence, both halves of the broken sequence will be
preserved, so that the character reappears if I put the two strings
back together again? This, to my mind, is one of the big advantages of
the UTF-16 surrogate character specification.
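
To make the question concrete, here is a minimal sketch at the byte level.
Python is used purely for illustration; the sample text and the helper
name is_valid_utf8 are assumptions, not anything from this thread. It
cuts a UTF-8 byte string inside a multi-byte sequence, shows that neither
half is valid UTF-8 on its own, and shows that the original text comes
back if the raw bytes are kept and rejoined.

```python
# Illustrative sketch only: the sample text and helper name are hypothetical.
text = "日本語"
data = text.encode("utf-8")        # 9 bytes, 3 bytes per character here
left, right = data[:4], data[4:]   # the cut falls inside the second character


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes as strict UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


left_ok = is_valid_utf8(left)      # left ends with a truncated sequence
right_ok = is_valid_utf8(right)    # right starts with a continuation byte

# If the raw bytes of both halves are kept untouched, joining them
# restores the original text.
rejoined = (left + right).decode("utf-8")
```

Whether a particular String class actually keeps the broken halves
intact, rather than rejecting or replacing them, depends on that
implementation; the sketch only shows that nothing is lost as long as
the bytes themselves are preserved.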

>    There are also advantages to a fixed-width encoding, such as the
> recently introduced UTF-32....

I think I've already said this about eight million times, but:

    UTF-32 is not fixed width, due to combining characters.
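
A small sketch of what I mean, in Python for illustration only (the
variable names are mine and the "e with acute accent" sample is just one
convenient case). UTF-32 gives every code point four bytes, but one
character as the user sees it can still take more than one code point.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
decomposed = "e\u0301"
# The same visible character as the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Number of 32-bit code units each form needs in UTF-32 (no BOM).
decomposed_units = len(decomposed.encode("utf-32-le")) // 4
composed_units = len(composed.encode("utf-32-le")) // 4
```

Here decomposed_units is 2 and composed_units is 1, even though both
strings display as the same character, so indexing by 32-bit unit is
not the same as indexing by character.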

> But UTF-16 in both big-endian and little-endian variants is sure to
> be one of those technical blunders which far outlives its
> excusability....

Well, we'll just have to agree to differ on this. I deal with a lot of
Japanese text in my various programs, and at the lowest level (String
objects and suchlike) I find UTF-16 to be by far the most convenient
way of dealing with it. It's small, efficient, lets me do basic
handling of stuff with ease, and lets me push up some of the harder
issues into just the classes that really need it, rather than having to
deal with them everywhere.
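
As a rough illustration of the "small" part, here is a sketch in Python
(chosen only for illustration; the sample string is an assumption and
real text mixed with ASCII markup would give different ratios). It
compares the encoded size of a short run of Japanese text in UTF-8 and
in UTF-16.

```python
# Hypothetical sample: 8 characters of kanji, hiragana and katakana,
# all inside the Basic Multilingual Plane.
sample = "日本語のテキスト"

utf8_size = len(sample.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(sample.encode("utf-16-le"))  # 2 bytes per character, no BOM
```

For this sample utf8_size is 24 and utf16_size is 16. Every character
is also exactly one 16-bit unit, which is what makes the basic handling
easy for text like this.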

>     Though notoriously unwise myself, I'd like to make a plea for
> some wisdom.  Many people here have a great deal of experience with
> internationalization, and rightly consider themselves experts.  But
> expertise comes in many flavors, and one should think twice before
> making assertions about what *other* people need.  The need for
> internationalization, M17n, and so forth by a maker of corporate web
> sites is different from the need of a mathematician, musician, or
> someone trying to computerize Akkadian tablets.  We should avoid the
> parochial thought that our interests are the only important or
> "practical" ones.

Well, I've said all along that Unicode just is not suitable for a
lot of very technical purposes. My argument is that it's *impossible*
for a single character set to deal with everything, and even dealing
with most of it is completely impractical. Thus, use a simple
character set like Unicode and its relatively simple accompanying
algorithms for day-to-day work, and do something custom when you
have requirements beyond that.

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC