On Fri, 2 Aug 2002, MikkelFJ wrote:

> > Yeah. This is getting into complex nightmare city. That's why I'd prefer
> > to have the basic system just work completely in Unicode. One could have
> > a separate character system (character and string classes, byte-stream
> > to char converters, etc.) to work with this tagged format if one wished.
>
> But isn't this what matz suggest?
> Each stream is tagged, that is the same as having different types. It's
> basically just  a different way to store the type while having a lot of
> common string operations.

No, because then you have to deal with conversions. Most popular
character sets are convertible to Unicode and back without loss. That is
not true of any arbitrary pair of character sets, though, even if you go
through Unicode.

The reason for this is as follows. Say character set Foo has split a
unified Han character, so it has both "a" and a variant "A", while
character set Bar has only "a". When you convert Foo to Unicode, that
"A" is preserved because it's assigned a code point in a compatibility
area, and when you convert back from Unicode, that "A" is translated
back to "A" in Foo. However, converting that Unicode text to Bar folds
the "A" into "a", and when you go from Bar back to Unicode you end up
with "a" again: once that conversion has happened, there's no way to
tell that the character was originally "A".
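The round trip can be sketched as a toy model in Ruby. The tables below are made up for illustration (they are not a real conversion library, and the code points are arbitrary): Foo keeps both "a" and its variant "A", while Bar only has "a".

```ruby
# Made-up conversion tables, purely illustrative.
FOO_TO_UNICODE = { "a" => 0x4E00, "A" => 0xF900 }  # variant kept via a compatibility code point
UNICODE_TO_BAR = { 0x4E00 => "a", 0xF900 => "a" }  # Bar folds the variant away
BAR_TO_UNICODE = { "a" => 0x4E00 }

code   = FOO_TO_UNICODE["A"]     # 0xF900, survives Foo -> Unicode
in_bar = UNICODE_TO_BAR[code]    # "a": Bar cannot represent "A"
back   = BAR_TO_UNICODE[in_bar]  # 0x4E00, not 0xF900
puts back == FOO_TO_UNICODE["A"] # false: the distinction is gone for good
```

Foo -> Unicode -> Foo is lossless, but Foo -> Unicode -> Bar -> Unicode is not, which is exactly why tagged strings in arbitrary encodings can't be converted among themselves freely.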

But there's an even better reason than this for converting to
Unicode on input, rather than doing internal tagging. If you don't
have conversion tables for a particular character encoding, it's
much better to find out at the time you try to get the information
into the system than at some arbitrary later point when you try
to do a conversion. That way you know where the problem information
is coming from.

In terms of interface, I would say:

    1. Continue to use String as it is for "binary" data. This is
    efficient, if you don't need to do much processing.

    2. Add a UString or similar for dealing with UTF-16 data. There's
    no need for surrogate support in this, for reasons I will get into
    below, so this is straight fixed width. Reasonably efficient (almost
    maximally efficient for those of us using Asian languages :-)) and
    very easy to use.

    3. Add other, specialized classes when you need to do special
    purpose things. No need for this in the standard distribution.
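Point 2 might look something like the sketch below. The name UString is from the proposal above, but the method set is purely my illustration, not an agreed interface.

```ruby
# Rough sketch of a UString: a fixed-width sequence of 16-bit code
# units with direct indexing. Illustrative only.
class UString
  def initialize(units)
    @units = units          # array of integers 0..0xFFFF
  end

  def length
    @units.length           # counts code units, not characters
  end

  def [](i)
    @units[i]               # O(1): no scanning, unlike a variable-width encoding
  end

  def slice(i, n)
    UString.new(@units[i, n])
  end
end

s = UString.new([0x65E5, 0x672C, 0x8A9E])  # "Japanese" written in Japanese
puts s.length  # 3: one code unit per character
```

For almost all real text, including Asian text, one code unit is one character, which is where the "almost maximally efficient" claim comes from.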

> BTW: Unicode is not a fixed with format.

In terms of code values, it is fixed width. However, some characters are
represented by pairs of code values.

> ...but there are escape codes...

No, there are no escape codes. The high and low code values for
surrogate characters have their own special areas, and so are easily
identifiable.
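Concretely: the high-surrogate range is 0xD800..0xDBFF and the low-surrogate range is 0xDC00..0xDFFF, so any code unit identifies itself in isolation, with no escape state or lookahead. A quick Ruby check:

```ruby
# Surrogate ranges as fixed by the Unicode standard.
def high_surrogate?(u)
  (0xD800..0xDBFF).cover?(u)
end

def low_surrogate?(u)
  (0xDC00..0xDFFF).cover?(u)
end

# "A", then the surrogate pair for U+20BB7, then HIRAGANA LETTER A.
[0x0041, 0xD842, 0xDFB7, 0x3042].each do |u|
  kind = high_surrogate?(u) ? "high" :
         low_surrogate?(u)  ? "low"  : "ordinary"
  puts format("%04X: %s", u, kind)
end
```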

> and options for future extensions.

Not that I know of. Can you explain what these are?

> Hence UCS-4 is a strategy with limited timespan.

Not at all, unless they decide to change Unicode to the point where it
no longer uses 16-bit code values, or add escape codes, or something
like that. That would be backward-incompatible, severely complicate
processing, and generally cause all hell to break loose. So I'd rate
this as "Not Likely."

Here are a few points to keep in mind about Unicode processing:

    1. The surrogate pairs are almost never used. Two years ago
    there weren't even any characters assigned to those code points.

    2. There are many situations where, even if surrogate pairs
    are present, you don't know or care, and need do nothing to
    correctly deal with them.

    3. Broken surrogate pairs are not a problem; the standard says you
    must be able to ignore broken pairs, if you interpret surrogate
    pairs at all.

    4. The surrogate pairs are extremely easy to distinguish, even
    if you don't interpret them.

    5. The code for dealing with surrogate pairs well (basically,
    not breaking them) is very simple.

The implication of point 1 is that one should not spend a lot of effort
dealing with surrogate pairs, as very few users will ever use them. Very
few Asian users will ever use them in their lifetimes, in fact.

The implication of points 2 and 3 is that not everything that deals
with Unicode has to deal with, or even know about, surrogate pairs. If
you are writing a web application, for example, your typical fields you
just take as a whole from the web browser or database, and give as a whole
to the web browser or database. Thus only the web browser really has any
need at all to deal with surrogate pairs.

If you take a substring of a string and in the process end up with
a surrogate pair half on either end, that's no problem. It just
gets ignored by whatever facilities deal with surrogate pairs, or
treated as an unknown character by those that don't (rather than
two unknown characters for an unsplit surrogate pair).

The only time you really run into a problem is if you insert
something into a string; there's a chance you might split the
surrogate pair, and lose the character. This is pretty uncommon
except in interactive input situations, though, where you know how
to handle surrogate pairs and can avoid doing this, or where you
don't know and the user can't see the characters anyway.

Well, another area you can run into problems with is line wrapping, but
there's no single algorithm for that anyway, and plenty of algorithms
break on languages for which they were not designed. So there you should
add some very simple code that avoids splitting surrogate pairs. (This
code is much simpler than the line wrapping code anyway, so it's hardly
a burden.) That shows the advantage of the last two points in the list
above (essentially the same point).
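Here's roughly what I mean by "very simple code": given a proposed break position in an array of UTF-16 code units, step back one unit if the break would fall between a high and a low surrogate. (The helper is written for this post, not taken from any library.)

```ruby
# Nudge a break position off the middle of a surrogate pair.
def safe_break(units, pos)
  if pos > 0 && pos < units.length &&
     (0xD800..0xDBFF).cover?(units[pos - 1]) &&  # high surrogate before the break
     (0xDC00..0xDFFF).cover?(units[pos])         # low surrogate after the break
    pos - 1
  else
    pos
  end
end

units = [0x0041, 0xD842, 0xDFB7, 0x0042]  # A, one surrogate pair, B
puts safe_break(units, 2)  # 1: moved off the middle of the pair
puts safe_break(units, 3)  # 3: already a safe boundary
```

That's the whole burden: one bounds check and two range tests, wrapped around whatever line-wrapping algorithm you already have.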

So I propose just what the Unicode standard itself proposes in
section 5.4: UString (or whatever we call it) should have the
Surrogate Support Level "none"; i.e., it completely ignores the
existence of surrogate pairs. Things that use UString that have
the potential to encounter surrogate pair problems or wish to
interpret them can add simple or complex code, as they need, to
deal with the problem at hand. (Many users of UString will need to
do nothing.)

Note that there's a big difference between this and your UTF-8
proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
much more lossage because there's a much, much bigger chance of
breaking things when using Asian languages. With UTF-16, you probably
won't even encounter surrogates, whereas with Japanese in UTF-8,
pretty much every character is multibyte.
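You can see the size difference directly in modern Ruby (String#encode postdates this discussion, so treat this as an after-the-fact illustration):

```ruby
# Three Japanese characters, encoded both ways.
text = "\u65E5\u672C\u8A9E"            # "Japanese" written in Japanese
puts text.encode("UTF-8").bytesize     # 9: three bytes per character
puts text.encode("UTF-16LE").bytesize  # 6: two bytes per character
```

In UTF-8 every one of those characters is multibyte, so code that ignores the multibyte structure breaks on essentially all Japanese text; in UTF-16 the same code only misbehaves on the rare surrogate pair.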

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC