On Sat, 3 Aug 2002, MikkelFJ wrote:

> Obviously you know more about Unicode than most. What is the practical
> difference between UCS-4, UCS-2 and UTF-16?

I don't have my spec. handy, so I'm going from memory here; someone with
the spec in front of him should correct me if I'm wrong.

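UCS-4 is a four-byte encoding, and UCS-2 a two-byte encoding, for
ISO 10646. UCS-2 is similar to UTF-16, which is a Unicode encoding; the
difference is that UTF-16 can also represent characters beyond the
16-bit range by using pairs of surrogate code values, which UCS-2 cannot.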

> Is it that "extended
> characters" - or surrogates - will take on more space than UCS-4 but
> typically take up the same space as UCS-2?

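All characters take up four bytes in UCS-4. Each code value takes up
two bytes in UCS-2 and UTF-16; in UTF-16, characters outside the 16-bit
range need two code values (a surrogate pair), so they take four bytes,
the same as in UCS-4.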
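
To make the sizes concrete, here is a rough sketch in Python (my choice
of language purely for illustration; the names utf16_code_units and
encoded_sizes are made up for this example). It computes the 16-bit code
values by hand and compares the encoded sizes, using UTF-32 as a
stand-in for UCS-4.

```python
def utf16_code_units(ch):
    """Return the 16-bit code values UTF-16 uses for a single character."""
    cp = ord(ch)
    if cp < 0x10000:
        # Fits in one 16-bit code value.
        return [cp]
    # Outside the 16-bit range: split into a high and a low surrogate.
    v = cp - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def encoded_sizes(ch):
    """Return the size in bytes of one character in each encoding."""
    return {
        "UCS-4/UTF-32": len(ch.encode("utf-32-be")),
        "UTF-16": len(ch.encode("utf-16-be")),
    }


bmp = "\u3042"         # HIRAGANA LETTER A, inside the 16-bit range
beyond = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

for ch in (bmp, beyond):
    units = [hex(u) for u in utf16_code_units(ch)]
    print(hex(ord(ch)), units, encoded_sizes(ch))
```

The first character needs one code value (two bytes) in UTF-16; the
second needs two code values, so it is no smaller than in UCS-4.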

> > Not at all, unless they decide to change Unicode to the point where it
> > no longer uses 16-bit code values, or add escape codes, or something
> > like that. That would be backward-incompatible, severely complicate
> > processing, and generally cause all hell to break loose. So I'd rate
> > this as "Not Likely."
>
> It wouldn't be the first time hell breaks loose in this area though.

It would for Unicode. I don't think they're likely to completely
break backwards compatibility.

> >     2. There are many situations where, even if surrogate pairs
> >     are present, you don't know or care, and need do nothing to
> >     correctly deal with them.
>
> Does this mean that UCS-2 is the best format?

In my opinion, yes.

> I did not mean that you should ignore the content. But you can process it
> as if it were ASCII because in many languages everything that is not text is
> found in the ASCII range. Due to the way UTF-8 is encoded you never risk
> getting a spurious ASCII character following this path. For example, you can
> find delimited text simply by scanning from one double quote to the next.
> Everything in between is sound text, possibly in UTF-8 - you do not need to
> care about this.

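Right. The same is true of UTF-16, so long as you scan 16-bit code
values rather than individual bytes: surrogates never collide with the
ASCII range. However, UTF-16 has the advantage that it's more compact
when representing Japanese or other Asian languages (typically two
bytes per character where UTF-8 needs three), and it's easier to
manipulate individual characters.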
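
A small sketch of the delimiter scan, again in Python and again only as
an illustration: quoted_spans and utf16_units are names I made up, and
the sample strings are arbitrary. The same scanner runs over UTF-8 bytes
and over UTF-16 code values, and the last lines compare the sizes of a
short piece of Japanese text.

```python
QUOTE = 0x22  # ASCII double quote


def quoted_spans(units):
    """Return (start, end) index pairs for the text between double quotes.

    `units` is any sequence of integers: UTF-8 bytes or 16-bit UTF-16
    code values. An unterminated trailing quote is ignored.
    """
    spans = []
    start = None
    for i, u in enumerate(units):
        if u == QUOTE:
            if start is None:
                start = i + 1
            else:
                spans.append((start, i))
                start = None
    return spans


def utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code values."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def units_to_text(units):
    """Decode a list of 16-bit UTF-16 code values back to text."""
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")


text = 'name = "日本語\U0001D11E" # comment'

utf8 = text.encode("utf-8")
(s8, e8), = quoted_spans(utf8)
from_utf8 = utf8[s8:e8].decode("utf-8")

u16 = utf16_units(text)
(s16, e16), = quoted_spans(u16)
from_utf16 = units_to_text(u16[s16:e16])

print(from_utf8 == from_utf16)

# Scanning UTF-16 byte by byte is NOT safe: U+2200 contains the byte 0x22.
print(b'"' in "\u2200".encode("utf-16-be"))

# Size comparison for a short Japanese string.
jp = "日本語のテキスト"
print(len(jp.encode("utf-8")), len(jp.encode("utf-16-be")))
```

Neither scan ever needs to decode what sits between the quotes, which is
the point being made above.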

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC