On Wed, 7 Aug 2002, Marcin 'Qrczak' Kowalczyk wrote:

> The most straightforward internal representation is UTF-32: each
> character is stored in 4 bytes. All other encodings are either
> variable-length or can't represent all Unicode characters.

UTF-8, UTF-16 and UTF-32 can all represent every Unicode character,
and all are variable length in one sense or another: even in UTF-32,
a single user-visible character can take several code points, because
of combining characters.
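
To make the combining-character point concrete, here is a minimal
sketch, assuming Python 3 and its standard codecs (the variable names
are mine, purely for illustration). It shows one visible character
that occupies two UTF-32 code units until it is normalized:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displays as one
# character, but is two code points.
decomposed = "e\u0301"

# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# "utf-32-le" is used so that no byte order mark is added to the count.
decomposed_bytes = len(decomposed.encode("utf-32-le"))  # 2 code points * 4
composed_bytes = len(composed.encode("utf-32-le"))      # 1 code point * 4

print(len(decomposed), decomposed_bytes)
print(len(composed), composed_bytes)
```

Not every combining sequence has a precomposed form, so normalization
does not make the issue go away in general.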

> If you need compactness or ASCII compatibility, use UTF-8.

UTF-8 is much less compact than UTF-16 for Asian text: most CJK
characters take 3 bytes in UTF-8 but only 2 in UTF-16.
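
A rough illustration of the size difference, again assuming Python 3
(the sample strings are arbitrary choices of mine, not a benchmark):

```python
def encoded_sizes(text):
    """Return byte counts for text in UTF-8, UTF-16 and UTF-32.

    The "-le" codec variants are used so no byte order mark is counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Three kanji, U+65E5 U+672C U+8A9E ("Japanese language").
japanese = encoded_sizes("\u65e5\u672c\u8a9e")

# Plain ASCII for contrast; here UTF-8 is the compact one.
ascii_text = encoded_sizes("Ruby")

print(japanese)
print(ascii_text)
```

For the kanji, UTF-8 needs 9 bytes against 6 for UTF-16; for the ASCII
string it is 4 against 8. Which encoding is smaller depends on the text.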

> Most characters (i.e. below U+FFFF) are encoded with 1, 2 or 3 bytes.

As opposed to UTF-16, where every such character takes exactly 2 bytes.

And characters above U+FFFF take 4 bytes in UTF-16 (a surrogate pair),
the same as in standard UTF-8. They only take 6 bytes if each surrogate
is encoded separately, which is not valid UTF-8.
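
The per-character byte counts can be checked with a short sketch,
assuming Python 3. The helper name and sample code points are my own
picks. Note that standard UTF-8 encodes a character above U+FFFF in
4 bytes; the 6-byte figure only appears when the two surrogates are
encoded one at a time, which the "surrogatepass" error handler is used
to simulate here:

```python
def widths(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single code point."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# One sample from each UTF-8 length class: U+0041, U+00E9, U+20AC,
# and U+10348, which lies outside the Basic Multilingual Plane.
samples = ["\u0041", "\u00e9", "\u20ac", "\U00010348"]
table = {f"U+{ord(ch):04X}": widths(ch) for ch in samples}

# Split U+10348 into its UTF-16 surrogate pair by hand.
offset = 0x10348 - 0x10000
high = chr(0xD800 + (offset >> 10))
low = chr(0xDC00 + (offset & 0x3FF))

# Encoding each surrogate separately is NOT valid UTF-8; Python only
# permits it when the "surrogatepass" error handler is requested.
surrogates_encoded_separately = (high + low).encode("utf-8", "surrogatepass")

for name, (u8, u16) in table.items():
    print(name, u8, u16)
print(len(surrogates_encoded_separately))
```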

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC