James Edward Gray II wrote:
> UTF-8, UTF-16, and UTF-32 are encodings of Unicode code points.  They are all capable of representing all code points.  Nothing in this discussion is a subset of anything else.

To add to this, Unicode 3 uses the codespace from 0 to 0x10FFFF (not 0xFFFFFFFF),
so it has room for all the CJK characters (unlike Unicode 2 as implemented in
earlier Java versions, which only covers 0..0xFFFF). Klingon and Elvish, though,
have only unofficial private-use assignments (the ConScript Unicode Registry),
not standard codepoints.

UTF-8 needs up to four bytes per codepoint; the four-byte form carries 21 bits
(enough to encode 0x10FFFF), though if you extend the pattern to six bytes (as the
original specification and many implementations do) it has a 31-bit gamut.
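
To make the byte counts concrete, here is a rough Ruby sketch of the length
table. The method name utf8_length is my own invention for illustration, and
the 5- and 6-byte rows are the extended pattern, which Ruby's own encoder is
not asked to produce here.

```ruby
# Number of bytes UTF-8 needs for a code point, following the bit-pattern
# table. utf8_length is a hypothetical helper, not a library method.
def utf8_length(cp)
  case cp
  when 0..0x7F               then 1  # 7 payload bits
  when 0x80..0x7FF           then 2  # 11 payload bits
  when 0x800..0xFFFF         then 3  # 16 payload bits
  when 0x10000..0x1FFFFF     then 4  # 21 payload bits (covers 0x10FFFF)
  when 0x200000..0x3FFFFFF   then 5  # 26 payload bits (extended pattern)
  when 0x4000000..0x7FFFFFFF then 6  # 31 payload bits (extended pattern)
  else raise ArgumentError, "out of range: #{cp}"
  end
end

# Cross-check against Ruby's own encoder for some valid code points.
[0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF].each do |cp|
  actual = [cp].pack("U").bytesize
  puts format("U+%06X: %d bytes (Ruby says %d)", cp, utf8_length(cp), actual)
end
```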

UTF-16 encodes the additional codespace using surrogate pairs, each of which is a
pair of 16-bit numbers carrying a 10-bit payload apiece. Because it's still a
variable-length encoding, it's just as painful to work with as UTF-8, but less
space-efficient for mostly-ASCII text.
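
The surrogate arithmetic is short enough to sketch in Ruby. The method names
to_surrogates and from_surrogates are made up for this example; the constants
0xD800 and 0xDC00 are the starts of the high and low surrogate ranges.

```ruby
# Split a supplementary code point (0x10000..0x10FFFF) into a surrogate pair.
# to_surrogates / from_surrogates are hypothetical helper names.
def to_surrogates(cp)
  raise ArgumentError, "not supplementary" unless (0x10000..0x10FFFF).cover?(cp)
  v = cp - 0x10000            # leaves a 20-bit value
  [0xD800 + (v >> 10),        # high surrogate carries the top 10 bits
   0xDC00 + (v & 0x3FF)]      # low surrogate carries the bottom 10 bits
end

# Recombine a surrogate pair into the original code point.
def from_surrogates(hi, lo)
  0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
end

hi, lo = to_surrogates(0x1D11E)    # MUSICAL SYMBOL G CLEF
puts format("%04X %04X", hi, lo)   # prints "D834 DD1E"
```

Two 10-bit payloads give 20 bits, and adding 0x10000 back is what makes the top
of the range 0x10FFFF rather than 0xFFFFF.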

Both UTF-8 and UTF-16 encodings allow you to look at any location in a string and step
forward or back to the nearest character boundary - a very important property that
was missing from Shift-JIS and other earlier encodings.
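
As a rough illustration of that property, the following Ruby sketch steps back
from an arbitrary offset to the start of the enclosing character. Both method
names are hypothetical, the input is assumed to be well-formed, and the UTF-16
version works on an array of 16-bit code units rather than a String.

```ruby
# Step back from byte offset i to the first byte of the UTF-8 character
# containing it. Continuation bytes always look like 10xxxxxx.
def utf8_char_start(bytes, i)
  i -= 1 while i > 0 && (bytes[i] & 0xC0) == 0x80
  i
end

# Same idea for UTF-16 code units: a low surrogate (0xDC00..0xDFFF) is
# never the start of a character, so step back one unit at most.
def utf16_char_start(units, i)
  i -= 1 if i > 0 && (0xDC00..0xDFFF).cover?(units[i])
  i
end

bytes = "a\u20ACb".bytes.to_a   # 61 E2 82 AC 62
p utf8_char_start(bytes, 3)     # offset 3 is the euro sign's last byte; prints 1
```

Stepping forward works the same way in the other direction: skip units until
one is found that is not a continuation byte (or not a low surrogate).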

If you go back to 2003 in the archives, you'll see I engaged in a long and somewhat
heated discussion about this subject with Matz and others back then. I'm glad we
finally have a Ruby version that can at least do this stuff properly, even though
I think it's over-complicated.

Clifford Heath.