On Thu, 8 Aug 2002, Marcin 'Qrczak' Kowalczyk wrote:

> No, there are no surrogates in UTF-8. Characters above U+FFFF are
> encoded in 4 bytes each. Surrogates exist only in UTF-16.

Sorry; you're right.

> Anyway, if variable width is not a problem (and you say it isn't if
> you defend UTF-16),

Well, actually the point with UTF-16 is that you can, in general, safely
ignore the variable-width stuff. I don't think you can do that so easily
in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
that read it required to ignore the leftover bytes, as they are with an
unpaired surrogate in UTF-16? Or is it likely that they will break instead?
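
To make the question concrete, here is a minimal sketch of what one
particular decoder does with a chopped sequence. It assumes Python 3's
built-in codecs, picked only as a convenient example; the helper name
decode_report is made up for illustration. It shows one implementation's
behaviour, not what either encoding's specification requires.

```python
# Sketch: feed one decoder (Python 3's built-in codecs) a multi-unit
# sequence that has been chopped in the middle, in UTF-8 and in UTF-16.

def decode_report(data, encoding):
    """Return (strict outcome, lenient result) for the given bytes."""
    try:
        strict = "ok: " + repr(data.decode(encoding))
    except UnicodeDecodeError as exc:
        strict = "error: " + exc.reason
    # errors="replace" substitutes U+FFFD for input it cannot decode.
    lenient = data.decode(encoding, errors="replace")
    return strict, lenient

text = "a\U00010400"  # 'a' followed by a character above U+FFFF

# UTF-8: drop the last byte of the 4-byte sequence.
utf8_cut = text.encode("utf-8")[:-1]
# UTF-16 (big-endian, no BOM): drop the low surrogate, leaving a lone
# high surrogate at the end.
utf16_cut = text.encode("utf-16-be")[:-2]

for name, data in (("utf-8", utf8_cut), ("utf-16-be", utf16_cut)):
    strict, lenient = decode_report(data, name)
    print(name, "|", strict, "|", repr(lenient))
```

With this decoder, whether the truncated input is rejected or quietly
patched up depends on the error handler the application asks for, and
that holds for both encodings.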

Anyway, I'm open to various arguments on the use of UTF-8 vs. UTF-16. I
suspect that UTF-16 would be rather easier to use, but I've not actually
done a thorough analysis.

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC