On Thu, Aug 08, 2002 at 06:28:34AM +0900, Marcin 'Qrczak' Kowalczyk wrote:
> Wed, 7 Aug 2002 16:41:18 +0900, Curt Sampson <cjs / cynic.net> writes:
> 
> > UTF-8 is much less compact than UTF-16 for Asian text.
> 
> Well, not that much: at most 3/2 times larger.
> 
> > And in UTF-16, surrogate pairs are encoded with 4 bytes, whereas
> > they take 6 bytes in UTF-8.
> 
> No, there are no surrogates in UTF-8. Characters above U+FFFF are
> encoded in 4 bytes each. Surrogates exist only in UTF-16.
> 
> Anyway, if variable width is not a problem (and you say it isn't if
> you defend UTF-16), I would almost always choose UTF-8 as the default.
> Yes, up to 3/2 larger for Asian text, but twice more compact for ASCII,
> free of endianness issues, and ASCII-compatible which is very important.

UTF-8 is 50% larger than UTF-16 only for text that consists entirely of
Asian characters. A typical Asian document contains both Asian and ASCII
characters. With markup such as HTML, ASCII characters strongly outnumber
Asian ones.

Example (sizes in bytes):
http://www.ruby-lang.org/ja/whats.html (jis)   4074
iconved to euc                                 3756
iconved to utf-8                               4304
iconved to utf-16                              6418
iconved to utf-32                             12836
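
For anyone who wants to try this kind of measurement without iconv, here
is a minimal Python sketch. The sample string is made up for illustration
(a bit of HTML markup around two kana); it is not taken from the page
above, and the codec names are Python's, not iconv's.

```python
# Hypothetical sample: 11 ASCII characters of markup and text,
# plus 2 Japanese characters (U+3068, U+306F).
sample = "<p>Ruby\u3068\u306f</p>"

# Python codec names; "utf-16" and "utf-32" prepend a byte-order mark.
encodings = ["euc_jp", "utf-8", "utf-16", "utf-32"]

# Encode the same text in each encoding and record the byte count.
sizes = {name: len(sample.encode(name)) for name in encodings}

for name in encodings:
    print(f"{name:8} {sizes[name]:4}")
```

Even in this tiny sample UTF-8 ends up only slightly larger than EUC,
while UTF-16 is nearly twice as large.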

If we assume that ASCII characters take 1 byte in both EUC and UTF-8,
and Asian characters take 2 bytes in EUC and 3 in UTF-8,
the numbers we get are:

ASCII:      2660
Asian:       548
Proportion: 4.85 : 1
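
The arithmetic behind these numbers, as a small Python sketch. It only
uses the EUC and UTF-8 sizes measured above and the 1/2/3-byte
assumption; the variable names are mine.

```python
# Measured file sizes in bytes.
euc_size = 3756
utf8_size = 4304

# EUC:   ascii + 2 * asian = 3756
# UTF-8: ascii + 3 * asian = 4304
# Subtracting the first equation from the second leaves the Asian count.
asian = utf8_size - euc_size
ascii_count = euc_size - 2 * asian
ratio = ascii_count / asian

# Cross-check: predicted sizes with 2 and 4 bytes per character,
# not counting any byte-order mark.
utf16_est = 2 * (ascii_count + asian)
utf32_est = 4 * (ascii_count + asian)

print(ascii_count, asian, round(ratio, 2))
print(utf16_est, utf32_est)
```

The UTF-16 and UTF-32 estimates come out 2 and 4 bytes below the measured
6418 and 12836, which is what a byte-order mark would account for,
assuming iconv wrote one.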