On Thu, Aug 08, 2002 at 06:28:34AM +0900, Marcin 'Qrczak' Kowalczyk wrote:
> Wed, 7 Aug 2002 16:41:18 +0900, Curt Sampson <cjs / cynic.net> writes:
>
> > UTF-8 is much less compact than UTF-16 for Asian text.
>
> Well, not that much: at most 3/2 times larger.
>
> > And in UTF-16, surrogate pairs are encoded with 4 bytes, whereas
> > they take 6 bytes in UTF-8.
>
> No, there are no surrogates in UTF-8. Characters above U+FFFF are
> encoded in 4 bytes each. Surrogates exist only in UTF-16.
>
> Anyway, if variable width is not a problem (and you say it isn't if
> you defend UTF-16), I would almost always choose UTF-8 as the default.
> Yes, up to 3/2 larger for Asian text, but twice more compact for ASCII,
> free of endianness issues, and ASCII-compatible which is very important.

UTF-8 is 50% larger than UTF-16 only for text that consists entirely of
Asian characters. A typical Asian document contains both Asian and ASCII
characters. In markup such as HTML, ASCII characters strongly outnumber
Asian ones.

Example: http://www.ruby-lang.org/ja/whats.html (sizes in bytes)

  (jis)                4074
  iconved to euc       3756
  iconved to utf-8     4304
  iconved to utf-16    6418
  iconved to utf-32   12836

If we assume that ASCII characters take 1 byte in both EUC and UTF-8, and
that Asian characters take 2 bytes in EUC and 3 bytes in UTF-8, the
character counts we get are:

  ASCII  2660
  Asian   548

Proportion: 4.85 : 1
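For anyone who wants to check the arithmetic, here is a minimal Python sketch. It solves the two equations implied above (a + 2j = EUC size, a + 3j = UTF-8 size) for the ASCII count a and the Asian count j. The function and variable names are mine, purely for illustration. The cross-check against the UTF-16 and UTF-32 sizes rests on an additional assumption not stated in the message: that iconv prepended a byte order mark (2 bytes for UTF-16, 4 bytes for UTF-32) and that no character lies outside the BMP.

```python
# Byte sizes reported in the message for the same page in each encoding.
EUC_BYTES = 3756
UTF8_BYTES = 4304
UTF16_BYTES = 6418
UTF32_BYTES = 12836


def split_counts(euc_bytes, utf8_bytes):
    """Solve a + 2*j = euc_bytes and a + 3*j = utf8_bytes for (a, j).

    a is the number of ASCII characters (1 byte in both encodings),
    j is the number of Asian characters (2 bytes in EUC, 3 in UTF-8).
    """
    asian = utf8_bytes - euc_bytes      # each Asian char costs 1 extra byte
    ascii_ = euc_bytes - 2 * asian      # what remains of the EUC size
    return ascii_, asian


ascii_chars, asian_chars = split_counts(EUC_BYTES, UTF8_BYTES)
ratio = ascii_chars / asian_chars
total_chars = ascii_chars + asian_chars

# Assumption (not from the message): the converted files start with a BOM
# and every character fits in one UTF-16 code unit.
utf16_estimate = 2 * total_chars + 2
utf32_estimate = 4 * total_chars + 4

print(f"ASCII {ascii_chars}, Asian {asian_chars}, ratio {ratio:.2f} : 1")
print(f"UTF-16 estimate {utf16_estimate} (reported {UTF16_BYTES})")
print(f"UTF-32 estimate {utf32_estimate} (reported {UTF32_BYTES})")
```

Under those assumptions the estimates match the reported UTF-16 and UTF-32 sizes exactly, which suggests the 1-byte/2-byte/3-byte model fits this page well.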