On 15/06/06, Juergen Strobel <strobel / secure.at> wrote:
> On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:
...
> > It could be up to six bytes at one point. However, I think that there
> > is still support for surrogate characters meaning that a single glyph
> > *might* take as many as eight bytes to represent in the 1-4 byte
> > representation. Even with that, though, those are rare and usually
> > user-defined (private) ranges IIRC. This also doesn't deal with
> > (de)composed glyphs/combining glyphs.
>
> No. According to Wikipedia, it is up to 4 bytes for plain UTF8 for
> all characters. Only Java may need more than that because of their use
> of UTF16 surrogates and special \0 handling in an intermediary step. See

Austin's correct about six bytes, actually. The original UTF-8
specification *was* for up to six bytes:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

However, no codepoints were ever defined in the upper part of the
range, and once Unicode was officially restricted to the range
0-0x10FFFF, there was no longer any need for the five- and six-byte
sequences.

Compare RFC 2279 from 1998 (six bytes)

http://tools.ietf.org/html/2279

and RFC 3629 from 2003 (four bytes)

http://tools.ietf.org/html/3629

That Java encoding (UTF-8-encoded UTF-16) isn't really UTF-8, though,
so you'd never get eight bytes in valid UTF-8:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters. (RFC 3629)

Paul.
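For anyone who wants to check the byte counts above, here is a rough sketch in Python 3 (not Ruby, and not taken from either RFC). The names utf8_length and is_valid_utf8 are made up for illustration, and the two byte strings are hand-built examples: one five-byte sequence in the old RFC 2279 layout, and one surrogate pair with each half encoded as three bytes. It assumes Python's built-in "utf-8" codec follows the RFC 3629 rules, which is how it behaves in Python 3.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes RFC 3629 UTF-8 needs for one code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range 0..0x10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and must not be encoded in UTF-8.
        raise ValueError("surrogate code points are not encodable in UTF-8")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Cross-check the table above against Python's own encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

# Five-byte sequence in the old RFC 2279 layout (it would have meant 0x200000).
OLD_FIVE_BYTE_FORM = b"\xf8\x88\x80\x80\x80"

# U+D800 and U+DC00, each encoded as three bytes, the way a
# "UTF-8-encoded UTF-16" scheme would write a surrogate pair.
SURROGATE_PAIR_AS_UTF8 = b"\xed\xa0\x80\xed\xb0\x80"

print(is_valid_utf8(OLD_FIVE_BYTE_FORM))
print(is_valid_utf8(SURROGATE_PAIR_AS_UTF8))
```

Both sequences should be rejected by a strict decoder, so the longest thing that gets through for a single code point is four bytes.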