On 15/06/06, Juergen Strobel <strobel / secure.at> wrote:
> On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:
...
> > It could be up to six bytes at one point. However, I think that there
> > is still support for surrogate characters meaning that a single glyph
> > *might* take as many as eight bytes to represent in the 1-4 byte
> > representation. Even with that, though, those are rare and usually
> > user-defined (private) ranges IIRC. This also doesn't deal with
> > (de)composed glyphs/combining glyphs.
>
> No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
> all characters. Only Java may need more than that because of its use
> of UTF-16 surrogates and special \0 handling in an intermediary step. See

Austin's correct about six bytes, actually. The original UTF-8
specification *was* for up to six bytes:
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

However, no codepoints were ever assigned in the upper part of that
range, and once Unicode was officially restricted to the range
U+0000..U+10FFFF, there was no longer any need for the five- and
six-byte sequences.
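
For what it's worth, here's a quick sketch of the byte lengths at each
boundary, assuming a Ruby where Array#pack("U") produces UTF-8 and
String#bytesize is available (1.9 or later):

    # UTF-8 byte length at each range boundary
    [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF].each do |cp|
      puts format("U+%04X -> %d byte(s)", cp, [cp].pack("U").bytesize)
    end
    # U+0000..U+007F   -> 1 byte
    # U+0080..U+07FF   -> 2 bytes
    # U+0800..U+FFFF   -> 3 bytes
    # U+10000..U+10FFFF -> 4 bytes; nothing in that range needs 5 or 6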

Compare RFC 2279 from 1998 (six bytes)
http://tools.ietf.org/html/2279
and RFC 3629 from 2003 (four bytes)
http://tools.ietf.org/html/3629

That Java encoding (UTF-8-encoded UTF-16) isn't really UTF-8, though,
so you'd never get eight bytes in valid UTF-8:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters. (RFC 3629)
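
To make that concrete, here's a rough Ruby sketch of real UTF-8 versus
the Java-style encoding. U+10400 is just an arbitrary codepoint above
the BMP, and the surrogate and byte arithmetic is written out by hand
rather than taken from any library:

    cp = 0x10400                    # arbitrary supplementary-plane codepoint
    puts [cp].pack("U").bytesize    # => 4 (one four-byte UTF-8 sequence)

    # Java-style "UTF-8-encoded UTF-16": split into a surrogate pair and
    # encode each surrogate as its own three-byte sequence
    hi = 0xD800 + ((cp - 0x10000) >> 10)
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
    bytes = [hi, lo].map { |s|
      [0xE0 | (s >> 12), 0x80 | ((s >> 6) & 0x3F), 0x80 | (s & 0x3F)].pack("C3")
    }.join
    puts bytes.bytesize             # => 6, and not valid UTF-8 per RFC 3629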

Paul.