Perry Smith wrote:
>> A general hint for debugging encoding troubles: the UTF-8 encoding
>> *guarantees* that every Unicode codepoint is *either* encoded into a
>> *single* octet with its most significant bit cleared to 0 (i.e. a
>> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>> between 128 and 255).
> Question: The sequence of 2 to 6 octets: is it always even?  i.e. 2, 4, 
> or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets,
not 6. (I was confused by the original scheme: the bit patterns
themselves extend to 6 octets, but because Unicode codepoints stop at
U+10FFFF, the current definition of UTF-8 (RFC 3629) guarantees that
there will never be more than 4.)

The encodings look like this:

0xxxxxxx                            for U+0000 to U+007F (ASCII)
110xxxxx 10xxxxxx                   for U+0080 to U+07FF
1110xxxx 10xxxxxx 10xxxxxx          for U+0800 to U+FFFF
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx for U+10000 to U+10FFFF
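To make the bit layout concrete, here is a rough Python sketch of the
packing (the function name is mine, purely for illustration; it skips
validation of surrogates and out-of-range values, so treat it as a
sketch, not a reference implementation):

  def utf8_encode(cp):
      # Pack one codepoint using the patterns above.
      if cp < 0x80:                  # 0xxxxxxx
          return bytes([cp])
      elif cp < 0x800:               # 110xxxxx 10xxxxxx
          return bytes([0xC0 | (cp >> 6),
                        0x80 | (cp & 0x3F)])
      elif cp < 0x10000:             # 1110xxxx 10xxxxxx 10xxxxxx
          return bytes([0xE0 | (cp >> 12),
                        0x80 | ((cp >> 6) & 0x3F),
                        0x80 | (cp & 0x3F)])
      else:                          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          return bytes([0xF0 | (cp >> 18),
                        0x80 | ((cp >> 12) & 0x3F),
                        0x80 | ((cp >> 6) & 0x3F),
                        0x80 | (cp & 0x3F)])

For example, utf8_encode(0x20AC) (the euro sign) gives E2 82 AC, the
same thing Python's own "\u20ac".encode("utf-8") produces.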

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or 
  not because of the high bit,
* you can always tell whether a byte in the sequence is the first one 
  or a later one, because the first one always starts with 11 and the 
  other ones always start with 10 and 
* you can always tell how long a sequence is by the number of 1 bits 
  in the start byte: two-byte sequences start with two 1s, three-byte 
  sequences start with three 1s and four-byte sequences start with 
  four 1s.
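
Those three rules boil down to a couple of bit tests on the first
byte. A rough sketch (again, the name is mine, and it does not reject
the handful of byte values that never appear in valid UTF-8):

  def utf8_byte_kind(b):
      # Classify one byte of a UTF-8 stream by its top bits.
      if b < 0x80:       # 0xxxxxxx: plain ASCII, a one-byte sequence
          return "ascii", 1
      elif b < 0xC0:     # 10xxxxxx: continuation byte, never starts a character
          return "continuation", 0
      elif b < 0xE0:     # 110xxxxx: start of a two-byte sequence
          return "start", 2
      elif b < 0xF0:     # 1110xxxx: start of a three-byte sequence
          return "start", 3
      else:              # 11110xxx: start of a four-byte sequence
          return "start", 4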

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example: just skip
ahead to the next byte that does not start with 10. You can also skip
over the continuation bytes when you are counting the length in
characters rather than bytes.
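
Counting characters is the same trick: only count the bytes that are
*not* continuation bytes. A rough sketch, same caveats as above:

  def utf8_char_count(data):
      # Count the bytes that start a sequence, i.e. everything
      # whose top two bits are not 10.
      return sum(1 for b in data if (b & 0xC0) != 0x80)

utf8_char_count("héllo €".encode("utf-8")) gives 7, even though the
encoded string is 10 bytes long.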

jwm