Perry Smith wrote:
>> A general hint for debugging encoding troubles: the UTF-8 encoding
>> *guarantees* that every Unicode codepoint is *either* encoded into a
>> *single* octet with its most significant bit cleared to 0 (i.e. a
>> decimal value between 0 and 127) *or* into a *sequence* of 2 to 6
>> octets, *all* of which have their MSB set to 1 (i.e. a decimal value
>> between 128 and 255).

> Question: The sequence of 2 to 6 octets: is it always even? i.e. 2, 4,
> or 6 but not 3 nor 5 octets?

Nope.

First off: I was wrong, the longest encoding is actually 4 octets, not
6. (I was confused by the algorithm: the encoding scheme itself allows
for longer sequences, and the original specification permitted up to 6
octets, but because of the way Unicode code points are allocated and
UTF-8 is now defined, it is guaranteed that there will never be more
than 4.)

The encodings look like this:

    0xxxxxxx                             for ASCII (U+0000 to U+007F)
    110xxxxx 10xxxxxx                    for U+0080 to U+07FF
    1110xxxx 10xxxxxx 10xxxxxx           for U+0800 to U+FFFF
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  for U+10000 to U+1FFFFF

This is actually pretty clever:

* you can always tell whether you are inside a multibyte sequence or
  not because of the high bit,

* you can always tell whether a byte in the sequence is the first one
  or a later one, because the first one always starts with 11 and the
  other ones always start with 10, and

* you can always tell how long a sequence is by the number of 1 bits
  in the start byte: two-byte sequences start with two 1s, three-byte
  sequences start with three 1s and four-byte sequences start with
  four 1s.

This means that you can usually re-synchronize pretty easily from the
middle of a corrupted network transmission, for example. You can also
skip over the continuation bytes when you are counting the length in
characters.

jwm
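
P.S. If it helps to see the byte classes concretely, here is a rough
sketch in Python (the helper names are mine, purely for illustration).
It classifies octets according to the table above, counts characters by
skipping the 10xxxxxx continuation bytes, and shows the
re-synchronization trick:

    # Rough illustration of the UTF-8 byte classes described above.
    # (Helper names are illustrative, not from any standard library.)

    def byte_class(b):
        """Classify a single octet of a UTF-8 stream."""
        if b < 0x80:
            return "ascii"         # 0xxxxxxx: a complete 1-byte character
        if b < 0xC0:
            return "continuation"  # 10xxxxxx: never the start of a character
        if b < 0xE0:
            return "lead-2"        # 110xxxxx: starts a 2-byte sequence
        if b < 0xF0:
            return "lead-3"        # 1110xxxx: starts a 3-byte sequence
        if b < 0xF8:
            return "lead-4"        # 11110xxx: starts a 4-byte sequence
        return "invalid"           # 0xF8..0xFF never start a character here

    def char_length(data):
        """Count characters by skipping the 10xxxxxx continuation bytes."""
        return sum(1 for b in data if b & 0xC0 != 0x80)

    def resync(data, pos):
        """From an arbitrary (possibly mid-character) offset, advance to
        the start of the next character: the re-synchronization trick."""
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    # One character from each row of the table: U+0061, U+00E9, U+6F22, U+1D11E
    s = "a\u00e9\u6f22\U0001d11e".encode("utf-8")
    print([f"{b:08b}" for b in s])  # the bit patterns from the table above
    print(char_length(s))           # -> 4 characters in 10 bytes
    print(resync(s, 2))             # -> 3, i.e. out of the middle of U+00E9

The test b & 0xC0 != 0x80 just means "this byte does not start with
10", which is all that the length counting and the re-synchronization
need.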