On Thu, Nov 25, 2010 at 1:37 PM, Phillip Gawlowski
<cmdjackryan / googlemail.com> wrote:
> On Thu, Nov 25, 2010 at 12:56 PM, Robert Klemme
> <shortcutter / googlemail.com> wrote:
>>
>>> Since UTF-8 is a subset of UTF-16, which in turn is a subset of
>>> UTF-32,
>>
>> I tried to find a more precise statement about this but did not really
>> succeed.  I thought all UTF-x were just different encoding forms of
>> the same universe of code points.
>
> It's an implicit feature, rather than an explicit one:
> Western languages get the first 8 bits for encoding. Glyphs going
> beyond the Latin alphabet get the next 8 bits. If that isn't enough, an
> additional 16 bits are used for encoding purposes.

Which bits are you talking about here: bits of the code points or bits
of the encoding?  It seems you mean bits of the code points.  How those
are packed into any UTF-x encoding is a different story.  And because
UTF-8 has multibyte sequences, it's not immediately clear that UTF-8
can hold only a subset of what UTF-16 can hold.

> Thus, UTF-8 is a subset of UTF-16 is a subset of UTF-32. Thus, also,
> the future-proofing, in case even more glyphs are needed.

Quoting from http://tools.ietf.org/html/rfc3629#section-3

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
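
The table can be checked directly in Ruby. This is only a minimal sketch: the helper name utf8_bits and the four sample code points (one per row) are my own choices for illustration, not part of the RFC.

```ruby
# Encode a code point as UTF-8 and return each octet as an 8-digit
# binary string, so the lead bits can be compared with the table.
# utf8_bits is a hypothetical helper name used only in this sketch.
def utf8_bits(code_point)
  [code_point].pack("U").bytes.map { |b| format("%08b", b) }
end

# One sample code point per row of the table (assumed examples).
[0x61, 0xE9, 0x20AC, 0x1F600].each do |cp|
  puts format("U+%06X -> %s", cp, utf8_bits(cp).join(" "))
end
# U+000061 -> 01100001
# U+0000E9 -> 11000011 10101001
# U+0020AC -> 11100010 10000010 10101100
# U+01F600 -> 11110000 10011111 10011000 10000000
```

The leading bits of each octet match the 0, 110, 1110 and 11110 prefixes in the table, and every continuation octet starts with 10.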

So the number of bits available for the code point in each form is

1 octet:              7 bits
2 octets: 6 + 5     = 11 bits
3 octets: 2 * 6 + 4 = 16 bits
4 octets: 3 * 6 + 3 = 21 bits

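This makes 2164864 (0x210880) possible bit patterns in UTF-8, though
only 2097152 (0x200000) distinct values, because every shorter pattern
can also be written in the 21-bit form.  And the pattern can be
extended.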
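
The arithmetic behind that figure, as a small sketch.  The constant and method names are made up for this example.  It assumes that patterns are counted per sequence length, which counts a value once for every form it fits into.

```ruby
# Payload bits per sequence length, read off the x's in the RFC 3629
# table.  PAYLOAD_BITS is a hypothetical name for this sketch.
PAYLOAD_BITS = [7, 11, 16, 21]

# Total bit patterns across all four forms.
def total_patterns
  PAYLOAD_BITS.inject(0) { |sum, bits| sum + 2**bits }
end

# Distinct values: every shorter pattern also fits into the 21-bit
# form, so the largest form alone bounds the number of different values.
def distinct_values
  2**PAYLOAD_BITS.max
end

puts total_patterns                  # 2164864
puts format("0x%X", total_patterns)  # 0x210880
puts distinct_values                 # 2097152
```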

Looking at http://tools.ietf.org/html/rfc2781#section-2.1 we see that
UTF-16 (at least this version) supports code points up to 0x10FFFF.
This is less than what UTF-8 can hold theoretically.

Coincidentally, 0x10FFFF needs 21 bits, which is exactly what fits into
a four-octet UTF-8 sequence.
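
A quick way to check this is to encode the highest code point both ways.  This is a sketch with hypothetical helper names.  UTF-16BE is chosen here only so that no byte order mark shows up in the output.

```ruby
MAX_CODE_POINT = 0x10FFFF

# Number of binary digits needed to write n.
def bits_needed(n)
  n.to_s(2).length
end

# Octets of the UTF-8 encoding of a code point.
def utf8_octets(code_point)
  [code_point].pack("U").bytes.to_a
end

# 16-bit code units of the UTF-16 (big endian) encoding of a code point.
def utf16_units(code_point)
  [code_point].pack("U").encode("UTF-16BE").unpack("n*")
end

puts bits_needed(MAX_CODE_POINT)  # 21
puts utf8_octets(MAX_CODE_POINT).map { |b| format("%02X", b) }.join(" ")
# F4 8F BF BF
puts utf16_units(MAX_CODE_POINT).map { |u| format("%04X", u) }.join(" ")
# DBFF DFFF
```

Both encodings reach the same top code point: UTF-8 with four octets, UTF-16 with a pair of 16-bit units.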

I remain unconvinced that UTF-8 can handle only a subset of the code
points that UTF-16 can handle.

I also remain unconvinced that UTF-8 encodings are a subset of UTF-16
encodings.  This cannot be true because in UTF-8 the encoding unit is
one octet, while in UTF-16 it's two octets.  As a practical example,
the string "a" is 1 octet long in UTF-8 (because it happens to be an
ASCII character) and 2 octets long in UTF-16.
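
The same example in Ruby, as a small sketch.  The helper name octets is made up, and UTF-16BE is an assumption chosen so that the result contains no byte order mark.

```ruby
# Return the octets of a string after transcoding it to the given
# encoding.  octets is a hypothetical helper for this sketch.
def octets(string, encoding)
  string.encode(encoding).bytes.to_a
end

p octets("a", "UTF-8")     # [97]
p octets("a", "UTF-16BE")  # [0, 97]
```

The two octet sequences differ even for plain ASCII, so neither encoding's output is a subset of the other's.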

"All standard UCS encoding forms except UTF-8 have an encoding unit
larger than one octet, [...]"
http://tools.ietf.org/html/rfc3629#section-1

>>> (at least, ISO learned from the
>>> mess created in the 1950s to 1960s) so that new glyphs won't ever
>>> collide with existing glyphs, my point still stands. ;)
>>
>> Well, I support your point anyway.  That was just meant as a caveat
>> so people are watchful (and test rather than believe). :-)  But as I
>> think about it it more likely was a statement about Java's
>> implementation (because a char has only 16 bits which is not
>> sufficient for all Unicode code points).
>
> Of course, test your assumptions. But first, you need an assumption to
> start from. ;)

:-)

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/