Curt Sampson <cjs / cynic.net> wrote in message news:<Pine.NEB.4.44.0208081139480.17422-100000 / angelic.cynic.net>...
> Well, actually the point with UTF-16 is that you can, in general, safely
> ignore the variable width stuff. I don't think you can do that so easily
> in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
> that read it required to ignore that, as they are with surrogates in
> UTF-16? Or is it likely that they will break, instead?
> 
   UTF-8 is designed so that you always know whether you are in the
middle of a character (provided that you know you are reading UTF-8).
That is, if you break a string of bytes in the middle of a character,
the resulting byte sequence will not be valid UTF-8.  The mapping
from Unicode code points to byte sequences goes like this (in
hexadecimal):

Unicode code point        UTF-8 byte sequence
00..7F                    (00..7F)                            (1 byte: ASCII)
80..7FF                   (C2..DF) (80..BF)                   (2 bytes)
800..FFF                  (E0) (A0..BF) (80..BF)              (3 bytes)
1000..FFFF                (E1..EF) (80..BF) (80..BF)          (3 bytes)
10000..3FFFF              (F0) (90..BF) (80..BF) (80..BF)     (4 bytes)
40000..FFFFF              (F1..F3) (80..BF) (80..BF) (80..BF) (4 bytes)
100000..10FFFF            (F4) (80..8F) (80..BF) (80..BF)     (4 bytes)

(One refinement: the surrogate code points D800..DFFF are excluded, so
in the fourth row a lead byte of ED may only be followed by a second
byte in 80..9F.)
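
Here is the same table written out as a little Ruby predicate, purely
as a sketch (the names are mine, not anything from a library):

    # Continuation bytes are always in 80..BF.
    CONT = 0x80..0xBF

    # For each possible lead byte: the allowed range of the second
    # byte, and the total length of the character.
    def utf8_lead(b)
      case b
      when 0x00..0x7F then [nil,        1]  # ASCII, no second byte
      when 0xC2..0xDF then [CONT,       2]
      when 0xE0       then [0xA0..0xBF, 3]  # rules out overlong forms
      when 0xED       then [0x80..0x9F, 3]  # rules out surrogates D800..DFFF
      when 0xE1..0xEF then [CONT,       3]  # the remaining 3-byte leads
      when 0xF0       then [0x90..0xBF, 4]  # rules out overlong forms
      when 0xF1..0xF3 then [CONT,       4]
      when 0xF4       then [0x80..0x8F, 4]  # stays at or below U+10FFFF
      else nil                              # 80..BF, C0, C1, F5..FF never lead
      end
    end

    # Does this array of byte values hold exactly one complete,
    # well-formed UTF-8 character?
    def complete_utf8_char?(bytes)
      second, length = utf8_lead(bytes[0])
      return false if length.nil? || bytes.length != length
      return true  if length == 1
      second === bytes[1] && bytes[2..-1].all? { |b| CONT === b }
    end

    complete_utf8_char?([0x41])   # => true   (the ASCII letter "A")
    complete_utf8_char?([0xC3])   # => false  (a lead byte left dangling)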

Suppose, for example, that you truncate the character encoded as
F1 87 B0 B1, losing the last byte and getting F1 87 B0.  If this were
a complete character it would be a 3-byte one, so it would have to
match the third or fourth row of the table above.  But no such
character can start with the byte F1; that byte can only begin a
4-byte character, so a decoder knows at once that the sequence has
been cut short.
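
Written out concretely in Ruby (the array is just the truncated byte
sequence typed in by hand):

    truncated = [0xF1, 0x87, 0xB0]
    # A complete 3-byte character must begin with E0..EF (rows 3 and 4):
    (0xE0..0xEF).include?(truncated.first)   # => false
    # and F1 can only begin a 4-byte character (the F1..F3 row), so a
    # decoder can tell immediately that a byte has been lost.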

Unicode is no longer something that can be squeezed into two bytes,
even for practical purposes.  There are over 40,000 CJK characters
outside the "BMP" (the Basic Multilingual Plane) that require
surrogate pairs in UTF-16 (the arithmetic is sketched below).
Mathematical alphanumeric symbols and musical symbols also live
outside the BMP.  A lot of growth is still necessary if Unicode is to
fulfill its mission.  For example, the scandalous situation in which
many Chinese and Japanese cannot write their names in Unicode will
have to be fixed eventually, and it will be fixed outside the BMP.
More technical notation (such as Fregean notation in logic, for which
I personally feel a need) will have to be introduced, and it won't be
in the BMP.  Certain mistakes in Unicode, such as the bungled
treatment of IPA, will have to be fixed, and they will be fixed
outside the BMP.  It is clear that some of the "unification" that has
occurred was driven mainly by an unrealistic desire to cram all the
world's characters into two bytes.  The misguided unifications will
certainly be rectified outside the BMP.
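
Since surrogate pairs keep coming up, here is the arithmetic UTF-16
has to do for every character outside the BMP, sketched in Ruby.
U+20000, the first ideograph of CJK Extension B, is just a convenient
example; the variable names are mine.

    cp   = 0x20000               # an ideograph from CJK Extension B
    v    = cp - 0x10000          # the 20 bits left after removing the BMP
    high = 0xD800 + (v >> 10)    # leading (high) surrogate   => 0xD840
    low  = 0xDC00 + (v & 0x3FF)  # trailing (low) surrogate   => 0xDC00
    # So U+20000 is written as the code unit pair D840 DC00 in UTF-16.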

But UTF-16 was a mistake from the beginning.  It is no longer
fixed-width, and in practice it is sure to become even less so, so it
lacks that merit.  Yet its code unit is just wide enough to introduce
an endianness nightmare.  The UTF-16 folks try to fix this with a
kluge, the byte-order mark, but the kluge is an abomination.  It is
non-local, and hence screws up string processing.  It breaks Unix's
critical shebang hack.  No wonder Microsoft loves it!  It disrupts
life on Unix and life on big-endian machines.
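
To make the endianness point concrete, here is the letter "A"
(U+0041) as UTF-16 in both byte orders, with the byte-order mark
U+FEFF in front.  Nothing here beyond Array#pack, which writes 16-bit
values big-endian with "n" and little-endian with "v":

    be = [0xFEFF, 0x0041].pack("n*")   # big-endian:    FE FF 00 41
    le = [0xFEFF, 0x0041].pack("v*")   # little-endian: FF FE 41 00
    be.unpack("C*").map { |b| "%02X" % b }.join(" ")  # => "FE FF 00 41"
    le.unpack("C*").map { |b| "%02X" % b }.join(" ")  # => "FF FE 41 00"
    # A reader has to sniff those first two bytes before it can make
    # sense of anything else in the stream, and a script that begins
    # with a BOM no longer starts with the two bytes "#!", which is
    # what kills the shebang hack.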

All things considered, the Unicode people have done a wonderful job.
But the job isn't done yet, and maybe Unicode will never be right for
everybody, so I think Ruby should support other character sets as
well, including some that are not compatible with Unicode.
    
                                 Regards, Bret

http://www.rexx.com/~oinkoink
oinkoink at rexx dot DON'T_SPAM_ME_PLEASE com