"Bret Jolly" <oinkoink+unet / rexx.com> wrote in message
news:7e7131a1.0208091559.7f59a71c / posting.google.com...
> Curt Sampson <cjs / cynic.net> wrote in message
news:<Pine.NEB.4.44.0208081139480.17422-100000 / angelic.cynic.net>...
> > Well, actually the point with UTF-16 is that you can, in general, safely
> > ignore the variable width stuff. I don't think you can do that so easily
> > in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications
> > that read it required to ignore that, as they are with surrogates in
> > UTF-16? Or is it likely that they will break, instead?
> >
>    UTF-8 is designed so that you always know if you are in the
> middle of a character (provided that you know you are reading UTF-8).

I have nothing against UTF-8, but it isn't really a strong point. When would
you ever break a UTF-8 string in this way, except to recover from a disk
crash or broken transmission?
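
Resynchronizing after such a break is at least cheap, since every
UTF-8 continuation byte matches the bit pattern 10xxxxxx. A rough
sketch in C (the helper names are my own invention):

    #include <stddef.h>

    /* Continuation bytes in UTF-8 always match 10xxxxxx. */
    static int is_continuation(unsigned char b)
    {
        return (b & 0xC0) == 0x80;
    }

    /* After a truncation or a bad seek, skip forward to the start
       of the next whole character. */
    static size_t resync_forward(const unsigned char *s,
                                 size_t i, size_t len)
    {
        while (i < len && is_continuation(s[i]))
            i++;
        return i;
    }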

You can't look up the 10th character, say "oops, I'm in the middle of a
character", and backtrack, because where did you get that index from in
the first place?

UTF-8 is only useful for forward or backward scanning applications (I
might just have answered my own question there). But these applications
are also very common, as is the case with regular expressions: an 8-bit
regular expression works well in many UTF-8 scenarios. And since input
from network or disk can be expected to be UTF-8, this is a strong
point for UTF-8.
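
Backward scanning falls out of the same property: to find the start of
the previous character you just step back over continuation bytes. A
sketch of mine, assuming well-formed UTF-8:

    #include <stddef.h>

    /* Step back from byte offset i to the start of the previous
       character (assumes i > 0 and valid UTF-8). */
    static size_t prev_char_start(const unsigned char *s, size_t i)
    {
        do {
            i--;
        } while (i > 0 && (s[i] & 0xC0) == 0x80);
        return i;
    }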

It is far less common to index strings by number of characters. But
when you do, UCS-4 is better. A typical application is text formatting,
where you want the nearest line break to a given width of, say, 80
characters. UCS-2 is probably not good enough, because combining
characters (or surrogates, or whatever) take up two code units but
only one display unit. But then you probably need to take proportional
spacing into account anyway.
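
The cost difference is easy to see: with UCS-4 the nth character is a
plain subscript, while with UTF-8 you have to count lead bytes from
the start. An illustrative sketch, nothing more:

    #include <stddef.h>
    #include <stdint.h>

    /* UCS-4: the nth character is a plain subscript, O(1). */
    static uint32_t nth_char_ucs4(const uint32_t *s, size_t n)
    {
        return s[n];
    }

    /* UTF-8: scan for the nth lead byte (anything that is not
       10xxxxxx), O(n); returns len if the string is too short. */
    static size_t nth_char_offset_utf8(const unsigned char *s,
                                       size_t len, size_t n)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if ((s[i] & 0xC0) != 0x80 && n-- == 0)
                return i;
        return len;
    }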

One reason for character-indexed lookup is to store the position in a
string as an integer, once you have scanned to that position. But you
can represent the position by other means (objects internally pointing
to the location). Therefore, what is important is good iterators and
position objects; it doesn't really matter what the internal
representation is.
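
Something like this, say (a hypothetical design of my own, not
anything matz has proposed): the object carries the byte offset
internally, and the user only ever sees characters.

    #include <stddef.h>

    /* A position object: remembers where a scan got to, so the
       caller never touches raw byte indices. */
    typedef struct {
        const unsigned char *str;  /* NUL-terminated UTF-8 string */
        size_t byte_off;           /* internal: byte offset       */
        size_t char_idx;           /* what the user sees          */
    } StrPos;

    /* Advance one character: the byte offset moves by 1..4 bytes,
       the character index always by exactly 1. (Sketch; don't call
       it once you are already at the terminating NUL.) */
    static void strpos_next(StrPos *p)
    {
        p->byte_off++;
        while ((p->str[p->byte_off] & 0xC0) == 0x80)
            p->byte_off++;
        p->char_idx++;
    }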

I would certainly like to be able to handle UTF-8 without spending time
on needless conversions. Yet I believe a fixed-width format is also a
requirement, so a UTF-8 string and a fixed-width string are both
relevant. The fixed-width format can store whatever coding it likes (as
matz plans to); I see no reason to limit it to Unicode. For all I care,
someone could be storing Martian genetic sequences or file block
allocation bitmaps in it.
I just need to be able to query its encoding at all times and be able
to ensure that I don't accidentally mix my Korean homework assignment
with my disk defragmentation data.
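
Concretely, I imagine something like a tagged string: opaque bytes plus
an encoding tag that you check before mixing two strings. A sketch (all
names are mine):

    #include <stddef.h>

    /* An encoding-tagged string: opaque bytes plus a tag saying
       how to interpret them. */
    typedef enum { ENC_UTF8, ENC_UCS4, ENC_BINARY } Encoding;

    typedef struct {
        Encoding enc;   /* queryable at all times        */
        size_t   len;   /* in code units of the encoding */
        void    *data;
    } TaggedString;

    /* Refuse to mix homework with defragmentation data. */
    static int can_concat(const TaggedString *a, const TaggedString *b)
    {
        return a->enc == b->enc;
    }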

It does, however, appear that Unix has chosen UCS-4 where Microsoft
started out with UCS-2. As such, it may be more future-proof to choose
the bloated 32-bit version, which would also solve my 80-character
width formatting problem for the foreseeable future.

So I propose UCS-4 with the encoding stored (this is also important for
serialization), and UTF-8 for the more daily household string handling.
If we really want to beef things up, we should also have a UCS-2 string
in BSTR-compatible format, as this is the format you use to communicate
with the Windows API and with COM objects. A BSTR stores a 32-bit
length (in bytes) immediately before the actual 16-bit wide characters,
which are terminated by a null character to be compatible with C's
wchar_t strings. The format is not good for editing, but it can
instantly be passed as an argument to C and Windows APIs.
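
The layout is easy to mimic in plain C, by the way. This is only a
sketch of the memory layout, not the real SysAllocString:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* BSTR-style layout: a 32-bit length in bytes, then the 16-bit
       characters, then a terminating NUL. The pointer handed out
       points at the characters, not at the length prefix. */
    static uint16_t *make_bstr_like(const uint16_t *src, uint32_t nchars)
    {
        unsigned char *mem = malloc(4 + ((size_t)nchars + 1) * 2);
        uint16_t *chars;
        if (mem == NULL)
            return NULL;
        chars = (uint16_t *)(mem + 4);
        *(uint32_t *)mem = nchars * 2;        /* byte count, excl. NUL */
        memcpy(chars, src, (size_t)nchars * 2);
        chars[nchars] = 0;                    /* wchar_t-compatible end */
        return chars;
    }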

I hope this point of view is from a more practical perspective than the
arguments about which format does the most damage to some minorities.
There are three widely deployed real-life encodings in APIs now.

Mikkel