"Bret Jolly" <oinkoink+unet / rexx.com> wrote in message news:7e7131a1.0208091559.7f59a71c / posting.google.com... > Curt Sampson <cjs / cynic.net> wrote in message news:<Pine.NEB.4.44.0208081139480.17422-100000 / angelic.cynic.net>... > > Well, actually the point with UTF-16 is that you can, in general, safely > > ignore the variable width stuff. I don't think you can do that so easily > > in UTF-8. If I chop off a UTF-8 sequence in the middle, are applications > > that read it required to ignore that, as they are with surrogates in > > UTF-16? Or is it likely that they will break, instead? > > > UTF-8 is designed so that you always know if you are in the > middle of a character (provided that you know you are reading UTF-8). I have nothing against UTF-8, but it isn't really a strong point. When would you ever break a UTF-8 string in this way, except to recover from a disk crash or broken transmission? You can't look up the 10'th character and say, oups I'm in the middle of a character, so I must backtrack, because where did you get that index from in the first place. UTF-8 is only useful for forward or backward (I might just have answered my own question there) scanning applications. But these applications are also very common, as this is the case with regular expressions. An 8 bit regular expression works well in many UTF-8 scenarios. And since input from network or disk can be expected to be UTF-8, this is a strong point for UTF-8. It is far less common to index strings by number of characters. But when you do, UCS-4 is better. A typical application is text formatting where you wan't the nearest linebreak to the given width of say 80 characters. UCS-2 is probably not good enough because combining (or surrogates or whatever) will only take up one display unit. But then you probably need to take proportional spacing into account anyway. One reason for character indexed lookup is to store the position in a string as an integer, once you have scanned to that position. But you can represent the position by other means (objects internally pointing the location). Therefore, what is important is good iterators and position objects, It doesn't really matter how the internal representation is. I would certainly like to be able to handle UTF-8 without spending time on needless conversions. Yet I believe a fixed width format is also a requirement. I therefore believe a UTF-8 string and a fixed width string are both relevant. The fixed width format can store whatever coding it likes (as matz plans to), I see no reason to limit it to Unicode. For all I care, someone could be storing Marsian genetic sequences or file block allocation bitmaps in it. I just need to be able to query its encoding at all times and be able to ensure that I don't accidentially mix my Korean homework assignment with my list disk defragmentation data. It does, however, appear that Unix have chosen UCS-4 where Microsoft started out with UCS-2. As such it may be more future proof to choose the bloated 32bit version, which also would solve my 80 character width formatting problem in the foreseable future. So I propose UCS-4 with encoding stored (this is also important for serialization), and UTF-8 for the more daily household string handling. If we really want to beef up things, we should also have a UCS-2 in BSTR compatible format as this is the format you use to communicate to the Windows API and to COM objects. 
I hope this point of view comes from a more practical perspective than
arguing about which format does the most damage to which minorities. There
are three widely deployed real-life encodings in APIs right now (UTF-8,
UCS-2 and UCS-4).

Mikkel