On Fri, 2 Aug 2002, MikkelFJ wrote:

> > Yeah. This is getting into complex nightmare city. That's why I'd prefer
> > to have the basic system just work completely in Unicode. One could have
> > a separate character system (character and string classes, byte-stream
> > to char converters, etc.) to work with this tagged format if one wished.
>
> But isn't this what matz suggest?
> Each stream is tagged, that is the same as having different types. It's
> basically just a different way to store the type while having a lot of
> common string operations.

No, because then you have to deal with conversions. Most popular
character sets are convertible to Unicode and back without loss. That
is not true of an arbitrary pair of character sets, though, even if
you go through Unicode.

The reason for this is as follows. Say character set Foo has split a
Han character that Unicode unifies, so that it has both "a" and a
variant "A". When converting to Unicode, that "A" will be preserved
because it's assigned a code point in a compatibility area, and when
you convert back from Unicode, that "A" will be translated to "A" in
Foo. However, if character set Bar does not have "A", just "a", the
"A" will be converted to "a". When you go from Bar back to Unicode,
you end up with "a" again because there's no way to tell that it was
originally "A" when you converted out.

But there's an even better reason than this for converting to Unicode
on input, rather than doing internal tagging. If you don't have
conversion tables for a particular character encoding, it's much
better to find out at the time you try to get the information into
the system than at some arbitrary later point when you try to do a
conversion. That way you know where the problem information is coming
from.

In terms of interface, I would say:

1. Continue to use String as it is for "binary" data. This is
   efficient, if you don't need to do much processing.

2. Add a UString or similar for dealing with UTF-16 data.
   There's no need for surrogate support in this, for reasons I will
   get into below, so this is straight fixed width. Reasonably
   efficient (almost maximally efficient for those of us using Asian
   languages :-)) and very easy to use.

3. Add other, specialized classes when you need to do special-purpose
   things. No need for this in the standard distribution.

> BTW: Unicode is not a fixed with format.

In terms of code values, it is fixed width. However, some characters
are represented by pairs of code values.

> ...but there are escape codes...

No, there are no escape codes. The high and low code values for
surrogate characters have their own special areas, and so are easily
identifiable.

> and options for future extensions.

Not that I know of. Can you explain what these are?

> Hence UCS-4 is a strategy with limited timespan.

Not at all, unless they decide to change Unicode to the point where
it no longer uses 16-bit code values, or add escape codes, or
something like that. That would be backward-incompatible, severely
complicate processing, and generally cause all hell to break loose.
So I'd rate this as "Not Likely."

Here are a few points to keep in mind about Unicode processing:

1. The surrogate pairs are almost never used. Two years ago there
   weren't even any characters assigned to those code points.

2. There are many situations where, even if surrogate pairs are
   present, you don't know or care, and need do nothing to correctly
   deal with them.

3. Broken surrogate pairs are not a problem; the standard says you
   must be able to ignore broken pairs, if you interpret surrogate
   pairs at all.

3. The surrogate pairs are extremely easy to distinguish, even if you
   don't interpret them.

4. The code for dealing with surrogate pairs well (basically, not
   breaking them) is very simple.

The implication of point 1 is that one should not spend a lot of
effort dealing with surrogate pairs, as very few users will ever use
them.
Very few Asian users will ever use them in their lifetimes, in fact.

The implication of points 2 and 3 is that not everything that deals
with Unicode has to deal with, or even know about, surrogate pairs.
If you are writing a web application, for example, you typically just
take each field as a whole from the web browser or database, and give
it as a whole to the web browser or database. Thus only the web
browser really has any need at all to deal with surrogate pairs.

If you take a substring of a string and in the process end up with
half of a surrogate pair on either end, that's no problem. It just
gets ignored by whatever facilities deal with surrogate pairs, or
treated as an unknown character by those that don't (rather than two
unknown characters for an unsplit surrogate pair).

The only time you really run into a problem is if you insert
something into a string; there's a chance you might split the
surrogate pair, and lose the character. This is pretty uncommon
except in interactive input situations, though, where you either know
how to handle surrogate pairs and can avoid doing this, or you don't
know and the user can't see the characters anyway.

Well, another area you can run into problems with is line wrapping,
but there's no single algorithm for that anyway, and plenty of
algorithms break on languages for which they were not designed. So
there you should add some very simple code that avoids splitting
surrogate pairs. (This code is much simpler than the line wrapping
code anyway, so it's hardly a burden.)

That shows the advantage of the last two points in the list above
(essentially the same point): surrogate pairs are easy to spot, and
the code to avoid breaking them is simple.

So I propose just what the Unicode standard itself proposes in
section 5.4: UString (or whatever we call it) should have the
Surrogate Support Level "none"; i.e., it completely ignores the
existence of surrogate pairs.
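
To give a rough idea of how little code "not breaking them" takes,
here is a minimal sketch in Ruby. It assumes a UString can be
modelled as a plain Array of 16-bit code values; the helper names
(high_surrogate?, low_surrogate?, splits_pair?, safe_index) are made
up for this example and are not part of Ruby or of any actual
proposal.

```ruby
# Hypothetical helpers: a "UString" is modelled as an Array of 16-bit
# code values. None of these names exist in Ruby itself.

HIGH_SURROGATES = (0xD800..0xDBFF) # first half of a pair
LOW_SURROGATES  = (0xDC00..0xDFFF) # second half of a pair

def high_surrogate?(unit)
  HIGH_SURROGATES.cover?(unit)
end

def low_surrogate?(unit)
  LOW_SURROGATES.cover?(unit)
end

# True if cutting or inserting at +index+ would separate a high
# surrogate from the low surrogate that immediately follows it.
def splits_pair?(units, index)
  index > 0 && index < units.length &&
    high_surrogate?(units[index - 1]) && low_surrogate?(units[index])
end

# Move an insertion or wrap point back by one code value when it
# would otherwise land in the middle of a pair.
def safe_index(units, index)
  splits_pair?(units, index) ? index - 1 : index
end

# "a", then U+20000 encoded as the pair D840 DC00, then "b"
units = [0x0061, 0xD840, 0xDC00, 0x0062]
safe_index(units, 2) # => 1 (index 2 is inside the pair)
safe_index(units, 3) # => 3 (already on a boundary)
```

A line wrapper or an interactive editor would call something like
safe_index before cutting or inserting; code that only passes whole
strings around never needs it.
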

Things that use UString that have the potential to encounter
surrogate pair problems or wish to interpret them can add simple or
complex code, as they need, to deal with the problem at hand. (Many
users of UString will need to do nothing.)

Note that there's a big difference between this and your UTF-8
proposal: ignoring multibyte stuff in UTF-8 is going to cause much,
much more lossage because there's a much, much bigger chance of
breaking things when using Asian languages. With UTF-16, you probably
won't even encounter surrogates, whereas with Japanese in UTF-8,
pretty much every character is multibyte.

cjs
-- 
Curt Sampson  <cjs / cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC