Wrote Martin Elzen <martinelzen / hotmail.com>, on Tue, May 04, 2004 at 03:27:56AM +0900:
> Hi everyone.
>
> >> |So then what's Unicode for in the first place? I thought the aim was
> >> |to have a universal encoding for all chars. Did I miss something?
> >
> >> It's _their_ intention. Whether it succeeds or not is another story.
> >> I think they tried their best, but it is virtually impossible to
> >> satisfy all requirements for internationalization.
> >
> >Why is that? Is there not enough room for every character known to
> >man, or is there some other problem?
>
> I've been reading up on Unicode recently, and it strikes me as a standard
> that is:
> a) awesome, if you only consider what it could help you do
> b) a *MAJOR* headache for those who need to implement it
>
> Each Unicode 'character' can be represented in a 21 bit integer, so
> representing them with a 32 bit wide integer would be no problem (that is,
> if you're looking for the *easiest* way to store them in memory, and don't
> care too much about how much memory it would cost).

UTF-8 is another good way to store them. It costs more space the more
you deviate from ASCII, but it is a pretty good storage format
nonetheless.

It also gets you away from the classic problem of storing Unicode as
integers wider than 8 bits: which byte order do you use? UTF-8 has no
endianness.

Of course, there is a Unicode byte order mark (BOM) that says what the
order is, but it falls into the category of "stuff that makes strings
hard to compare equal, even though they are", discussed below.

> The hard part is everything that comes after that. Only a relatively few
> ranges of Unicode characters can be capitalized, for example. The ASCII
> subset is easy to (un)capitalise, just add (or subtract) a constant and
> there you go. But for other Unicode ranges the capitalised and
> non-capitalised forms of each character are right next to each other.
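
To make that contrast concrete, here is a minimal Python sketch. The
helper `ascii_upper` is a made-up name for illustration, not something
from a library. It shows the constant-offset trick for ASCII, an
adjacent upper/lower pair from Latin Extended-A, and a case mapping
that is not even one-to-one.

```python
def ascii_upper(ch: str) -> str:
    """Uppercase one ASCII letter by subtracting the constant 0x20.

    Hypothetical helper for illustration; anything outside a-z is
    returned unchanged.
    """
    if "a" <= ch <= "z":
        return chr(ord(ch) - 0x20)
    return ch


# ASCII: upper and lower case are a fixed 32 code points apart.
print(ord("a") - ord("A"))

# Latin Extended-A: U+0100 (capital A with macron) and U+0101 (small a
# with macron) sit right next to each other instead.
print(ord("\u0101") - ord("\u0100"))

# The constant-offset trick leaves the non-ASCII letter alone, while
# Python's table-driven str.upper() maps it to its capital form.
print(ascii_upper("\u0101") == "\u0101")
print("\u0101".upper() == "\u0100")

# Some mappings change the length of the string: German sharp s
# (U+00DF) uppercases to the two letters "SS".
print("\u00df".upper())
```
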
> Although in my understanding the Japanese Hiragana 'characters' don't
> actually have a non-capitalised form, they do have a large and a small
> form in Unicode. And those small and normal size forms are next to each
> other. Not to mention the issue of normal width and half width Asian
> 'characters'... Coming up with a generalised to_upper_case / to_lower_case
> function is a problem.

For Japanese, at least, I don't believe you would ever do the operation.
There may be two forms, but they are pronounced differently. You would
mangle the sentence. It would be like changing a's to o's: it might be
recognizable, but...

> A bigger problem is how Unicode lets you compose characters. I'm not
> saying such a character exists, but what if there was a character
> consisting of a lower case a with an umlaut and a '^' character. Unicode
> would let you define that character by saying it's a lowercase a with_a
> umlaut with_a '^'. But it would also let you compose it by saying it's a
> lowercase a with_a '^' with_a umlaut. And here's where it gets fun: if you
> have 2 strings containing the same character, no matter which order you
> used to compose the character, it has to compare as equal if the result of
> all the composing elements is the same. And, mind you, this is a
> relatively simple example!

I don't think this problem is unique to Unicode. Any system that doesn't
allow combining characters is equivalent to Unicode with an agreement
not to use them, and you could make that agreement.

Another approach is to define a distinguished encoding. In your example
above, for a network protocol, say, where you wanted to memcmp strings
and know they are the same without processing the combining characters,
you could specify that the "with_a X" pairs are ordered by increasing
codepoint of X.
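
A minimal Python sketch of the distinguished-encoding idea, with the
assumptions flagged. Rather than the order-by-codepoint rule suggested
above, it leans on Unicode's own normalization (NFC, via the standard
`unicodedata` module), which orders combining marks by their canonical
combining class. The helper name `to_wire` is hypothetical, and
stripping a leading BOM is my assumption about what such a protocol
would want. The two marks used here (diaeresis and dot below) have
different combining classes; marks of the same class are not reordered
by normalization.

```python
import unicodedata

BOM = "\ufeff"  # U+FEFF, used as a byte order mark at the start of a stream


def to_wire(text: str) -> bytes:
    """Return one distinguished UTF-8 encoding of text.

    Hypothetical helper: drops a leading BOM (an assumption, see the
    note above), normalizes to NFC, then encodes as UTF-8.
    """
    if text.startswith(BOM):
        text = text[1:]
    return unicodedata.normalize("NFC", text).encode("utf-8")


# 'a' + diaeresis + dot below, and the same two marks in the other order.
s1 = "a\u0308\u0323"
s2 = "a\u0323\u0308"

# The raw code point sequences differ...
print(s1 == s2)
# ...but the distinguished encodings can be compared byte for byte.
print(to_wire(s1) == to_wire(s2))
```

If both ends of a protocol only ever send the output of something like
`to_wire`, a plain byte comparison (memcmp in C) is enough.
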
Anyhow, as far as I can see, Unicode either has the same problems as
any other character set with the same capabilities, or else (if it's
important for a particular application/protocol) you can agree to use a
subset of Unicode and get the same results as some other "simpler"
character set.

For example, using a bunch of different character sets isn't going to
make the job of capitalization easier. You're still left with the same
problem: how to capitalize a language you know nothing about (including
whether it's even possible to do).

Anyhow, for programming, I'm a huge fan of UTF-8. It allows you not to
have to worry so much. A Unicode module with functions to compare
strings in a way that ignores BOMs, considers combining chars, etc.
would be useful. I think a canonical form of UTF-8 will become more and
more common in IETF protocols.

> Given all of the complexity of Unicode, working with the ASCII character
> set seems delightfully simple!

Indeed, unless you happen to speak another language!

Cheers,
Sam

-- 
Sam Roberts <sroberts / certicom.com>