On 3/11/06, Michal Suchanek <hramrach / centrum.cz> wrote:
> On 3/11/06, Austin Ziegler <halostatue / gmail.com> wrote:
>> UTF-16 is actually pretty performant and the implementation of
>> wchar_t on MacOS X and Windows is (you guessed it!) UTF-16. The
>> filesystems for both of these operating systems (which have *far*
>> superior Unicode support than anything else) both use UTF-16 as the
>> native filename encoding (this is true for HFS+, NTFS4, and NTFS5).
>> The only difference between what MacOS X does and Windows does for
>> this is that Apple chose to use decomposed characters instead of
>> composed characters (e.g., LOWERCASE E + COMBINING ACUTE ACCENT
>> instead of LOWERCASE E ACUTE ACCENT).
>>
>> Look at the performance numbers for ICU4C: it's pretty damn good.
>> UTF-32 isn't exactly space conservative (since with UTF-16 *most* of
>> the BMP can be represented with a single wchar_t, and only a few need
>> surrogates taking up exactly *two* wchar_ts, whereas *all* characters
>> would take up four uint32_t under UTF-32). ICU4C uses UTF-16
>> internally. Exclusively.
> I do not care what Windows, OS X, or ICU uses. I care what I want to
> use. Even if most characters are encoded with single word you have to
> cope with multiword characters. That means that a character is not a
> simple type. You cannot have character arrays. And no library can
> completely wrap this inconsistency and isolate you from dealing with
> it.
If you're simply dealing with text, you don't need arrays of characters.
Frankly, if you don't care what Windows, OS X, and ICU use, then you're
completely ignorant of the real world and what is useful and necessary
for Unicode.
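For concreteness, the width claims quoted above are easy to check in a
later Ruby (1.9+ for the Encoding machinery, 2.2+ for unicode_normalize,
both long after this thread); a rough sketch, nothing to do with ICU
internals:

    # A BMP character is a single UTF-16 code unit (2 bytes)...
    "\u00E9".encode("UTF-16BE").bytesize       # => 2   (e with acute)

    # ...a character outside the BMP needs a surrogate pair (4 bytes)...
    "\u{1D11E}".encode("UTF-16BE").bytesize    # => 4   (MUSICAL SYMBOL G CLEF)

    # ...while UTF-32 spends 4 bytes on *every* character.
    "\u00E9".encode("UTF-32BE").bytesize       # => 4
    "\u{1D11E}".encode("UTF-32BE").bytesize    # => 4

    # Composed vs. decomposed (the HFS+ style): same text, different code points.
    "\u00E9".unicode_normalize(:nfd).codepoints.map { |c| format("U+%04X", c) }
    # => ["U+0065", "U+0301"]   (LOWERCASE E + COMBINING ACUTE ACCENT)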
By the way, you are wrong -- you *can* have arrays of characters. It's
just that those characters are not guaranteed to be a fixed length. It
will be the same with Ruby moving forward.
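In the Ruby that eventually shipped (1.9's String#chars; a sketch, not a
promise of the API being discussed here), that looks like:

    s = "aé日\u{1D11E}"            # ASCII, Latin, CJK, and beyond the BMP
    s.length                       # => 4 characters
    s.bytesize                     # => 10 bytes in UTF-8
    s.chars.map(&:bytesize)        # => [1, 2, 3, 4] -- one "array", variable widths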
> Even if the library is performant with multiword characters it is
> complex. That means more prone to errors. Both in itself and in the
> software that interfaces it.
Nice theory. What reduces the number of errors is no longer thinking in
terms of arrays of characters, but in terms of text strings.
> You say that utf-16 is more space-conserving for languages like
> Japanese. Nice. But I do not care. I guess text consumes very small
> portion of memory on my system. Both RAM and hard drive. I do not care
> if that doubles or quadruples. In the very few cases when I want to
> save space (ie when sending email attachments) I can use gzip. It can
> even compress repetitive text which no encoding can.
If you don't care, then why are you arguing here? The Japanese -- which
would include Matz -- *do* care.
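For a sense of the numbers being argued about, a rough sketch with a
later Ruby (1.9+) and the standard zlib library; exact counts depend on
the text:

    require 'zlib'

    text  = "日本語のテキストです。" * 100      # repetitive Japanese text
    utf8  = text.encode("UTF-8")
    utf16 = text.encode("UTF-16BE")

    utf8.bytesize                                # roughly 3 bytes per character
    utf16.bytesize                               # roughly 2 bytes per character
    Zlib::Deflate.deflate(utf8).bytesize         # far smaller than either raw form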
>> On 3/10/06, Anthony DeRobertis <aderobertis / metrics.net> wrote:
>>> Austin Ziegler wrote:
>>> Personally, my file names have been in UTF-8 for quite some time
>>> now, and it works well: What exactly is this 'stone age' you refer
>>> to?
>> Change an environment variable and watch your programs break that
>> had worked so well with Unicode. *That* is the stone age that I refer
>> to. I'm also guessing that you don't do much with long Japanese
>> filenames or deep paths that involve *anything* except US-ASCII (a
>> subset of UTF-8).
> Hmm, so you call the possibility to choose your encoding living in
> stone age. I would call it living in reality. There are various
> encodings out there.
Yes, it's the stone age. The filesystem should allow you to see things
in UTF-8 or SJIS or EUC-JP if you want, but internally it should be
using something a hell of a lot smarter than those encodings. This is
what HFS+ and Windows allow.
>>> UTF-8 can take multiple octets to represent a character. So can
>>> UTF-16, UTF-32, and every other variation of Unicode.
>> This last statement is true only because you use the term "octet."
>> It's a useless term here, because UTF-8 only has any level of
>> efficiency for US-ASCII. Even if you step to European content, UTF-8
>> is no longer perfectly efficient, and when you step to Asian content,
>> UTF-8 is so bloody inefficient that most folks who have to deal with
>> it would rather work in a native encoding (EUC-JP or SJIS, anyone?)
>> which is 1..2 bytes or do everything in UTF-16.
> No, I suspect the reason for using EUC-JP, SJIS, or ISO-8859-*, and
> other weird encodings is historical. What do you mean by efficiency?
> If you want space efficiency use compression. If you want speed, use
> utf-32 or similar encoding that does not have to deal with special
> cases.
You'd be half-right. The historical reason is that most programs still
don't deal with Unicode properly. On Unix/POSIX this is mostly because
of the brain-dead nonsense related to locales. On Windows this is mostly
because of entrenched behaviours.
However, there is significant resistance to Unicode in Asian countries
because of politics, and to UTF-8 in particular because of its
inefficiency both in processing and storage. UTF-32 is equally
inefficient in storage for *all* languages, and UTF-16 is the balance
between those two. That's why UTF-16 was chosen for HFS+ and NTFS.
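A quick illustration of that storage trade-off for Japanese text, again
using a later Ruby's transcoding (the encoding names are as Ruby spells
them):

    text = "こんにちは世界"                      # 7 characters
    %w[UTF-8 UTF-16BE UTF-32BE EUC-JP Shift_JIS].each do |enc|
      puts "#{enc}: #{text.encode(enc).bytesize} bytes"
    end
    # UTF-8: 21, UTF-16BE: 14, UTF-32BE: 28, EUC-JP: 14, Shift_JIS: 14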
>>> Depending on content, a string in UTF-8 can consume more octets than
>>> the same string in UTF-16, or vice versa.
>>>
>>> Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
>>> get to have the fun of picking between big- and little-endian!
>>
>> Are people always this stupid when it comes to things that they clearly
>> don't understand? Yes, UTF-16 may have the problem of not knowing if
>> you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
>> this is *only* an issue when you're dealing with both on the same
>> system. Additionally, most platforms specify a default. It's been a
>> while (almost a year), but I think that ICU4C defaults to UTF-16BE
>> internally, not just UTF-16.
> iirc there are even byte-order marks. If you insert one in every
> string you can get them identified at any time without doubt :)
You're right. There are. They're one of the mistakes with UTF-16, IMO.
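The mark in question is U+FEFF; put at the front of a stream it
identifies the byte order, at the cost of polluting the text itself
(a sketch, again in a later Ruby):

    bom = "\uFEFF"                      # ZERO WIDTH NO-BREAK SPACE, used as a BOM
    bom.encode("UTF-16BE").bytes        # => [254, 255]  (0xFE 0xFF)
    bom.encode("UTF-16LE").bytes        # => [255, 254]  (0xFF 0xFE)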
> But do not trust me on that. I do not know anything about unicode, and
> I want to sidestep the issue by using an encoding that is easy to work
> with, even for ignorants :P
Which is why I want what Matz is providing. Something where it doesn't
matter what encoding you have, but something where Ruby provides the
ability *natively* to switch between these encodings.
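For reference, the model that eventually landed in Ruby 1.9 works
roughly like this -- every String carries its own Encoding and
transcoding is explicit (a sketch, not the whole M17N design):

    s = "文字コード"                    # literal tagged with the source encoding (UTF-8 here)
    s.encoding                          # => #<Encoding:UTF-8>

    e = s.encode("EUC-JP")              # transcode the bytes to a legacy encoding
    e.encoding                          # => #<Encoding:EUC-JP>
    e.encode("UTF-8") == s              # => true -- round-trips losslessly

    s.dup.force_encoding("ASCII-8BIT")  # relabel without touching the bytes (use with care)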
-austin
--
Austin Ziegler * halostatue / gmail.com
               * Alternate: austin / halostatue.ca