On 3/10/06, Michal Suchanek <hramrach / centrum.cz> wrote:
> On 3/10/06, Austin Ziegler <halostatue / gmail.com> wrote:
>> On 3/8/06, Richard Gyger <richard / bytethink.com> wrote:
>>> so, you guys are telling me a language developed since the year 2000
>>> doesn't support unicode strings natively? in my opinion, that's a
>>> pretty glaring problem.
>> Please note that Ruby itself is ten years old. Unicode has only
>> *recently* (the last three or four years, with the release of Windows
>> XP) become a major factor, especially in Japan. Unix support for
>> Unicode is still in the stone ages because of the nonsense that POSIX
>> put on Unix ages ago. (When Unix filesystems can write UTF-16 as
>> their native filename format, then we're going to be much better.
>> That will, however, break some assumptions by really stupid
>> programs.)
> Why the hell utf-16? It is no longer compatible with ascii, yet 16
> bits are far from sufficient to cover current unicode. So you still
> get multiword characters. It is not even dword aligned for fast
> processing by current cpus. I would like utf-8 for compatibility, and
> utf-32 for easy string processing. But I do not see much use for
> utf-16.
UTF-16 is actually pretty performant, and the implementation of wchar_t
on MacOS X and Windows is (you guessed it!) UTF-16. The filesystems for
both of these operating systems (which have *far* superior Unicode
support to anything else) both use UTF-16 as the native filename
encoding (this is true for HFS+, NTFS4, and NTFS5). The only difference
between what MacOS X does and what Windows does here is that Apple chose
to use decomposed characters instead of composed characters (e.g.,
LOWERCASE E + COMBINING ACUTE ACCENT instead of LOWERCASE E ACUTE
ACCENT).
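If you want to see what composed versus decomposed actually looks like,
here's a minimal sketch. It assumes a Ruby that exposes normalization as
String#unicode_normalize -- which the Ruby we're arguing about does not --
so treat it as illustration, not working code:

    # Illustration only: assumes a Ruby with String#unicode_normalize.
    composed   = "\u00E9"                       # LATIN SMALL LETTER E WITH ACUTE
    decomposed = composed.unicode_normalize(:nfd)

    p composed.codepoints     # => [233]        (U+00E9, one precomposed character)
    p decomposed.codepoints   # => [101, 769]   (U+0065 'e' + U+0301 COMBINING ACUTE ACCENT)

    p composed == decomposed                    # => false -- different code units
    p composed.unicode_normalize(:nfc) ==
      decomposed.unicode_normalize(:nfc)        # => true  -- canonically equivalent

Same user-visible character, two different byte sequences on disk; that's
the whole HFS+-versus-NTFS difference.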
Look at the performance numbers for ICU4C: it's pretty damn good. UTF-32
isn't exactly space conservative, either: with UTF-16, everything in the
BMP fits in a single wchar_t, and only the characters outside the BMP
need a surrogate pair taking up exactly *two* wchar_ts, whereas *every*
character takes up a full four-byte uint32_t under UTF-32. ICU4C uses
UTF-16 internally. Exclusively.
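To put rough numbers on the space argument, here's a sketch along the
same lines, assuming a String#encode that knows UTF-16BE and UTF-32BE
(again, not something today's interpreter gives you):

    # Rough size comparison. Assumes a Ruby whose String#encode supports
    # UTF-16BE and UTF-32BE targets.
    samples = {
      "US-ASCII letter"     => "A",
      "CJK ideograph (BMP)" => "\u65E5",      # 日
      "Outside the BMP"     => "\u{1D11E}"    # MUSICAL SYMBOL G CLEF
    }

    samples.each do |desc, ch|
      printf "%-22s UTF-16: %d bytes   UTF-32: %d bytes\n",
             desc,
             ch.encode("UTF-16BE").bytesize,
             ch.encode("UTF-32BE").bytesize
    end
    # UTF-16 costs 2, 2, and 4 bytes respectively; UTF-32 costs 4 every time.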
On 3/10/06, Anthony DeRobertis <aderobertis / metrics.net> wrote:
> Austin Ziegler wrote:
>> Unix support for Unicode is still in the stone ages because of the
>> nonsense that POSIX put on Unix ages ago. (When Unix filesystems can
>> write UTF-16 as their native filename format, then we're going to be
>> much better. That will, however, break some assumptions by really
>> stupid programs.)
> Ummm, no. UTF-16 filenames would break *every* correctly-implemented
> UNIX program: UTF-16 allows the octet 0x00, which has always been the
> end-of-string marker.
You're right. And I'm saying that I don't care. People need to stop
thinking in terms of bytes (octets) and start thinking in terms of
characters. I'll say it flat out here: the POSIX filesystem definition
is going to badly limit what can be done with Unix systems. One could do
what I *think* Apple has done and provide two synchronized filesystem
interfaces. The native interface -- the more efficient one -- would use
UTF-16, because that's what HFS+ speaks. The secondary interface (which
also works on UFS filesystems) would translate to UTF-8 and/or follow
the nonsensical POSIX rules for native encodings.
> Personally, my file names have been in UTF-8 for quite some time now,
> and it works well. What exactly is this 'stone age' you refer to?
Change an environment variable and watch the programs that had worked so
well with Unicode break. *That* is the stone age I refer to. I'm also
guessing that you don't do much with long Japanese filenames or deep
paths that involve *anything* except US-ASCII (a subset of UTF-8).
> UTF-8 can take multiple octets to represent a character. So can UTF-16,
> UTF-32, and every other variation of Unicode.
This last statement is true only because you use the term "octet." It's
a useless term here, because UTF-8 is only really efficient for
US-ASCII. Even if you step to European content, UTF-8 is no longer
perfectly efficient, and when you step to Asian content, UTF-8 is so
bloody inefficient that most folks who have to deal with it would rather
work in a native encoding (EUC-JP or SJIS, anyone?), which takes one or
two bytes per character, or do everything in UTF-16.
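A quick back-of-the-envelope check of that claim, with sample strings of
my own and the same assumed String#encode as above:

    # Byte counts for the same text in different encodings. Assumes a Ruby
    # with String#encode; the sample strings are arbitrary examples.
    texts = {
      "English"  => "Hello, world",
      "French"   => "\u00E9l\u00E8ve",        # "élève"
      "Japanese" => "\u65E5\u672C\u8A9E"      # "日本語"
    }

    texts.each do |label, s|
      printf "%-8s  chars: %2d   UTF-8: %2d bytes   UTF-16: %2d bytes\n",
             label, s.length,
             s.encode("UTF-8").bytesize,
             s.encode("UTF-16BE").bytesize
    end
    # English:  12 chars, 12 bytes UTF-8, 24 bytes UTF-16
    # French:    5 chars,  7 bytes UTF-8, 10 bytes UTF-16
    # Japanese:  3 chars,  9 bytes UTF-8,  6 bytes UTF-16

    p texts["Japanese"].encode("EUC-JP").bytesize   # => 6 (two bytes per character)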
> Depending on content, a string in UTF-8 can consume more octets than
> the same string in UTF-16, or vice versa.
>
> Ah! But wait. I can see an advantage to UTF-16. With UTF-8, you don't
> get to have the fun of picking between big- and little-endian!
Are people always this stupid when it comes to things that they clearly
don't understand? Yes, UTF-16 may have the problem of not knowing if
you're dealing with UTF-16BE or UTF-16LE, but it's my understanding that
this is *only* an issue when you're dealing with both on the same
system. Additionally, most platforms specify a default. It's been a
while (almost a year), but I think that ICU4C defaults to UTF-16BE
internally, not just UTF-16.
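For what it's worth, the endianness question is just octet order, and a
byte-order mark is the usual way a stream says which flavour it is.
Another sketch under the same assumed String#encode:

    # Same character, two octet orders. Assumes a Ruby with String#encode.
    ch = "\u65E5"   # 日

    p ch.encode("UTF-16BE").unpack("H*")   # => ["65e5"]
    p ch.encode("UTF-16LE").unpack("H*")   # => ["e565"]

    # The byte-order mark (U+FEFF) lets a stream declare which one it uses:
    p "\uFEFF".encode("UTF-16BE").unpack("H*")   # => ["feff"]
    p "\uFEFF".encode("UTF-16LE").unpack("H*")   # => ["fffe"]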
There. Problem solved.
If you're going to babble on about Unicode, it'd be nice if you knew
more than the knee-jerk stuff you've posted so far. Either of you.
-austin
--
Austin Ziegler * halostatue / gmail.com
               * Alternate: austin / halostatue.ca