On 3/13/06, Michal Suchanek <hramrach / centrum.cz> wrote:
> On 3/12/06, Austin Ziegler <halostatue / gmail.com> wrote:
>> On 3/11/06, Michal Suchanek <hramrach / centrum.cz> wrote:
>>> I do not care what Windows, OS X, or ICU uses. I care what I want to
>>> use. Even if most characters are encoded with single word you have
>>> to cope with multiword characters. That means that a character is
>>> not a simple type. You cannot have character arrays. And no library
>>> can completely wrap this inconsistency and isolate you from dealing
>>> with it.
>> If you're simply dealing with text, you don't need arrays of
>> characters. Frankly, if you don't care what Windows, OS X, and ICU
>> use, then you're completely ignorant of the real world and what is
>> useful and necessary for Unicode.
> The native encoding is bound to be different between platforms. I want
> to use an encoding that I like on all platforms, and convert the
> strings for filenames or whatever to fit the current platform. That is
> why I do not care what a particular platform you name uses.
I think you're just confused here, Michal.
>> By the way, you are wrong -- you *can* have arrays of characters.
>> It's just that those characters are not guaranteed to be a fixed
>> length. It will be the same with Ruby moving forward.
> Yes, you can have arrays of strings. Nice. But to turn a text string
> into a string of characters you have to turn it into an array of
> strings. Instead of just indexing an array of basic types that
> represent the characters.
> And there is a need to look at the actual characters at times. There
> are programs that actually process the text, not only save what the
> user entered in a web form. I can think of text editors, terminal
> emulators, and linguistic tools. I am sure there are others.
NO! This is where you're 100% wrong. Text editors, terminal emulators,
and linguistic tools *especially* should never be looking at the raw
bytes underneath the character strings. They should be dealing with the
characters as discrete entities.
This is what I'm talking about. Byte arrays as characters are nonsense
in today's world. If you don't have an encoding attached to something,
then you can't *possibly* know what it means.
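To make that concrete, here is a minimal Ruby 1.8 sketch (it assumes
the string really is UTF-8 and that the jcode library from the standard
distribution is available). Indexing by byte slices a multibyte
character apart, while working at the character level keeps it whole:

  str = "h\xc3\xa9llo"       # "héllo" -- the é is two bytes in UTF-8

  # Byte-oriented access slices the multibyte character apart:
  str.length                 # => 6, bytes rather than characters
  str[1, 1]                  # => "\303", half of the é

  # Character-oriented access keeps it whole:
  chars = str.scan(/./u)     # => ["h", "\303\251", "l", "l", "o"]
  chars.length               # => 5
  chars[1]                   # => "\303\251", the whole é

  # jcode adds similar character-aware helpers:
  $KCODE = 'u'
  require 'jcode'
  str.jlength                # => 5

The scan(/./u) trick is obviously a workaround; the point of
encoding-aware (M17N) strings is that the String class does this for
you.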
>>> Even if the library is performant with multiword characters it is
>>> complex. That means more prone to errors. Both in itself and in the
>>> software that interfaces it.
>> Nice theory. What reduces the number of errors is no longer thinking
>> in terms of arrays of characters, but in terms of text strings.
> Or strings of strings of 16-bit words, packed? No, thanks. I want to
> avoid that.
Um. You're confused here. It's a text string with a UTF-16 encoding.
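To put numbers on that: a UTF-16 string is still a sequence of
characters; it just happens that a character occupies one or two 16-bit
code units, and an encoding-aware string hides that detail. A rough irb
sketch, assuming Iconv (in the 1.8 standard library) and an iconv that
understands the UTF-16BE name:

  require 'iconv'

  # U+1D11E, MUSICAL SYMBOL G CLEF: one character, four bytes of UTF-8.
  clef_utf8 = [0x1D11E].pack('U')
  clef_utf8.scan(/./u).length            # => 1 character

  # The same character in UTF-16 is a surrogate pair -- two code units:
  clef_utf16 = Iconv.conv('UTF-16BE', 'UTF-8', clef_utf8)
  clef_utf16.unpack('n*').map { |u| '%04X' % u }
  # => ["D834", "DD1E"]

Whether that character costs one code unit or two is the encoding's
problem, not yours: you ask the string for characters, not for 16-bit
words.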
>>> You say that utf-16 is more space-conserving for languages like
>>> Japanese. Nice. But I do not care. I guess text consumes very small
>>> portion of memory on my system. Both ram and hardrive. I do not
>>> care if that doubles or quadruples. In the very few cases when I
>>> want to save space (ie when sending email attachments) I can use
>>> gzip. It can even compress repetitive text which no encoding can.
>> If you don't care, then why are you arguing here? The Japanese --
>> which would include Matz -- *do* care.
> I do not care about the space inefficiency. Be it inefficiency in
> storing Czech text, Japanese text, English text, or any other. It has
> nothing to do with the fact I do not speak Japanese.
> I think that most of my ram and hardrive space is consumed by other
> stuff than text. For that reason I do not care about the relative
> efficiency of text encoding. It will have minimal impact on the
> performance or amounts of memory consumed on the system. And there is
> always the possibility to compress the text.
Then you are willfully ignorant of the concerns of a lot of people.
>>>> On 3/10/06, Anthony DeRobertis <aderobertis / metrics.net> wrote:
>>>>> Austin Ziegler wrote:
>>>>> Personally, my file names have been in UTF-8 for quite some time
>>>>> now, and it works well: What exactly is this 'stone age' you refer
>>>>> to?
>>>> Change and environment variable and watch your programs break that
>>>> had worked so well with Unicode. *That* is the stone age that I
>>>> refer to. I'm also guessing that you don't do much with long
>>>> Japanese filenames or deep paths that involve *anything* except
>>>> US-ASCII (a subset of UTF-8).
>>> Hmm, so you call the possibility to choose your encoding living in
>>> stone age. I would call it living in reality. There are various
>>> encodings out there.
>> Yes, it's the stone age. The filesystem should allow you to see
>> things in UTF-8 or SJIS or EUC-JP if you want, but internally it
>> should be using something a hell of a lot smarter than those
>> encodings. This is what HFS+ and Windows allow.
> Well, the libc could store the strings in some utf-* encoding on the
> disk, and translate that based on the current locale. I wonder if that
> is against POSIX or not.
It's against POSIX because POSIX specifies that the filesystem stores
filenames as plain strings of single-byte characters and that 0x00 is
the filename terminator. POSIX, though, is stupid.
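Since POSIX only ever sees a NUL-terminated byte string, any
translation has to happen above the filesystem. Roughly what Michal
describes libc doing looks like this in Ruby -- a sketch using Iconv,
where the filename and the ISO-8859-2 locale charset are made up for
illustration:

  require 'iconv'

  internal = "p\xc5\x99\xc3\xadklad.txt"   # filename kept internally as UTF-8
  locale_charset = 'ISO-8859-2'            # hypothetical Czech locale charset

  # POSIX stores whatever bytes we hand it, so convert before touching
  # the filesystem:
  on_disk = Iconv.conv(locale_charset, 'UTF-8', internal)
  File.open(on_disk, 'w') { |f| f.puts 'hello' }

Of course, every program touching the file has to agree on the charset
for this to round-trip, which is where the breakage comes from.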
> But it is not done, and it is wrong. There are problems of this kind
> on Windows as well. It is still not recommended to use non-ascii
> characters in filenames around here..
Actually, that's only if you're using stupid programs. Unfortunately,
that includes Ruby right now. When Matz gets the M17N strings checked
into Ruby 1.9, I will be working toward significantly improving the
Windows filesystem handling so that full Unicode is supported.
-austin
--
Austin Ziegler * halostatue / gmail.com
               * Alternate: austin / halostatue.ca