On Thu, 10 Mar 2005 04:10:17 +0900, Berger, Daniel
<Daniel.Berger / qwest.com> wrote:
>> -----Original Message-----
>> From: Austin Ziegler [mailto:halostatue / gmail.com ]
>> Sent: Wednesday, March 09, 2005 11:48 AM
>> To: ruby-core / ruby-lang.org
>> Subject: Re: Win32 Non-ASCII Filename Access

>> Okay -- let's try again. Ruby isn't written in Microsoft's
>> dialect of C++. It doesn't use TCHAR. It uses char. Saying that
>> Ruby needs to use TCHAR would be a Bad Thing.

> Why? Who's to say MS got it wrong? Types like TCHAR can easily be
> defined in ruby.h (or wherever), since *nix understands wchar_t
> perfectly well. Would it not have been better to do this:

MS got it wrong. wchar_t isn't portable: it is 16 bits on Windows
and typically 32 bits on Unix. UTF-8 is portable.

> #ifdef UNICODE
> typedef wchar_t TCHAR;
> #else
> typedef unsigned char TCHAR;
> #endif
>
> And then have Matz decree, "Thou shalt use TCHAR, not char*, in
> your C extensions"? Would it not have been better for Ruby 1.x if
> it had taken this approach?

No, it wouldn't have been. Again, not all extensions could
legitimately use wchar_t in any case.

> This isn't a rhetorical question - I'm genuinely curious. Are
> there factors I'm not considering that make this impractical?

Yes. External linkages that depend on char, not wchar_t. The regular
expression engine(s) may not work as well, either. Not only that,
but is there a "_tcslen" on Unix that maps to "strlen" or "wcslen"
as there is on Windows? There's yet a worse problem lurking in your
assumption below, though.

>> I don't have the Ruby code in front of me, but a lot of things
>> probably wouldn't work quite the same if we used the UNICODE
>> macro. String#each_byte, anyone?

> It would be a case of caveat programmor in the case of String
> methods and the like. If you're using unicode, then something like
> String#each_byte would either return 2 bytes per char, or would
> yield two separate 8-bit chars.

But then we have two *different* versions of Ruby simply based on
how Ruby was compiled.
If someone doesn't #define UNICODE, they'll get one set of
semantics; if they do, they'll get another. BAD BAD BAD language.

> Maybe that's a bit harsh, but my philosophy is, if you're going to
> use Unicode, make sure you understand the ramifications.
> Undoubtedly, many will disagree. :)

I completely disagree, especially since there's a perfectly suitable
format that's 8-bit clean and supported by numerous tools out there,
including Oniguruma. It's called UTF-8. Maybe it would be nice if
Ruby provided some helper functions or macros to wrap
MultiByteToWideChar and WideCharToMultiByte -- I'd gladly provide
those. But fundamentally, if you switch all chars to wchar_t, you
lose the ability to link and call all *sorts* of libraries.

> I'm actually curious what Matz and other folks think about the
> idea of defining a TCHAR for Ruby.

I think it would be an unmitigated disaster. All I need is the
ability to work with non-ANSI filenames; you may need a bit more to
call the Unicode versions of certain functions that work better with
Unicode strings. I don't need to change the fundamental character
type for Ruby -- I *prefer* working with UTF-8 strings over UCS-2
strings (which aren't even "good" Unicode strings; the preferred
form is UTF-16 if you're going to go with wide characters).

Why break something that works?

-austin
--
Austin Ziegler * halostatue / gmail.com
               * Alternate: austin / halostatue.ca
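
P.S. For the kind of helper mentioned above, here is a minimal
sketch. On Windows the real thing would presumably just call
MultiByteToWideChar with CP_UTF8; the version below is a portable
stand-in that does the UTF-8 to UTF-16 conversion by hand so it
compiles anywhere. The name utf8_to_utf16 and its error convention
are hypothetical, not anything that exists in Ruby or Win32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * utf8_to_utf16 -- hypothetical helper, for illustration only.
 * Converts a NUL-terminated UTF-8 string into UTF-16 code units.
 * Returns the number of code units written (excluding the trailing 0),
 * or (size_t)-1 on malformed input or if dst is too small.
 */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_len)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    while (*s) {
        uint32_t cp;   /* decoded code point */
        uint32_t min;  /* smallest code point allowed for this length */
        int extra;     /* continuation bytes still expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; min = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;       /* invalid lead byte */
        s++;

        /* A NUL here fails the continuation test, so we never overrun. */
        for (; extra > 0; extra--, s++) {
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, out-of-range values, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (n + 1 >= dst_len) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        } else {
            /* Outside the BMP: emit a surrogate pair. */
            if (n + 2 >= dst_len) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }

    if (n >= dst_len) return (size_t)-1;
    dst[n] = 0;
    return n;
}
```

The point of the design is that Ruby strings stay plain char and
UTF-8 everywhere; the wide form exists only at the boundary, in a
temporary buffer handed to a wide Win32 call such as CreateFileW
(on Windows, where wchar_t is 16 bits).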