On Thu, 10 Mar 2005 04:10:17 +0900, Berger, Daniel
<Daniel.Berger / qwest.com> wrote:
>> -----Original Message-----
>> From: Austin Ziegler [mailto:halostatue / gmail.com ]
>> Sent: Wednesday, March 09, 2005 11:48 AM
>> To: ruby-core / ruby-lang.org 
>> Subject: Re: Win32 Non-ASCII Filename Access
>> Okay -- let's try again. Ruby isn't written in Microsoft's
>> dialect of C++. It doesn't use TCHAR. It uses char. Saying that
>> Ruby needs to use TCHAR would be a Bad Thing.
> Why? Who's to say MS got it wrong? Types like TCHAR can easily be
> defined in ruby.h (or wherever), since *nix understands wchar_t
> perfectly well. Would it not have been better to do this:

MS got it wrong. wchar_t isn't portable -- its size and encoding
differ from platform to platform. UTF-8 is.
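To see what I mean, here's a trivial C check (the results are typical
for MSVC and gcc; nothing about wchar_t's size is guaranteed by the C
standard):

    /* wchar_t is whatever the platform says it is: 2 bytes (UCS-2 or
       UTF-16) with MSVC on Windows, usually 4 bytes (UCS-4) with gcc
       on Linux.  A UTF-8 string is plain 1-byte chars everywhere. */
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));
        printf("sizeof(char)    = %u\n", (unsigned)sizeof(char));
        return 0;
    }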

> #ifdef UNICODE
> typedef wchar_t TCHAR;
> #else
> typedef unsigned char TCHAR;
> #endif
> 
> And then have Matz decree, "Thou shallt use TCHAR, not char*, in
> your C extensions"? Would it not have been better for Ruby 1.x if
> it had taken this approach?

No, it wouldn't have been. Again, not all extensions could
legitimately use wchar_t in any case.
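For example -- a made-up extension snippet, using zlib only because
it's a typical char*-only library:

    /* Hypothetical extension code: the external library only knows
       char*. */
    #include "ruby.h"
    #include <zlib.h>   /* gzopen(const char *path, const char *mode) */

    static VALUE my_open(VALUE self, VALUE path)
    {
        /* StringValuePtr hands us a char*; there is no wchar_t* we
           could give gzopen even if Ruby's internal string type were
           changed. */
        gzFile f = gzopen(StringValuePtr(path), "rb");
        if (f == NULL) rb_raise(rb_eIOError, "cannot open file");
        gzclose(f);
        return Qnil;
    }

The same goes for iconv, OpenSSL, database client libraries, and so
on: their headers say char*, and that's that.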

> This isn't a rhetorical question - I'm genuinely curious. Are
> there factors I'm not considering that make this impractical?

Yes. External linkages that depend on char, not wchar_t. The regular
expression engine(s) may well not work with wide characters, either.
Not only that: is _tcslen a #define for strlen or wcslen on Unix, the
way it is on Windows via <tchar.h>? (A sketch of that mapping
follows.) There's yet a worse problem lurking in your assumption
below, though.
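For reference, this is roughly what <tchar.h> does on Windows
(simplified, from memory); there is no standard Unix header that
provides the same mapping:

    /* The generic-text names flip between the char and wchar_t
       functions depending on whether _UNICODE is defined. */
    #ifdef _UNICODE
    typedef wchar_t   _TCHAR;
    #define _tcslen   wcslen
    #define _tcscpy   wcscpy
    #else
    typedef char      _TCHAR;
    #define _tcslen   strlen
    #define _tcscpy   strcpy
    #endif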

>> I don't have the Ruby code in front of me, but a lot of things
>> probably wouldn't work quite the same if we used the UNICODE
>> macro. String#each_byte, anyone?
> It would be a case of caveat programmor in the case of String
> methods and the like. If you're using unicode, then something like
> String#each_byte would either return 2 bytes per char, or would
> yield two separate 8-bit chars.

But then we have two *different* versions of Ruby simply based on
how Ruby was compiled. If someone doesn't #define UNICODE, they'll
get one set of semantics; if they do, they'll get another. BAD BAD
BAD language.

> Maybe that's a bit harsh, but my philosophy is, if you're going to
> use Unicode, make sure you understand the ramifications.
> Undoubtedly, many will disagree. :)

I completely disagree, especially since there's a perfectly suitable
format that's 8-bit clean and supported by numerous tools out there,
including Oniguruma. It's called UTF-8.

Maybe it would be nice if Ruby provided some helper functions or
macros to wrap MultiByteToWideChar and WideCharToMultiByte -- I'd
gladly provide those. But fundamentally, if you switch all chars to
wchar_t, you lose the ability to link against and call all *sorts* of
libraries.
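Something like this, off the top of my head (untested; the function
name is just a placeholder, not an existing Ruby API):

    /* Sketch of a UTF-8 -> UTF-16 helper over the Win32 API.  The
       caller frees the returned buffer; error handling is minimal. */
    #include <windows.h>
    #include <stdlib.h>

    static WCHAR *utf8_to_wide(const char *utf8)
    {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        WCHAR *wide;

        if (len == 0) return NULL;
        wide = (WCHAR *)malloc(len * sizeof(WCHAR));
        if (wide == NULL) return NULL;
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
        return wide;
    }

The reverse direction is the same pattern with WideCharToMultiByte.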

> I'm actually curious what Matz and other folks think about the
> idea of defining a TCHAR for Ruby.

I think it would be an unmitigated disaster.

All I need is the ability to work with non-ANSI filenames; you may
need a bit more, to call the wide-character ("W") versions of certain
Win32 functions. I don't need to change the fundamental character
type for Ruby -- I *prefer* working with UTF-8 strings over UCS-2
strings (which aren't even "good" Unicode strings; the preferred form
is UTF-16 if you're going to go with wide characters). Why break
something that works?

-austin
-- 
Austin Ziegler * halostatue / gmail.com
               * Alternate: austin / halostatue.ca