On Mon, 22 Sep 2008 23:03:12 +1000, James Gray <james / grayproductions.net>  
wrote:

> On Sep 21, 2008, at 9:35 PM, Martin Duerst wrote:
>
>> In terms of potential problems, I see the following:
>> - A library sets Encoding::default_internal. That would lead
>>  to serious problems, and should be clearly advised against
>>  in the documentation. Libraries either have to be written
>>  in a general way, or have to document that they only work
>>  with certain values of Encoding::default_internal
>>  (this proposal would therefore help you, but not e.g.
>>   James Gray for the CSV library)
>
> I really think we need to avoid any solution that means we will need to  
> change all existing libraries, even just to declare their supported  
> encodings.  Enough libraries are already broken on 1.9 without us adding  
> to that and so many great libraries are no longer maintained at all.
>
> The current situation is probably that we have to be very careful what  
> we pass into these Unicode-only libraries to get them to work.  That's
> far from ideal, but it's better than having the library fail to load at
> all due to some global setting I may not have even created (assuming I  
> required code that made the change).

As long as "default_internal" is used sanely, I actually think that it may  
IMPROVE the library support situation, because its use will make "encoding  
compatibility errors" less likely to rear their ugly heads.
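
To illustrate the kind of "encoding compatibility error" meant here, the
sketch below uses made-up strings (it is not taken from any library).
Joining two strings that both contain non-ASCII characters but carry
different encodings raises Encoding::CompatibilityError; converting one of
them to a common encoding first avoids it.

```ruby
# Two strings with non-ASCII content in different encodings.
utf8   = "caf\u00E9"                         # UTF-8
latin1 = "na\u00EFve".encode("ISO-8859-1")   # same kind of text, Latin-1 bytes

# Concatenating them directly fails, because neither string is pure ASCII.
error = nil
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  error = e
end

# Transcoding to one common encoding first makes the operation safe.
combined = utf8 + latin1.encode("UTF-8")
```

If everything read through IO already arrived in one encoding, the explicit
encode call in the last line would not be needed.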

As long as IO obeys default_internal's setting, I think most other
libraries should just work. I quickly checked open-uri, for example, and
(assuming I understand the code correctly) it calls IO#set_encoding with
the charset read from the HTTP header, which sets the "external
encoding" of the socket. So as long as IO leaves the "internal encoding"
set to default_internal, open-uri should work as required,
returning the data in the default_internal encoding.
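
A minimal sketch of the external/internal mechanism, using a temporary
file instead of a socket. The helper name read_transcoded is hypothetical,
and both encodings are passed explicitly here so the example does not
depend on any global setting; this is not how open-uri itself is written.

```ruby
require "tempfile"

# Hypothetical helper: treat the bytes on disk as `external` and hand
# back strings transcoded to `internal`.
def read_transcoded(path, external, internal)
  File.open(path, "r") do |io|
    io.set_encoding(external, internal)
    io.read
  end
end

text = Tempfile.create("latin1") do |f|
  # Write "café" as raw ISO-8859-1 bytes (0xE9 is e-acute in Latin-1).
  File.binwrite(f.path, "caf\u00E9".encode("ISO-8859-1"))
  read_transcoded(f.path, "ISO-8859-1", "UTF-8")
end
```

The caller never sees the Latin-1 bytes; it only sees a string in the
internal encoding, which is the behaviour the proposal relies on.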

By "sanely" I mean that default_internal is set at the start of the
program and not changed afterwards (or at least not changed between reads
of the same file, for instance). Also, if libraries that support only
Unicode are used, then default_internal should either NOT be set (and the
Ruby program must then be careful about what it passes to those
libraries) or be set to UTF-8. Similarly, if a library only supports
ASCII, you wouldn't want to set default_internal to an encoding that is
not ASCII-compatible (very unlikely, I think).
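
One way a program could follow these rules is sketched below. It assumes a
Ruby in which Encoding.default_internal= is available as proposed; the
helper choose_internal_encoding is hypothetical and only illustrates the
"set once at startup, and only to an ASCII-compatible encoding" policy.

```ruby
# Hypothetical helper: look up an encoding by name and refuse any that
# is not ASCII-compatible (for example UTF-16LE).
def choose_internal_encoding(name)
  enc = Encoding.find(name)
  unless enc.ascii_compatible?
    raise ArgumentError, "#{enc} is not ASCII-compatible"
  end
  enc
end

# Done once, at the very start of the program, before any library
# reads from IO. Assumption: nothing else changes it later.
Encoding.default_internal = choose_internal_encoding("UTF-8")
```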

I guess if the possibility of changing "default_internal" seems too  
problematic, it could be implemented the way "default_external" is -  
read-only and set either via a command line flag or to a default. Perhaps  
the default should simply be the encoding of the ruby program itself. But  
this idea would mean that for Ruby to behave as it does at the moment, you  
would have to specifically turn it off somehow.

Mike