Hi,

Since doing the default_internal patch, I have had a quick look at some  
libraries to try to see what is involved in making them handle encodings  
properly. As a result, I have a number of suggestions.

The major aim of the "default_internal" patch was to make it easier for  
Ruby programmers to handle multiple encodings and to avoid "Encoding  
Compatibility Errors". In my brief time playing with the patch I think it  
achieves this reasonably well, but I think a few minor improvements could  
be made.

Also, when writing Ruby libraries (whether standard or user-written), we
want to avoid worrying about encodings for the most part, unless of course
the library exists specifically to handle encodings. This can be
difficult, because the library has no control over the encodings used in
the calling program. What should be done?

Back to basics: where do strings with different encodings come from?
1) Data read from different files and the network. "default_internal"
certainly helps unify things here.
2) String and regexp literals in your program.
3) Strings constructed in your program by transforming other forms of
data, e.g. using Array#pack, encrypting/decrypting,
compressing/decompressing, etc.
4) Strings constructed in your program by combining/manipulating other
strings.

As long as all these strings are in the same or at least "compatible"  
encodings, there are no problems.
Item (2) above is not a major problem as long as:
- All String & Regexp literals are ASCII, and
- "default_internal" is ASCII-compatible.

In practice, unless you are writing code expressly to deal with certain  
encodings, your string & regexp literals will be ASCII. I believe that  
this is the case for the standard Ruby libraries, for example.

Item (3) remains problematic, as the library or method may not know what  
encoding the data should be set to.
Item (4) will produce compatible strings if all the strings used are  
compatible (except when you change encodings on purpose).
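
As a rough illustration of item (3), Array#pack hands back a binary
(ASCII-8BIT) string, and only the caller can say what the bytes really
are. The byte values below are just an example I picked (the UTF-8 bytes
of "café").

```ruby
# Bytes that happen to be "café" in UTF-8; pack itself cannot know that.
packed = [0x63, 0x61, 0x66, 0xC3, 0xA9].pack("C*")
puts packed.encoding == Encoding::ASCII_8BIT   # pack("C*") yields a binary string

# The caller, who knows what the data means, tags it explicitly.
# force_encoding only relabels the bytes; it does not transcode them.
text = packed.dup.force_encoding("UTF-8")
puts text.valid_encoding?
puts text == "caf\u00E9"
```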

Most libraries that we write and use do not (and should not) care about  
the string encodings used or returned. I'll call these libraries "encoding  
naive".

Therefore I'd like to propose some "golden rules/guidelines":

a) When writing encoding naive libraries, only ever use ASCII string &  
regexp literals in them.

b) "default_internal" should ALWAYS be set to an ASCII-compatible encoding.

c) Encoding naive libraries should not attempt to support
non-ASCII-compatible internal encodings.

d) Libraries should expect all character-data strings passed to them to be
in the "default_internal" encoding. Input "binary" data strings should be
"forced" to ASCII-8BIT if necessary. In most cases there should NOT be a
need for a library to check the encoding of strings passed to it. (If the
calling program is not consistent about the encodings of strings it passes
to libraries, there *will* be Encoding Compatibility Errors, but I think
that's acceptable.)

e) Libraries should normally return string data in the "default_internal"  
encoding if the data is characters, or in ASCII-8BIT if it is "binary" (or  
perhaps if the encoding cannot be determined).
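
A minimal sketch of what rules (d) and (e) might look like in a library.
The module NaiveLib, its method names, and the internal: keyword are all
hypothetical; the fallback to default_external when default_internal is
nil is my assumption, not existing behaviour.

```ruby
module NaiveLib  # hypothetical library, for illustration only
  # Rule (e): character data goes back to the caller in default_internal.
  # The internal: parameter is an assumption so the target can be overridden.
  def self.text_result(str, internal: Encoding.default_internal || Encoding.default_external)
    str.encode(internal)
  end

  # Rule (d): binary input is relabelled as ASCII-8BIT, without transcoding.
  # dup keeps the caller's string untouched.
  def self.binary_input(data)
    data.dup.force_encoding(Encoding::ASCII_8BIT)
  end
end

latin1 = "caf\u00E9".encode("ISO-8859-1")
out    = NaiveLib.text_result(latin1, internal: Encoding::UTF_8)
bin    = NaiveLib.binary_input("\xFF\xFE")

puts out.encoding   # UTF-8
puts bin.encoding
```

Note that neither method inspects the encoding of its argument; the
library just states what it returns.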

Many existing libraries already comply with these rules. Those that don't  
will need a few changes, but by sticking to these guidelines, I think you  
can avoid most of the pain that James went through on CSV.

To assist further, I'd also like to make a few suggestions for Ruby itself:

1. default_internal should always be set (like "default_external"). If not
specified, I suggest it default to the same value as "default_external".
Note: this is not 100% backward compatible with 1.9.0. You can use mode
"ext:-" in IO if you really need to suppress transcoding on input.
2. default_internal should not be settable to a non-ASCII-compatible
encoding (this ensures compatibility with ASCII string literals).
3. IO#write and friends should be changed so that when writing a file with
an external encoding of ASCII-8BIT, no transcoding is attempted, i.e. the
raw bytes are just written out. This will help with writing a file
containing multiple or arbitrary encodings (you won't have to use
force_encoding("ASCII-8BIT") all the time).
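
For reference, the force_encoding workaround mentioned in item 3 looks
roughly like this. It is only a sketch: the temp-file name and the sample
strings are made up, and it shows the workaround rather than the proposed
IO#write behaviour.

```ruby
require "tempfile"

utf8   = "caf\u00E9"                 # 5 bytes in UTF-8
latin1 = utf8.encode("ISO-8859-1")   # 4 bytes in ISO-8859-1

raw = nil
Tempfile.create("mixed") do |f|
  f.binmode                          # external encoding becomes ASCII-8BIT
  [utf8, latin1].each do |s|
    # Relabel each string as binary so its bytes are written untouched.
    f.write(s.dup.force_encoding(Encoding::ASCII_8BIT))
  end
  f.rewind
  raw = f.read                       # read the file back as raw bytes
end

puts raw.bytesize                    # 9 (5 + 4)
puts raw.bytes == utf8.bytes + latin1.bytes
```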

What do you think? I am sure there are things I missed.

Cheers
Mike