On Sun, 21 Sep 2008 02:05:30 +1000, Yukihiro Matsumoto  
<matz / ruby-lang.org> wrote:

> |- How a Japanese programmer would handle the situation of dealing with a
> |combination of a Japanese non-Unicode compatible character set, and say a
> |UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
> |ie: Is there a reasonable alternative to encoding both to Unicode &
> |somehow dealing with the "difficult characters" as special cases?
>
> Unicode is getting better each day.  So it now covers almost all
> day-to-day problems.  Some cellphone problems are covered by using
> private area.

I infer from this that Unicode really is the only (imperfect) solution for
true m17n where we have a mixture of completely different character sets
(eg: Japanese & Arabic)?
What I think this means is that there is no "one size fits all" solution,
unfortunately.

So I have an alternate suggestion. Maybe I should rename this thread  
"Character encodings - a less radical suggestion" :-)

Ruby already has "Encoding::default_external", so why not also have
"default_internal"? This option would either be left unset (nil, I
guess) or set to an encoding. In practice that is likely to be UTF-8, but
one of the Japanese encodings might be a useful choice if you have a
variety of Japanese encodings to handle.

When "default_internal" is nil, Ruby will work as it does now:
- Ruby libraries such as I/O & network libraries will by default return  
character data in the external encoding
- No transcoding will take place unless specifically requested by the Ruby  
program
- The Ruby program is responsible for ensuring that the encodings are what
it expects and that strings passed to & from Ruby libraries are in the
encoding the library expects. If it is not careful, "Encoding Compatibility
Errors" will occur.
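
To make the current behaviour concrete, here is a small sketch. The sample
strings are my own illustration, and it assumes the source file itself is
UTF-8:

```ruby
utf8   = "caf\u00e9"                                  # UTF-8 string
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")   # same text, ISO-8859-1 bytes

# Same characters, different encodings: the strings do not compare equal,
# because nothing is transcoded behind the program's back.
same_without_transcoding = (utf8 == latin1)

# Mixing the two in one operation raises an encoding compatibility error.
mix_error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Transcoding happens only when the program asks for it.
same_after_transcoding = (latin1.encode("UTF-8") == utf8)

puts same_without_transcoding
puts mix_error.class
puts same_after_transcoding
```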

When "default_internal" is set to an encoding "E":
- Ruby libraries such as I/O & networking libraries will by default  
transcode to/from internal encoding E (unless specifically overridden by  
an option to the class)
- A Ruby program can then be confident that all strings it handles will be  
in encoding E, so it doesn't have to worry about encoding compatibility.  
For example it can be sure that if "s" is "abc" then "s == 'abc'" is true,  
no matter where the string "s" originated from.
- Assuming that E is an "ascii-compatible" encoding, the Ruby programmer
doesn't have to face issues like "The value is #{val}" substitution
failing because "val" is not in an ascii-compatible encoding.
- The "downside" as pointed out by a number of people is that not all  
characters may be transcoded cleanly or even be supported (driving without  
a seat-belt? :-)), but then programs requiring this level of control  
should probably not use this feature.
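
A rough sketch of the effect I am after. It asks for the transcoding
explicitly on one stream, using the "external:internal" form of the open
mode, which is what "default_internal" would make the default. The temp
file and its contents are only for illustration, and the source file is
assumed to be UTF-8:

```ruby
require "tempfile"

# Illustrative input: a file whose bytes are ISO-8859-1 ("café", é = 0xE9).
file = Tempfile.new("latin1")
file.binmode
file.write("caf\xE9".b)
file.close

# "r:ISO-8859-1:UTF-8" = external encoding ISO-8859-1, transcoded to the
# internal encoding UTF-8 as the data is read.
s = File.open(file.path, "r:ISO-8859-1:UTF-8") { |f| f.read }

puts s.encoding
puts s == "caf\u00e9"          # comparison with a UTF-8 literal just works
puts "The value is #{s}"       # so does substitution

# The downside: a character with no equivalent in the target encoding
# cannot be transcoded cleanly (HIRAGANA A has no ISO-8859-1 form).
lossy = begin
  "\u3042".encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError
  "\u3042".encode("ISO-8859-1", undef: :replace, replace: "?")
end
puts lossy
```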

Consequences of this suggestion:
- Don't have to change the current implementation of encodings, String or  
Regexp
- Avoids "automagical transcoding" within String & Regexp methods
- Responsibility of implementing "default_internal" lies with a certain  
set of Ruby libraries like IO & networking
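
On the library side, the work might be as small as the sketch below.
"to_internal" is a hypothetical helper name of mine, not an existing
method, and the sketch assumes a Ruby that provides
"Encoding.default_internal" as proposed above. It also ignores binary data,
which a real library would have to leave alone.

```ruby
# Hypothetical helper a library could call on character data it has just
# read. `internal` defaults to the proposed default_internal setting.
def to_internal(str, internal = Encoding.default_internal)
  # Unset (nil): behave as today and hand the string back unchanged.
  return str if internal.nil? || str.encoding == internal
  # Set: transcode to the program's chosen internal encoding.
  str.encode(internal)
end

latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")

untouched = to_internal(latin1, nil)
converted = to_internal(latin1, Encoding::UTF_8)

puts untouched.encoding
puts converted.encoding
```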

Hope this makes sense.
Mike