On Sun, 21 Sep 2008 02:05:30 +1000, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:

> |- How a Japanese programmer would handle the situation of dealing with a
> |combination of a Japanese non-Unicode compatible character set, and say
> |UTF-8 encoding which included non-ascii characters, and non-Japanese
> |ones.
> |ie: Is there a reasonable alternative to encoding both to Unicode &
> |somehow dealing with the "difficult characters" as special cases?
>
> Unicode is getting better each day.  So it now covers almost all
> day-to-day problems.  Some cellphone problems are covered by using
> private area.

I infer from this that Unicode really is the only (imperfect) solution for true m17n where we have a mixture of completely different character sets (eg: Japanese & Arabic)?

What I think this means is that there is no "one size fits all" solution, unfortunately.

So I have an alternate suggestion. Maybe I should rename this thread "Character encodings - a less radical suggestion" :-)

Ruby already has "Encoding::default_external", so why not also have "default_internal"? This option would either be left unset (nil, I guess) or set to an encoding. In practice that would likely be UTF-8, but there may be a use for choosing, say, one of the Japanese encodings if you have a variety of Japanese encodings to handle.

When "default_internal" is nil, Ruby will work as it does now:

- Ruby libraries such as I/O & network libraries will by default return character data in the external encoding
- No transcoding will take place unless specifically requested by the Ruby program
- The Ruby program is responsible for ensuring that the encodings are what it expects and that strings passed to & from Ruby libraries are in the encoding the library expects; if it is not careful, "Encoding Compatibility Errors" will occur, etc.

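To make the "works as it does now" case concrete, here is a minimal sketch of what a program faces when nothing is transcoded for it. The sample text and the choice of ISO-8859-1 are only assumptions for illustration; any two ascii-compatible encodings with non-ascii content would do.

```ruby
# The same text held in two different encodings, as could happen when
# strings arrive from two different external sources.
utf8   = "caf\u00e9"                  # UTF-8
latin1 = utf8.encode("ISO-8859-1")    # same characters, different bytes

# Comparison is done on encoding + bytes, so the two do not compare equal.
same_text = (utf8 == latin1)

# Combining them is refused rather than silently transcoded.
error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# The program has to request the transcoding itself.
joined = utf8 + latin1.encode("UTF-8")
```

Here "same_text" is false and "error" is Encoding::CompatibilityError; only the explicit "encode" call makes the two strings usable together.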
When "default_internal" is set to an encoding "E":

- Ruby libraries such as I/O & networking libraries will by default transcode to/from internal encoding E (unless specifically overridden by an option to the class)
- A Ruby program can then be confident that all strings it handles will be in encoding E, so it doesn't have to worry about encoding compatibility. For example it can be sure that if "s" is "abc" then "s == 'abc'" is true, no matter where the string "s" originated from.
- Assuming that E is an "ascii-compatible" encoding, the Ruby programmer doesn't have to face issues like "The value is #{val}" substitution failing because "val" is non-ascii compatible.
- The "downside" as pointed out by a number of people is that not all characters may be transcoded cleanly or even be supported (driving without a seat-belt? :-)), but then programs requiring this level of control should probably not use this feature.

Consequences of this suggestion:

- Don't have to change the current implementation of encodings, String or Regexp
- Avoids "automagical transcoding" within String & Regexp methods
- Responsibility of implementing "default_internal" lies with a certain set of Ruby libraries like IO & networking

Hope this makes sense.

Mike
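
As a rough illustration of the two modes described above, the sketch below imitates what an I/O library might do with bytes it has just read. The helper "tag_or_transcode" and its parameters are hypothetical and exist only for this example; it is not an existing Ruby API, and UTF-16LE is chosen only because it is not ascii-compatible.

```ruby
# Hypothetical helper (not part of any Ruby library): labels raw bytes
# with the external encoding and, if an internal encoding is given,
# transcodes to it.
def tag_or_transcode(bytes, external, internal = nil)
  str = bytes.dup.force_encoding(external)  # label only, bytes untouched
  return str if internal.nil?               # "default_internal" unset
  # "default_internal" set to E: transcode. This raises if a character
  # has no equivalent in E, which is the downside mentioned above.
  str.encode(internal)
end

# Bytes as they might arrive from a UTF-16LE file.
raw = "abc".encode("UTF-16LE").b

as_is = tag_or_transcode(raw, "UTF-16LE")            # internal unset
in_e  = tag_or_transcode(raw, "UTF-16LE", "UTF-8")   # internal = UTF-8

unset_equal = (as_is == "abc")   # false: still UTF-16LE bytes
set_equal   = (in_e == "abc")    # true: transcoded to E
message     = "The value is #{in_e}"
```

With the internal encoding set, the comparison and the string substitution both behave as the program would naively expect; with it unset, the program would have to call "encode" itself before comparing.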