[most of this mail, and some others, was written Wednesday,
and so may repeat some of what Matz and others have said,
but I had big problems getting mail out.]

At 13:21 08/09/17, Michael Selig wrote:

>I have been doing some more thinking about these ongoing issues....
>
><soapbox>
>
>Using Ruby SHOULD be making our lives easier, not harder.

Very much so.

>Other languages  
>like Python have taken an easier route to m17n - represent all strings  
>internally as unicode codepoints. Then there should never be a need to  
>check encoding compatibility, right?

Yes. The requirement is that your application has to know which
encoding it's dealing with, and that you have to be able to convert
everything, even the 'private use' characters that appear fairly
often in East Asian encodings.

>I am not saying that this is a  
>perfect solution either, by the way. But having to work around this  
>"Encoding Compatibility Error" all the time is just a pain for apps which  
>need to work in different countries with different locales. Unfortunately  
>it is leading me towards the path of having to transcode everything to  
>UTF-8, even though in 99% of cases all the data IS going to be compatible  
>and be in the user's locale. I don't want so much of my time taken up, and  
>be forced to write ugly code to take care of the remaining 1%.

In my view, you either have a true single-encoding situation, in which
case Ruby should work great, or you have a mixed-encoding situation.
And even 1% of "other" encodings means a mixed situation.
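
To make that 1% concrete, here is a minimal sketch. The strings and
the helper name try_concat are made up for this illustration; only
String#encode and Encoding::CompatibilityError are actual Ruby.
ASCII-only data combines with anything, which is why the problem
stays hidden until the first non-ASCII character in an "other"
encoding shows up.

```ruby
# Hypothetical helper, just for this illustration: returns the concatenated
# string, or the exception if Ruby refuses to combine the two encodings.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  e
end

utf8   = "r\u00E9sum\u00E9"                 # UTF-8, contains non-ASCII characters
latin1 = "caf\u00E9".encode("ISO-8859-1")   # ISO-8859-1, contains non-ASCII characters
ascii  = "plain".encode("ISO-8859-1")       # ISO-8859-1 label, but ASCII-only content

p try_concat(utf8, ascii).encoding          # ASCII-only data combines fine
p try_concat(utf8, latin1).class            # two non-ASCII encodings do not
p try_concat(utf8, latin1.encode("UTF-8"))  # transcoding first makes it work
```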

In a mixed situation, going "Unicode inside" (which for Ruby means
"UTF-8 inside") is the best thing to do in most cases. Unicode inside
is a model that many, many applications and several programming languages
have chosen for many good reasons. Ruby currently supports it, but not
as seamlessly as it could. Getting more input about where things
hurt most is very helpful.

There are probably two things that differ from "all Unicode inside"
programming languages such as Perl, Python, and Java:

- Because Ruby allows you to use all kinds of non-Unicode encodings,
  it may give the impression that things work with mixed encodings,
  and lets you postpone some necessary cleanup that you'd otherwise do
  upfront.

- When reading data, in Java and friends, you only have to indicate
  the external encoding. In Ruby, you have to mention UTF-8, too,
  because otherwise the encoding is used just as a label, without
  conversion. For a "Unicode inside" application, that's an additional
  burden. [I'm glad to see that Matz thinks that's ugly, too,
  and wants to do something about it in the future.]
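
The second point can be sketched as follows. The file contents and
the helper name read_both_ways are assumptions made up for this
example, and it assumes Encoding.default_internal is left at its
default of nil. With only an external encoding in the mode string,
the bytes are read unchanged and merely labelled; naming UTF-8 as
the internal encoding as well is what triggers the conversion.

```ruby
require "tempfile"

# Hypothetical helper for this illustration: read the same file twice,
# once with only an external encoding and once with an internal one as well.
def read_both_ways(path, external)
  labelled   = File.open(path, "r:#{external}") { |f| f.read }
  transcoded = File.open(path, "r:#{external}:UTF-8") { |f| f.read }
  [labelled, transcoded]
end

Tempfile.create("m17n-example") do |file|
  file.binmode
  file.write("caf\xE9".b)   # "cafe" with e-acute, as ISO-8859-1 bytes
  file.flush

  labelled, transcoded = read_both_ways(file.path, "ISO-8859-1")
  p labelled.encoding, labelled.bytesize       # same bytes, ISO-8859-1 label
  p transcoded.encoding, transcoded.bytesize   # converted to UTF-8 on read
end
```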

I have suggested that we introduce some kind of
"encoding policy" that lets some things happen "automagically".
(see http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby/Paper.html,
Section 6). One such policy could be "whenever you might get an
exception due to an encoding mismatch, try to transcode (e.g., to
UTF-8)". Another could be "transcode all input to UTF-8 unless
there is a specific indication that another encoding is wanted".
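
As a rough idea of what the first policy would mean, here is a
hand-written sketch. Nothing like with_utf8_fallback exists in Ruby;
the name and the approach are purely hypothetical, and a real policy
would have to work inside the string operations themselves rather
than through a wrapper that the programmer calls explicitly.

```ruby
# Hypothetical sketch of the first policy; not an existing Ruby feature.
# Run the block; if it fails with an encoding mismatch, transcode all
# string arguments to UTF-8 and try once more.
def with_utf8_fallback(*strings)
  yield(*strings)
rescue Encoding::CompatibilityError
  # String#encode may itself raise (e.g. Encoding::UndefinedConversionError)
  # for characters that have no mapping to Unicode.
  yield(*strings.map { |s| s.encode("UTF-8") })
end

utf8   = "r\u00E9sum\u00E9"
latin1 = "caf\u00E9".encode("ISO-8859-1")

joined = with_utf8_fallback(utf8, latin1) { |a, b| a + " " + b }
p joined.encoding
```

Note that the fallback only helps if everything can actually be
converted, which is the requirement mentioned earlier for any
"Unicode inside" approach.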

The main problem with such an approach is that it's very difficult
to do this globally, because libraries may have very different
assumptions or restrictions, and Ruby doesn't have a 'per library'
concept.

My understanding is that similar problems can happen with class
extensions (two different libraries adding or changing methods
with the same name in the same class,..., or one library depending
on a change where another depends on having nothing changed,...),
and that some solution to this problem is one of the things that
Matz mentioned when talking about Ruby 2.0. If such a solution
does indeed materialize, I guess it wouldn't be too difficult to
also use it for dealing with "encoding policies".
But all this is currently just a vague feeling; none of it
exists in actual code.


Regards,   Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp     mailto:duerst / it.aoyama.ac.jp