On 17/09/2008, James Gray <james / grayproductions.net> wrote:
> On Sep 16, 2008, at 8:20 PM, Michael Selig wrote:
>
>
> > I have been pulling my hair out trying to convert a relatively simple app
> to support m17n under Ruby 1.9 to see what is involved. I need to support
> all common locales worldwide, and data can also be stored in UTF-8 or
> UTF-16. I was hoping that Ruby 1.9 was going to take the hard work out of
> this for me. It has to a certain extent, but UTF-16 is the problem - it
> breaks so many things, due to its "ASCII incompatibility" (using Ruby's
> definition). I can't even do simple things like pull out fields and
> substitute into another string without testing "encoding compatibility".
> Something as simple as:
> >
> >        puts "The value is #{val}"
> >
> > fails if val is UTF-16 data.
> >
>
>  I'm not sure I support the pull-them-out strategy, but I can confirm that
> supporting UTF-16 in CSV has eaten about a week of my time and counting.  I
> keep thinking I have it and finding new problems...

For your own program you could override String#+ to automagically
convert its arguments. I thought this would be good enough, but you
cannot do that for libraries - Ruby does not provide any way of
bolting on such a feature and hiding it from users of a library so
that they still get the standard behaviour.
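
For illustration, a minimal sketch of such an override, using the 1.9
Encoding.compatible? and String#encode APIs (the global patch is only
to illustrate the idea - and why it leaks into libraries):

    # Sketch only: monkey-patching String globally is exactly the
    # "cannot hide it from libraries" problem mentioned above.
    class String
      alias_method :plus_without_autoconvert, :+

      def +(other)
        if other.is_a?(String) && !Encoding.compatible?(self, other)
          # Transcode the argument into our encoding before concatenating.
          other = other.encode(self.encoding)
        end
        plus_without_autoconvert(other)
      end
    end

    utf16 = "abc".encode("UTF-16LE")
    p "value: " + utf16    # => "value: abc" (raises without the patch)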

Still, there are multiple ways of combining strings, and these could be
used to distinguish different kinds of encoding handling.

So my suggestion is to make
 - String#+ do the conversion if possible (it creates a new string, so
the result can have a different encoding)
 - String#<< only append compatible strings
 - I am not sure about string interpolation - it technically creates a
new string each time, so it could just convert, but this could get
complex if many strings are included in the interpolation (a rough
sketch of the proposed semantics follows).
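
Under that proposal the behaviour would differ roughly like this
(hypothetical semantics, not what Ruby 1.9 does today):

    a = "value: "                  # UTF-8
    b = "data".encode("UTF-16LE")

    a + b       # today: raises; proposed: transcode b, return UTF-8 string
    a << b      # today: raises; proposed: still raise (append wants compatible)
    "#{a}#{b}"  # today: raises; proposed: open question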

Note that even with automatic conversion there are cases where strings
cannot be converted to some superset, so somebody could break an
application that seems to work OK by supplying input in an exotic
encoding.
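
For example, with the actual 1.9 behaviour:

    s = "\u6f22\u5b57"        # CJK characters, not representable in Latin-1
    s.encode("ISO-8859-1")    # raises Encoding::UndefinedConversionError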

There are other string functions, though. It is unclear what
Object#inspect should do. It is generally used to show things to the
user. But should it convert the string to the user's locale, show it in
hex with the encoding information appended, or what?
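
The hex option could look something like this (hex_inspect is a made-up
helper, not a proposal for core):

    def hex_inspect(str)
      "#<String #{str.encoding.name}: #{str.unpack('H*').first}>"
    end

    hex_inspect("abc".encode("UTF-16LE"))
    # => "#<String UTF-16LE: 610062006300>"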

IO could be configurable to either do the necessary conversion or not. Like

STDOUT.autoconvert=true

then you could write any strings to stdout without problems (as long
as the stdout encoding is known and can handle all your strings).
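
As far as I can tell, 1.9 already gets close to this on the output
side: if an external encoding is set on the stream, strings are
transcoded to it on write. Something like:

    STDOUT.set_encoding("UTF-8")
    utf16 = "hello".encode("UTF-16LE")
    puts utf16   # transcoded to UTF-8 on output instead of written raw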

Also Array#join could perhaps accept a parameter that either specifies
the desired encoding of the result or specifies that the strings should
be converted so that they can actually be concatenated.
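
Array#join has no such parameter today, but a small helper shows the
intent (join_in is a made-up name):

    def join_in(strings, encoding, separator = "")
      strings.map { |s| s.encode(encoding) }.join(separator)
    end

    join_in(["a", "b".encode("UTF-16LE")], "UTF-8", ", ")   # => "a, b"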

Generally I can imagine the automatic conversion working like this
(either as part of the core or as an add-on):

1) each encoding has a list of compatible supersets

2) each encoding has a list of (incompatible) equivalents [optional] -
typical for legacy 8-bit encodings, which have several variants with
the characters reordered in different ways

3) each encoding has a list of incompatible (without conversion) supersets
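
These relations could be kept in a simple per-encoding table; the
entries below are illustrative only, not an authoritative list:

    ENCODING_RELATIONS = {
      "US-ASCII" => {
        :compatible_supersets   => ["UTF-8", "ISO-8859-1"],
        :equivalents            => [],
        :incompatible_supersets => ["UTF-16LE", "UTF-16BE"],
      },
      "ISO-8859-1" => {
        :compatible_supersets   => [],
        :equivalents            => ["Windows-1252"],  # near-equivalent, see below
        :incompatible_supersets => ["UTF-8", "UTF-16LE"],
      },
    }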

Then string operations could be performed this way:

1) an operation on two strings where one is a compatible superset of
the other is done without conversion, and the result has the encoding
of the superset. This is basically an extension of the ASCII-compatible
concept to other encodings that could have this feature.

If conversion is not allowed and 1) is not applicable (note that each
encoding is a compatible superset of itself), an exception is raised.

If conversion is allowed, the autoconversion could proceed as follows:

2) if the strings are encoded in incompatible but equivalent encodings,
convert one to the encoding of the other based on some order of
preference.

3) if there is the same incompatible superset for both strings (or a
superset of a superset, ...), convert both strings to this superset. If
multiple supersets are available, consult the order of preference.

If neither 2) nor 3) is applicable, raise an exception.
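
Putting 1)-3) together, the resolution could look roughly like this (a
sketch against the hypothetical ENCODING_RELATIONS table above):

    # Hypothetical lookups into ENCODING_RELATIONS:
    def relation(enc, key)
      ENCODING_RELATIONS.fetch(enc.name, {}).fetch(key, [])
    end

    def compatible_superset?(sup, sub)
      sup == sub || relation(sub, :compatible_supersets).include?(sup.name)
    end

    def equivalent?(a, b)
      relation(a, :equivalents).include?(b.name) ||
        relation(b, :equivalents).include?(a.name)
    end

    def common_superset(a, b)
      # ignoring superset-of-superset chains for brevity
      name = (relation(a, :incompatible_supersets) &
              relation(b, :incompatible_supersets)).first
      name && Encoding.find(name)
    end

    def combine_encodings(a, b, allow_conversion = false)
      ea, eb = a.encoding, b.encoding

      # Rule 1: no conversion when one side is a compatible superset
      # (every encoding is a compatible superset of itself).
      return [a, b] if compatible_superset?(ea, eb) ||
                       compatible_superset?(eb, ea)

      unless allow_conversion
        raise Encoding::CompatibilityError, "#{ea.name} vs #{eb.name}"
      end

      # Rule 2: incompatible but equivalent - convert one side
      # (order of preference omitted here).
      return [a, b.encode(ea)] if equivalent?(ea, eb)

      # Rule 3: common incompatible superset - convert both sides.
      if (common = common_superset(ea, eb))
        return [a.encode(common), b.encode(common)]
      end

      raise Encoding::CompatibilityError,
            "no common superset for #{ea.name} and #{eb.name}"
    end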

I am not sure that 2) would ever apply. Some ISO encodings should be
generally equivalent to some DOS or Windows codepages, but there might
be one or two different characters that make the encodings
non-equivalent. Perhaps the strings could be checked for these
characters, but then just converting to a superset might be easier.
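
The classic case is ISO-8859-1 vs Windows-1252, which differ in the
0x80-0x9F range:

    "\x80".force_encoding("Windows-1252").encode("UTF-8")  # => "\u20AC" (euro sign)
    "\x80".force_encoding("ISO-8859-1").encode("UTF-8")    # => "\u0080" (a C1 control)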

Thanks

Michal