On 17/09/2008, NARUSE, Yui <naruse / airemix.jp> wrote:
> Hi,
>
>  Michal Suchanek wrote:
>
> > For your own program you could override String.+ to automagically
> > convert its parameters. I thought this was good enough, but you cannot
> > do that for libraries - ruby does not provide any way of bolting on
> > such a feature and hiding it from users of the library so that they get
> > the standard behaviour.
> >
> > Still there are multiple ways of combining strings, and these could be
> > used to distinguish different encoding handling.
> >
> > So my suggestion is to make
> >  - String.+ do the conversion if possible (it creates a new string so
> > it can be different)
> >
>
>  The problem is not "can convert" or "cannot convert".
>  Different mappings and information loss in conversion are the true problem.
>  They can't be avoided, so we can't use automatic conversion.

Yes, even among the "common www encodings" some Japanese encodings
cannot be converted safely because of the Yen sign vs. backslash
confusion. And I am sure there are other problems.

>
>
> > Note that even with automatic conversion you get cases when strings
> > cannot be converted to some superset so somebody could break your
> > application that seems to work OK by supplying input in an exotic
> > encoding.
> >
>
>  The definition of a superset is a difficult problem.
>
>
..
> > STDOUT.autoconvert=true
> >
>
>  this seems non-thread-safe.

Using a single IO from multiple threads is already unsafe, so this API
does not introduce any new problem. It is also similar to the other IO
properties that can already be set in a non-thread-safe way.
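For illustration, this is no different from how existing per-IO settings
are handled today; note that autoconvert= is only the name proposed above,
not an existing method, and the mutex just shows what sharing one IO
between threads already requires anyway:

  # Existing per-IO properties are set without any locking:
  STDOUT.sync = true             # nobody expects this assignment to be thread-safe

  # The proposed switch would behave the same way (hypothetical API):
  # STDOUT.autoconvert = true    # convert strings to the IO's external encoding

  # A program sharing one IO between threads has to serialize access anyway:
  stdout_lock = Mutex.new
  Thread.new do
    stdout_lock.synchronize { STDOUT.puts "from a worker thread" }
  end.join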

>
>
> > Generally I can imagine the automatic conversion working like this
> > (either as part of core or as an addon):
> >
> > 1) each encoding has a list of compatible supersets
> >
>
>  Define "compatible" is this problem.
>  And what is "incompatible"?

Compatible here means that 7-bit ASCII is a compatible subset of utf-8 or
any (most?) of the iso-8859-x encodings: you can join the strings
without any conversion. Similarly, BCDIC could be considered a
compatible subset of the EBCDIC codepages if those are ever
implemented.
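To make that concrete, Ruby 1.9 already implements exactly this check for
the ASCII-compatible case, so a minimal sketch only needs the built-in API:

  # encoding: utf-8

  ascii = "plain ascii".force_encoding("US-ASCII")
  utf8  = "häkkinen"                     # UTF-8 source text

  # Compatible superset: joined without any conversion, and the result
  # simply takes the encoding of the superset.
  p Encoding.compatible?(ascii, utf8)    #=> #<Encoding:UTF-8>
  p (ascii + utf8).encoding              #=> #<Encoding:UTF-8>

  latin2 = "žluťoučký".encode("ISO-8859-2")
  # No compatible superset relation between these two:
  p Encoding.compatible?(utf8, latin2)   #=> nil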

>
>
> > 2) each encoding has a list of (incompatible) equivalents [optional] -
> > typical for legacy 8bit encodings which have several variants with the
> > characters reordered in different ways
> >
>
>  Such extension sometimes breaks compatibility.
>  For example, U+9AD8 is assigned to 0x3962 in Shift_JIS.
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9ad8
>
>  This character has a variant, which has the codepoint U+9AD9 in Unicode.
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9ad9
>  But the two are unified in Shift_JIS (JIS X 0208).
>
>  So 0x3962 covers both U+9AD8 and U+9AD9, but once it is converted to Unicode,
>  it can only be U+9AD8.
>
>  Moreover, Windows Code Page 932 includes U+9AD9...
>
>  So this is not easy.

If there are characters that are distinct in one encoding but mapped
to a single codepoint in another encoding, the encodings are not
equivalent. Strings in those encodings could still be considered
equivalent as long as they do not contain such characters, but it is
questionable whether scanning the string is desirable. On the other
hand, the conversion would process the whole string anyway, so it could
be attempted in such cases and aborted if such a character is encountered.
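A sketch of that, assuming a hand-maintained table of the codepoints a
given conversion unifies (the table below holds only an assumed Shift_JIS
code standing for the U+9AD8/U+9AD9 example and is purely illustrative):

  # Codepoints whose mapping to the target is defined but loses a distinction.
  UNIFIED = {
    ["Shift_JIS", "UTF-8"] => [0x8D82],  # assumed code of the unified character
  }

  def convert_or_abort(str, target)
    suspicious = UNIFIED[[str.encoding.name, target]] || []
    str.each_codepoint do |cp|
      if suspicious.include?(cp)
        raise EncodingError, "codepoint %#x is unified in #{target}, not converting" % cp
      end
    end
    str.encode(target)   # the conversion walks the whole string anyway
  end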

>
> > 3) each encoding has a list of incompatible (without conversion) supersets
> >
> > Then string operations could be performed this way:
> >
> > 1) an operation on two strings where one is compatible superset of the
> > other is done without conversion, and the result has encoding of the
> > superset. This is basically the extension of the ASCII-compatible
> > concept to other encodings that could have this feature.
> >
>
>  The problem is we don't know which encodings are compatible in the way ASCII is.

That can always be determined, but only by looking at the codepoint
tables, the same way the ASCII-compatible encodings were defined.
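In other words, the list from point 1) would be written down by hand, just
like the ASCII-compatible flag is today. A minimal illustrative table (the
entries are only an example, not a proposal for the actual contents):

  # Hand-maintained, derived once from the codepoint tables.
  COMPATIBLE_SUPERSETS = {
    "US-ASCII"   => ["UTF-8", "ISO-8859-1", "ISO-8859-2", "Windows-1252"],
    "ISO-8859-1" => [],   # left empty in this example
  }

  # Returns the string whose encoding the result should take, or nil to fall
  # through to the converting cases 2) and 3).
  def compatible_superset(a, b)
    return b if (COMPATIBLE_SUPERSETS[a.encoding.name] || []).include?(b.encoding.name)
    return a if (COMPATIBLE_SUPERSETS[b.encoding.name] || []).include?(a.encoding.name)
    nil
  end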

>
>
> > If conversion is allowed the autoconversion could follow:
> >
>
>  implement "switch to allow the autoconversion" seems difficult... anyway

I did not mean to implement a switch - I wanted to define converting
and non-converting operations. However, for non-string objects that
use strings such a switch would indeed be needed.
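So instead of a global flag there would be two kinds of operations,
roughly like this sketch (concat_converting is a made-up name, and the
fallback to Encoding.default_internal || "UTF-8" just stands in for
whatever the target selection rules end up being):

  class String
    # Non-converting: today's String#+, which raises
    # Encoding::CompatibilityError when the encodings are incompatible.

    # Converting variant: picks a common encoding and re-encodes as needed.
    def concat_converting(other)
      target = Encoding.compatible?(self, other) ||
               Encoding.default_internal || "UTF-8"
      encode(target) + other.encode(target)
    end
  end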

>
>
> > 2) if the strings are encoded in incompatible but equivalent encodings
> > convert one to the encoding of the other based on some order of
> > preference.
> >
>
>  This means,
>  when charset C includes A and B, string in A + string in B => string in C ?
>  When that conversion doesn't lose any information, this is reasonable.

Here I wanted to distinguish two cases, but they are in fact pretty
much the same:
 - conversion into an encoding that has the same number of codepoints,
just reordered
 - conversion into an encoding with a larger number of codepoints

This should probably be handled by encoding preference.

When strings in iso-8859-x and the corresponding windows codepage are
to be added, and the windows codepage is preferred over Unicode
encodings and iso-8859 encodings, the codepage should be used for the
result. On the other hand, if utf-8 is preferred, utf-8 should be used.

I am not sure how that preference would be set, though.

You could set a general preference at program start, but setting a
preference for each operation would make the system complicated. For a
single operation, though, the preference can be enforced by converting
the operands manually.
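For example, something like this sketch, where ENCODING_PREFERENCE is an
assumed program-wide setting and the last line shows the manual
per-operation override:

  # encoding: utf-8

  # Assumed program-wide preference order, set once at startup.
  ENCODING_PREFERENCE = ["Windows-1252", "UTF-8", "ISO-8859-2"]

  a = "žluťoučký".encode("ISO-8859-2")
  b = "häkkinen"                         # UTF-8

  # Pick the first preferred encoding both operands can be converted to.
  target = ENCODING_PREFERENCE.find do |name|
    begin
      a.encode(name); b.encode(name); true
    rescue Encoding::UndefinedConversionError
      false
    end
  end

  p target                               #=> "UTF-8" (U+0165 is not in Windows-1252)
  sum = a.encode(target) + b.encode(target)

  # Per-operation override: convert the operands manually.
  forced = a.encode("ISO-8859-2") + b.encode("ISO-8859-2")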

>
>
> > 3) if there is the same incompatible superset for both strings (or
> > superset of superset ..) convert both strings to this superset. If
> > multiple supersets are available consult order of preference.
> >
>
>  What does "incompatible superset" mean?

It means that the string would have to be converted to be
represented in the "superset encoding". However, the conversion should
be unambiguous.
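For example (a sketch, treating UTF-8 as the only listed incompatible
superset of both ISO-8859-1 and ISO-8859-2):

  # encoding: utf-8

  # UTF-8 covers every character of both operands, but neither operand's
  # bytes are valid UTF-8 as they are, so both sides have to be re-encoded,
  # hence "incompatible" superset. Each of the two mappings is unambiguous.
  west = "café".encode("ISO-8859-1")
  east = "žluťoučký".encode("ISO-8859-2")

  sum = west.encode("UTF-8") + east.encode("UTF-8")
  p sum.encoding   #=> #<Encoding:UTF-8>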

>
>
> > I am not sure that 2) would ever apply. Some iso encodings should be
> > generally equivalent to some dos or windows codepages but there might
> > be one or two different characters that make the encodings
> > non-equivalent. Perhaps the strings could be checked for these
> > characters but then just converting to a superset might be easier.
> >
>
>  Theoretically, yes.
>  But practical encodings seem dirty.
>

Thanks

Michal