Hi,

Michal Suchanek wrote:
> For your own program you could override String.+ to automagically
> convert its parameters. I thought this is good enough but you cannot
> do that for libraries - ruby does not provide any way of bolting on
> such feature and hiding it from users of the library so that they get
> the standard behaviour.
> 
> Still there are multiple ways of combining strings, and these could be
> used to distinguish different encoding handling.
> 
> So my suggestion is to make
>  - String.+ do the conversion if possible (it creates a new string so
> it can be different)

The problem is not "can convert" versus "cannot convert".
The real problems are differing mappings and the information lost in conversion.
Those losses cannot be avoided, so we cannot use automatic conversion.
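To illustrate both points in current Ruby (a sketch, not a proposal): String#+ refuses incompatible operands, and even an "allowed" conversion can silently lose information when the target has no mapping for a character.

```ruby
# Concatenating strings in incompatible encodings raises today:
latin = "caf\u00E9".encode("ISO-8859-1")  # "café" in ISO-8859-1
utf8  = "日本語"                            # UTF-8

begin
  latin + utf8
rescue Encoding::CompatibilityError => e
  puts e.class  # incompatible character encodings
end

# And a conversion that "works" can still lose information:
# U+2460 (CIRCLED DIGIT ONE) has no mapping in Shift_JIS.
lossy = "\u2460".encode("Shift_JIS", undef: :replace)
puts lossy == "?".encode("Shift_JIS")  # the original character is gone
```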

> Note that even with automatic conversion you get cases when strings
> cannot be converted to some superset so somebody could break your
> application that seems to work OK by supplying input in an exotic
> encoding.

The definition of "superset" is itself a difficult problem.

> There are other string functions, though. It is unclear what
> Object.inspect should do. It is generally used to show stuff to the
> user. But should it convert the string to the user locale, show it in
> hex with locale information appended, or what?

In a previous conversation we concluded that Object#inspect should depend on the locale.

> STDOUT.autoconvert=true

This seems non-thread-safe.
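To make the objection concrete, here is a hypothetical sketch (autoconvert is not a real IO method; the wrapper and its names are made up): a single flag on a shared stream is visible to every thread, so one thread flipping it changes the behavior of all the others.

```ruby
# Hypothetical wrapper showing why a shared STDOUT.autoconvert= flag is racy.
class ConvertingIO
  attr_accessor :autoconvert   # one flag, shared by every thread

  def initialize(io, target = Encoding::UTF_8)
    @io = io
    @target = target
    @autoconvert = false
  end

  def write(str)
    str = str.encode(@target) if @autoconvert  # reads the shared flag
    @io.write(str)
  end
end

out = ConvertingIO.new($stdout)
# Thread A turns conversion on for its own output...
Thread.new { out.autoconvert = true }.join
# ...but any other thread writing to the same stream now gets it too.
out.write("converted for everyone\n")
```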

> Generally I can imagine the automatic conversion working like this
> (either as part of core or as an addon):
> 
> 1) each encoding has a list of compatible supersets

Defining "compatible" is exactly the problem here.
And what does "incompatible" mean?

> 2) each encoding has a list of (incompatible) equivalents [optional] -
> typical for legacy 8bit encodings which have several variants with the
> characters reordered in different ways

Such an extension sometimes breaks compatibility.
For example, U+9AD8 is assigned to 0x3962 in Shift_JIS.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9ad8

This character has a variant form, whose code point in Unicode is U+9AD9.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9ad9
But the two are unified in Shift_JIS (JIS X 0208).

So 0x3962 covers both U+9AD8 and U+9AD9, but once it is converted to Unicode,
it can only become U+9AD8.

Moreover, Windows Code Page 932 includes U+9AD9...

So this is not easy.
# ISO-8859-X may be easier
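The asymmetry above can be observed directly with Ruby's built-in transcoders (a sketch; it assumes Ruby's standard Shift_JIS and Windows-31J conversion tables):

```ruby
# 高 (U+9AD8) is in JIS X 0208, so it survives a Shift_JIS round trip:
takai = "\u9AD8".encode("Shift_JIS")
puts takai.encode("UTF-8") == "\u9AD8"

# Its variant 髙 (U+9AD9) is unified away in JIS X 0208 --
# plain Shift_JIS cannot represent it at all:
begin
  "\u9AD9".encode("Shift_JIS")
rescue Encoding::UndefinedConversionError => e
  puts e.class
end

# ...but Windows Code Page 932 (Windows-31J) does include it, so the
# "superset" relations differ even between these closely related encodings.
cp932 = "\u9AD9".encode("Windows-31J")
puts cp932.encode("UTF-8") == "\u9AD9"
```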

> 3) each encoding has a list of incompatible (without conversion) supersets
> 
> Then string operations could be performed this way:
> 
> 1) an operation on two strings where one is compatible superset of the
> other is done without conversion, and the result has encoding of the
> superset. This is basically the extension of the ASCII-compatible
> concept to other encodings that could have this feature.

The problem is that we don't know which encodings are compatible in the way ASCII-compatible ones are.
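For reference, Ruby already exposes its current notion of compatibility through Encoding.compatible?, which returns the encoding the result would have, or nil when the operation would raise:

```ruby
ascii = "abc".force_encoding("US-ASCII")
utf8  = "日本語"
latin = "caf\u00E9".encode("ISO-8859-1")

# ASCII-only data is compatible with any ASCII-compatible encoding:
p Encoding.compatible?(ascii, utf8)   # => #<Encoding:UTF-8>

# Non-ASCII data in differing encodings is not; no conversion is attempted:
p Encoding.compatible?(latin, utf8)   # => nil
```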

> If conversion is allowed the autoconversion could follow:

Implementing a "switch to allow the autoconversion" seems difficult... anyway:

> 2) if the strings ere encoded in incompatible but equivalent encodings
> convert one to the encoding of the other based on some order of
> preference.

This means:
when charset C includes both A and B, a string in A + a string in B => a string in C?
When neither conversion loses any information, this is reasonable.
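As a sketch of that rule (the superset table below is made up for illustration; building a correct one is exactly the hard part discussed above):

```ruby
# Hypothetical "string in A + string in B => string in C" promotion.
SUPERSETS = {
  Encoding::US_ASCII   => [Encoding::UTF_8, Encoding::ISO_8859_1],
  Encoding::ISO_8859_1 => [Encoding::UTF_8],
  Encoding::SHIFT_JIS  => [Encoding::UTF_8],  # lossy in theory, see above
}

def promote_concat(a, b)
  return a + b if Encoding.compatible?(a, b)   # fast path, no conversion
  common = (SUPERSETS.fetch(a.encoding, []) | [a.encoding]) &
           (SUPERSETS.fetch(b.encoding, []) | [b.encoding])
  raise Encoding::CompatibilityError, "no common superset" if common.empty?
  target = common.first                        # "order of preference"
  a.encode(target) + b.encode(target)
end

latin = "caf\u00E9".encode("ISO-8859-1")
puts promote_concat(latin, "日本語")            # both promoted to UTF-8
```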

> 3) if there is the same incompatible superset for both strings (or
> superset of superset ..) convert both strings to this superset. If
> multiple supersets are available consult order of preference.

What does "incompatible superset" mean?

> I am not sure that 2) would ever apply. Some iso encodings should be
> generally equivalent to some dos or windows codepages but there might
> be one or two different characters that make the encodings
> non-equivalent. Perhaps the strings could be checked for these
> characters but then just converting to a superset might be easier.

Theoretically, yes.
But real-world encodings are messy.

-- 
NARUSE, Yui  <naruse / airemix.jp>