On 6/19/06, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" <dmitry.severin / gmail.com> writes:
>
> |But I can see several implementation issues and possible options that
> |should be considered:
>
> Thank you for the ideas.
>
> |- what will happen if one tries to perform str1.operation(str2) on two
> |strings with different encodings:
> |  a) raise exception
> |  b) silently coerce one or both strings to some "compatible"
> |charset/encoding, update encoding of result, replacing non-convertible chars
> |using fallback mappings? (ouch, this can be split into a set of options)
> |  c) same as b) but raise exception if lossless conversion is not possible?
> |  d) same as b) but warn if lossless conversion is not possible?
> |  e) downgrade encoding tag of acceptor to "raw/bytes" and process it?
>
> a), unless one of the strings is "ascii" and the other is "ascii"
> compatible.  This point is arguable.

What is "ascii"? Specifically, I would like string operations to succeed
in cases where both strings are encoded as different subsets of Unicode
(or anything else), i.e. concatenating an ISO-8859-2 and an ISO-8859-1
string should result in a UTF-* string, not an error.

However, this would make the errors from incompatible encodings more
surprising as they would be very infrequent.
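
To make the two behaviours concrete, here is a rough sketch. It assumes
the String#encoding / String#encode API that later Ruby releases (1.9 and
up) ship, which is not something available at the time of this thread.
concat_via_utf8 is a hypothetical helper made up for illustration, not a
core method:

```ruby
# Hypothetical helper (not a core method): transcode both operands to
# UTF-8 before concatenating, instead of raising on a tag mismatch.
def concat_via_utf8(a, b)
  a.encode(Encoding::UTF_8) + b.encode(Encoding::UTF_8)
end

# One non-ASCII byte in each legacy encoding.
latin2 = [0xB3].pack("C").force_encoding(Encoding::ISO_8859_2) # U+0142
latin1 = [0xE9].pack("C").force_encoding(Encoding::ISO_8859_1) # U+00E9

# Option a): combining two non-ASCII strings with different tags raises.
direct_error =
  begin
    latin2 + latin1
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end

# The "ascii" exception: an ASCII-only string combines with either one
# and the result keeps the tag of the non-ASCII operand.
ascii_mix = "abc".encode(Encoding::US_ASCII) + latin1

# The alternative asked for above: promote both sides to UTF-8.
merged = concat_via_utf8(latin2, latin1)
```

With the helper the mismatch never surfaces, which is exactly why the
remaining error cases would be rare and surprising.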

I wonder what operations on raw strings (ones without specified
encoding) would do. Or where one of the strings is raw, and the other
is not.
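
For comparison only, and again assuming the API of later Ruby releases
(1.9 and up) rather than anything decided in this thread: there a "raw"
string is one tagged as binary (ASCII-8BIT), and whether it combines with
a tagged string depends on whether non-ASCII bytes are involved. A small
sketch, with all variable names chosen just for this example:

```ruby
raw_high  = [0xFF].pack("C") # binary (ASCII-8BIT) string, one non-ASCII byte
raw_ascii = "abc".b          # binary string holding only ASCII bytes
text      = "\u00E9"         # UTF-8 string holding U+00E9

# ASCII-only raw data takes on the tag of the text it is combined with.
ascii_raw_result = raw_ascii + text

# Raw data with high bytes plus ASCII-only text stays raw.
raw_result = raw_high + "abc"

# Raw data with high bytes plus non-ASCII text is rejected.
mixed_error =
  begin
    raw_high + text
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
```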


>
> |- what to do with IO:
> |  a) IO will return strings in "raw/bytes"?
> |  b) IO can be tagged and will return Strings with given encoding tag?
> |  c) IO can be tagged and is by default tagged with global encoding tag?
> |  d) IO can be tagged, but is not tagged by default, although methods
> |returning strings (such as read, readlines) will use global encoding tag?
> |  e) if IO is tagged and one tries to write to it a String with different
> |encoding, what will happen?
>
> c), the global default shall be set from locale setting.
>

I am not sure this is good for network IO too. For diagnostics it
might be useful to set the default to none, and have such untagged
strings raise an exception when they are combined with other strings.

It is only obvious for STDIN and STDOUT that they should follow the
locale setting.
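
A sketch of what raw versus tagged IO can look like, assuming the
File.binread / File.read(encoding:) calls from later Ruby releases (1.9
and up), where Encoding.default_external plays the role of the global
default. The temporary file and variable names are only for this example:

```ruby
require "tempfile"

raw, tagged, utf8 = Tempfile.create("encoding-demo") do |f|
  f.binmode
  f.write([0xB3].pack("C")) # a single byte, U+0142 when read as ISO-8859-2
  f.flush

  [
    File.binread(f.path),                           # option a): raw bytes
    File.read(f.path, encoding: "ISO-8859-2"),      # option b): tagged IO
    File.read(f.path, encoding: "ISO-8859-2:UTF-8") # tagged and transcoded
  ]
end
```

The same bytes come back with three different tags, so which one IO picks
by default decides how often later string operations fail.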

Hmm, but one would need to consider carefully which operations should
work on raw strings and which should not. Perhaps it is not as nice as
it looks at first glance.

Thanks

Michal