On 6/20/06, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" <hramrach / centrum.cz> writes:
>
> |No, I meant that the strings are, of course, converted to a common
> |encoding such as utf-8 before they are concatenated.
> |The point is that you do not have to care in which encoding you
> |obtained the pieces and convert them manually to a common encoding if
> |the string class can do it automatically for you.
>
> If you choose to convert all input text data into Unicode (and convert
> them back at output), there's no need for unreliable automatic
> conversion.

Well, it's actually you who chose the conversion on input for me.
Since the strings aren't automatically converted, I have to ensure that
all my strings always use the same encoding. And the only
reasonable way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.

This is no more reliable than automatic conversion. The
(un)reliability of the conversion depends on the (un)reliability with
which the actual encoding of the string is determined when it is
obtained. If the encoding tag is wrong, the string will be converted
incorrectly. That is the only cause of incorrect conversion, whether it
happens manually or automatically.
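
To make this concrete, here is a minimal sketch. It assumes a Ruby in
which every string carries an encoding tag (String#force_encoding and
String#encode, as in Ruby 1.9 and later); the variable names are only
for illustration. The same two bytes convert correctly or incorrectly
depending only on the tag, not on who triggers the conversion.

```ruby
# Sketch only: assumes strings that carry an encoding tag
# (String#force_encoding and String#encode, Ruby 1.9 and later).

# The two bytes C3 A9 are the UTF-8 encoding of "é".
raw = "\u00E9".b

# Correct tag: the bytes are read as UTF-8 and survive the conversion.
good = raw.dup.force_encoding(Encoding::UTF_8).encode(Encoding::UTF_8)

# Wrong tag: the same bytes are claimed to be ISO-8859-1, so each byte
# is treated as a separate character and the result is garbled.
bad = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts good   # é
puts bad    # Ã©
```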

If conversion were done automatically by the string class, it could be
performed lazily. The strings would be kept in the encoding in which
they were obtained, and only converted when needed because they are
combined with a string in a different encoding. And users of the
strings would still have the choice to convert them explicitly when
they see fit.
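
Roughly what I have in mind, as a sketch only. It assumes strings that
carry an encoding tag and a String#encode method (Ruby 1.9 and later).
lazy_concat is a hypothetical helper of my own, not an existing String
method, and UTF-8 as the common encoding is just an assumption.

```ruby
# Hypothetical helper, not part of Ruby's String class.
def lazy_concat(a, b, common = Encoding::UTF_8)
  # Same encoding: nothing to convert, keep the strings as obtained.
  return a + b if a.encoding == b.encoding

  # Different encodings: convert both to the common one, then combine.
  a.encode(common) + b.encode(common)
end

latin1 = "caf\u00E9".encode(Encoding::ISO_8859_1)
utf8   = " cr\u00E8me"   # \u escapes make this literal UTF-8

same  = lazy_concat(latin1, latin1)  # stays ISO-8859-1, no conversion
mixed = lazy_concat(latin1, utf8)    # converted to UTF-8 on demand
```

A string class doing this internally would spare the caller from
knowing which of the two branches was taken.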

When such automatic conversion is not available, interfacing with
libraries that fetch external data becomes more difficult.

a) I could instruct the library that fetches data from a database or
the web to always return it in the encoding I chose for
representing strings in my application, regardless of the encoding
the data was originally obtained in.
The disadvantage is that if the encoding was determined incorrectly on
input to the library, the data is already garbled.

b) I could get the data from the library in the original encoding in
which it was obtained, either because I would like to check that the
encoding is correct before converting the data or because the library
does not implement the interface for (a).
The disadvantage is that I have to traverse a potentially complex data
structure and convert all strings so that they work with the other
strings inside my application.

c) Every time I perform a string operation I could first check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and retry).
I do not think this is a reasonable option for the common case, in
which all the strings can be represented in Unicode and which should
be made as simple as possible. This may be necessary to some extent in
applications dealing with encodings that are incompatible with Unicode,
but it should not be required for the common case.
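
To show how much bookkeeping (b) and (c) push onto the caller, here is
a minimal sketch. It assumes strings that carry an encoding tag and
that combining incompatible strings raises
Encoding::CompatibilityError (as in Ruby 1.9 and later). deep_encode
and concat_with_retry are hypothetical helper names, and the row hash
stands in for whatever a real library would return.

```ruby
# Option (b): walk a nested structure and convert every string.
# deep_encode is a hypothetical helper name.
def deep_encode(obj, target = Encoding::UTF_8)
  case obj
  when String then obj.encode(target)
  when Array  then obj.map { |v| deep_encode(v, target) }
  when Hash
    obj.each_with_object({}) do |(k, v), out|
      out[deep_encode(k, target)] = deep_encode(v, target)
    end
  else obj   # numbers, nil, etc. pass through unchanged
  end
end

# Option (c): try the operation, then convert and retry on failure.
# concat_with_retry is a hypothetical helper name.
def concat_with_retry(a, b, common = Encoding::UTF_8)
  a + b
rescue Encoding::CompatibilityError
  a.encode(common) + b.encode(common)
end

# Stand-in for data returned by a library in its original encoding.
row = { "name" => "Ren\u00E9".encode(Encoding::ISO_8859_1), "age" => 42 }

clean    = deep_encode(row)
greeting = concat_with_retry("Gr\u00FC\u00DFe, ", row["name"])
```

Either helper has to be written, and remembered, by every application
that talks to such a library.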

People with experience in other languages complain that they have to
do (b) or (c) because (a) is usually not implemented. And ensuring any
of the three looks like an additional problem that could be solved
elsewhere - in the string class.

Thanks

Michal