On 6/20/06, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Tue, 20 Jun 2006 23:33:43 +0900, "Michal Suchanek" <hramrach / centrum.cz> writes:
>
> |No, I meant that the strings are, of course, converted to a common
> |encoding such as utf-8 before they are concatenated.
> |The point is that you do not have to care in which encoding you
> |obtained the pieces and convert them manually to a common encoding if
> |the string class can do it automatically for you.
>
> If you choose to convert all input text data into Unicode (and convert
> them back at output), there's no need for unreliable automatic
> conversion.

Well, it's actually you who chose the conversion on input for me.
Since the strings aren't converted automatically, I have to ensure
that all my strings are always in the same encoding. And the only
reasonable way I can think of is to convert any string that enters my
application (or class) to an arbitrary encoding I choose in advance.

This is no more reliable than automatic conversion. The
(un)reliability of the conversion depends on the (un)reliability with
which the actual encoding of the string is determined when it is
obtained. If the encoding tag is wrong, the string will be converted
incorrectly. That is the only cause of incorrect conversion, whether
it happens manually or automatically.

If conversion were done automatically by the string class, it could be
performed lazily. The strings are kept in the encoding in which they
were obtained, and only converted when needed because they are
combined with a string in a different encoding. And users of the
strings still have the choice to convert them explicitly when they see
fit.

When such automatic conversion is not available, it makes interfacing
with libraries that fetch external data more difficult.

a) I could instruct the library that fetches data from a database or
the web to always return it in the encoding I chose for representing
strings in my application, regardless of the encoding the data was
originally obtained in. The disadvantage is that if the encoding was
determined incorrectly on input to the library, the data is already
garbled.

b) I could get the data from the library in the original encoding in
which it was obtained, either because I would like to check that the
encoding is correct before converting the data, or because the library
does not implement the interface for (a). The disadvantage is that I
have to traverse a potentially complex data structure and convert all
strings so that they work with the other strings inside my
application.

c) Every time I perform a string operation I would first have to check
(manually) that the two strings are compatible (or catch the exception
very near the operation so that I can convert the arguments and
retry). I do not think this is a reasonable option for the common
case, which should be made as simple as possible: the case where the
strings can be represented in Unicode. This may be necessary to some
extent in applications dealing with encodings that are incompatible
with Unicode, but it should not be required for the common case.

People with experience from other languages complain that they have to
do (b) or (c) because (a) is usually not implemented. And ensuring any
of the three does look like an additional problem that could be solved
elsewhere - in the string class.

Thanks

Michal
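
To make the lazy conversion and option (b) above more concrete, here is
a minimal sketch. It assumes a Ruby in which every string carries an
encoding tag (String#encoding, String#encode and Encoding.compatible?,
as available in Ruby 1.9 and later). The helper names lazy_concat and
deep_encode are made up for illustration and are not part of the string
class or of any library.

```ruby
# Sketch only: lazy_concat and deep_encode are hypothetical helpers,
# not methods of Ruby's String class or of any library.

# Concatenate two strings, converting only when they cannot be
# combined as they are.
def lazy_concat(a, b, common = Encoding::UTF_8)
  # Encoding.compatible? returns the resulting encoding when the two
  # strings can be combined directly, and nil when they cannot.
  return a + b if Encoding.compatible?(a, b)

  # The conversion trusts each string's encoding tag; a wrong tag gives
  # garbled text here just as it would with a manual conversion.
  a.encode(common) + b.encode(common)
end

# Option (b): walk a nested structure returned by a library and convert
# every string in it to the application's encoding.
def deep_encode(obj, target = Encoding::UTF_8)
  case obj
  when String then obj.encode(target)
  when Array  then obj.map { |item| deep_encode(item, target) }
  when Hash
    obj.each_with_object({}) do |(key, value), result|
      result[deep_encode(key, target)] = deep_encode(value, target)
    end
  else
    obj
  end
end

utf8  = "na\u00EFve "                             # tagged UTF-8
latin = "caf\u00E9".encode(Encoding::ISO_8859_1)  # same kind of text, tagged ISO-8859-1

joined = lazy_concat(utf8, latin)
record = deep_encode({ "name" => latin, "tags" => [latin, 42] })
```

With plain `utf8 + latin` the two strings above are incompatible and the
operation raises Encoding::CompatibilityError, which is the situation in
(c). The first helper hides that check inside the operation, while the
second shows the traversal that (b) pushes onto every caller.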