At 07:14 08/10/31, Michael Selig wrote:
>Hi,
>
>Feature #695 was closed & marked done, but unfortunately it does not seem  
>to have been implemented :-(

I think it should have been marked part done, part rejected.

>The request was:
>
>> When combining 2 strings, with one being ASCII-8BIT, and the other is  
>> encoding "E":
>> 1) If the ASCII-8BIT string is valid if forced to encoding E, then treat  
>> the ASCII-8BIT string as being in encoding E;
>> 2) Otherwise treat both strings as ASCII-8BIT.
>>
>> Part (2) is less important, and can probably be omitted if it is hard to  
>> implement.

In my understanding, this would be a rather strong departure
from the current Ruby multilingual architecture, and not necessarily
a desirable one. It would be much more appropriate to start with
automatic conversion between labeled real encodings than to introduce
some conversion between arbitrary bytes and characters.
This distinction is already present in Ruby: you have to use
String#force_encoding in the above case, but String#encode
for actual conversion.
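
To make the distinction concrete, here is a small sketch (not taken from
the Ruby sources). It assumes a Ruby that has String#b, which is newer
than the 1.9.0 build quoted in this thread, and it picks UTF-16BE as the
conversion target only for illustration.

```ruby
# Raw bytes, tagged ASCII-8BIT (BINARY), as in the report.
bytes = "abc\xD8\xB5".b

# force_encoding changes only the label; no byte is touched.
labeled = bytes.dup.force_encoding(Encoding::UTF_8)
same_bytes  = labeled.bytes == bytes.bytes
valid_label = labeled.valid_encoding?

# encode converts: the characters are kept, the bytes change.
utf16 = labeled.encode(Encoding::UTF_16BE)
bytes_changed = utf16.bytes != labeled.bytes
round_trip_ok = utf16.encode(Encoding::UTF_8) == labeled

puts labeled.encoding, utf16.encoding
```

The first step is a claim by the programmer about what the bytes mean;
only the second step is a conversion between real encodings.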

While things might 'just work' in some cases, treating arbitrary
ASCII-8BIT data as a specific encoding whenever the byte pattern
happens to be valid can result in many garbage-in-garbage-out cases.
Some encodings are more restrictive (e.g. UTF-8), but others, in
particular all single-byte encodings, have no restrictions at all.
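
A rough illustration of how little a validity check says, assuming a
Ruby with String#b; valid_as? is a helper made up for this mail, and the
PNG file signature merely stands in for "bytes that are not text":

```ruby
# Bytes that are not text at all: the 8-byte PNG file signature.
noise = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper: would these bytes pass the check proposed in the
# request for the given encoding?
def valid_as?(bytes, encoding)
  bytes.dup.force_encoding(encoding).valid_encoding?
end

utf8_ok   = valid_as?(noise, Encoding::UTF_8)        # restrictive encoding
ascii_ok  = valid_as?(noise, Encoding::US_ASCII)     # 7-bit only
latin1_ok = valid_as?(noise, Encoding::ISO_8859_1)   # single-byte encoding

puts "UTF-8: #{utf8_ok}, US-ASCII: #{ascii_ok}, ISO-8859-1: #{latin1_ok}"
```

UTF-8 rejects the data, but ISO-8859-1 accepts it, so under the proposed
rule binary data would silently turn into Latin-1 'text'.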

I don't think it is by chance that most programming languages I
know, even if they have a somewhat different internationalization
model, more focused on Unicode than Ruby, make a clear distinction
between characters and bytes. It also isn't by chance that one
of the first things people have to learn when they learn about
internationalization is "bytes are not characters".

The above change would also be very difficult and tedious to
implement in Ruby currently. I was looking into this just a little
bit to see how easy it would be to implement automatic conversions
between actual character sets.

>However:
>
>ruby -Kn -ve 'p "abc\xD8\xB5" + "abc\u0635"'
>ruby 1.9.0 (2008-10-30 revision 20062) [i686-linux]
>-e:1:in `<main>': incompatible character encodings: ASCII-8BIT and UTF-8  
>(Encoding::CompatibilityError)
>
>(The -Kn is only necessary here because with -e ruby uses the locale to  
>determine the encoding of the string containing "\x".)
>I thought this feature was implemented very quickly!
>
>What appears to have been implemented is the encoding of "Array#pack"  
>output with "U".
>However, I am not totally convinced that even this was done correctly, as  
>the pack output seems now to be marked UTF-8 even if the pack option  
>contains a mixture of "U" with other options which then can result in an  
>invalid UTF-8 string.
>
>My feature request would mean that "pack" and "\x" string literals could  
>be left as ASCII-8BIT, and be "forced" to another encoding transparently  
>depending on how the programmer uses it.

I think this is totally the wrong way. The problems are with
pack and \x in string literals, and it would be a bad idea to
try and solve them by introducing a general "bytes become characters"
feature.
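
What I have in mind instead is that the label gets attached explicitly,
at the place where the bytes enter the program. A minimal sketch,
assuming a Ruby with String#b; label_as is a hypothetical helper, not an
existing or planned method:

```ruby
# Hypothetical helper: attach an encoding the programmer actually knows,
# and fail loudly if the bytes do not fit it.
def label_as(bytes, encoding)
  text = bytes.dup.force_encoding(encoding)
  raise ArgumentError, "bytes are not valid #{encoding}" unless text.valid_encoding?
  text
end

raw  = "abc\xD8\xB5".b   # ASCII-8BIT, as in the report
utf8 = "abc\u0635"       # UTF-8

# Without a label, concatenation is refused, as in the report.
refused = begin
  raw + utf8
  false
rescue Encoding::CompatibilityError
  true
end

# With an explicit label, the result is ordinary UTF-8 text.
joined = label_as(raw, Encoding::UTF_8) + utf8
puts joined.encoding
```

The point of the design is that the decision is made once, by someone
who knows where the bytes came from, rather than guessed at every
concatenation.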


>You can liken this feature to the transparent conversion of an integer to  
>a float when doing arithmetic.

Well, it's not very similar. The conversion of an integer to a float
is very predictable, but the 'conversion' of ASCII-8BIT to some
real encoding is just a wild guess.
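
A small sketch of the difference, again assuming a Ruby with String#b;
ISO-8859-1 is chosen here only as one example of a competing label:

```ruby
# Integer to Float: one well-defined answer.
sum = 1 + 0.5

# ASCII-8BIT to "some encoding": the same two bytes pass as valid under
# more than one label and mean different text each time.
bytes = "\xD8\xB5".b
as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1)

both_valid = as_utf8.valid_encoding? && as_latin1.valid_encoding?

puts sum
puts as_utf8.length     # one character (U+0635)
puts as_latin1.length   # two characters (U+00D8, U+00B5)
```

Which reading gets chosen would depend only on the string the bytes
happen to be combined with, not on what the bytes actually are.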


>If you agree that this is a good idea, I don't mind trying to produce a  
>patch for it myself. Please let me know.

I don't know about Matz or Nobu, but I don't think this is the way
to go at all.

Regards,   Martin.


>
>Cheers
>Mike
>
>On Wed, 29 Oct 2008 14:53:15 +1100, Michael Selig <redmine / ruby-lang.org>  
>wrote:
>
>> Feature #695: More flexibility when combining ASCII-8BIT strings with  
>> other encodings
>> http://redmine.ruby-lang.org/issues/show/695
>>
>> Author: Michael Selig
>> Status: Open, Priority: Normal
>> Category: M17N
>>
>> Consider the following 3 Ruby statements:
>>
>> # Array#pack always returns ASCII-8BIT
>> s1 = [97, 98, 99, 1589].pack("U*")
>>
>> # \xNN returns the source encoding (even if it is an invalid string), or  
>> # ASCII-8BIT if not set
>> s2 = "abc\xD8\xB5"
>>
>> # \uNNNN always returns UTF-8
>> s3 = "abc\u0635"
>>
>> All of s1, s2, and s3 have the same contents, but different encodings.  
>> When you try to combine them, you get different "encoding compatibility"  
>> problems, which can change depending on the source encoding, due to the  
>> treatment of s2.
>>
>> I would like to see Ruby be able to combine all the above without error.  
>> I don't think it is reasonable to have to use "force_encoding" in these  
>> cases. This would
>> - give better compatibility with 1.8,
>> - make handling of methods returning ASCII-8BIT strings much easier (e.g.  
>> Array#pack and libraries which return strings in ASCII-8BIT because the  
>> encoding is unknown),
>> - reduce the confusion caused with "\x" producing a string which depends  
>> on the source encoding (which I dislike - I think it should always  
>> return ASCII-8BIT).
>>
>> So the feature request is:
>>
>> When combining 2 strings, with one being ASCII-8BIT, and the other is  
>> encoding "E":
>> 1) If the ASCII-8BIT string is valid if forced to encoding E, then treat  
>> the ASCII-8BIT string as being in encoding E;
>> 2) Otherwise treat both strings as ASCII-8BIT.
>>
>> Part (2) is less important, and can probably be omitted if it is hard to  
>> implement.
>>
>> Thank you
>> Michael Selig
>>
>>
>> ----------------------------------------
>> http://redmine.ruby-lang.org
>
>
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp