Hi

On Fri, 31 Oct 2008 13:51:53 +1100, Martin Duerst <duerst / it.aoyama.ac.jp>  
wrote:

>> Feature #695 was closed & marked done, but unfortunately it does not  
>> seem to have been implemented :-(
>
> I think it should have been marked part done, part rejected,
> I guess.

Some sort of explanation would also have been nice.
But at least we are now discussing it - I was expecting this to happen  
before implementation :-)

> I don't think it is by chance that most programming languages I
> know, even if they have a somewhat different internationalization
> model, more focused on Unicode than Ruby, make a clear distinction
> between characters and bytes. It also isn't by chance that one
> of the first things people have to learn when they learn about
> internationalization is "bytes are not characters".

Yes, I agree with you, and I have raised this "ambiguity" before - in Ruby an  
ASCII-8BIT string can be either a byte string or a character string of  
uncertain encoding.
The problem I am trying to address here arises in simple scripts that don't  
care about internationalisation.
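
To make the ambiguity concrete (the file name is just an example - any  
binary read will do, since read(length) hands back raw bytes):

ruby -e 's = File.open("/etc/hosts", "rb") { |f| f.read(4) }; p s.encoding'
=> #<Encoding:ASCII-8BIT>

The exact same tag is used for data that is really text of unknown encoding.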

>> My feature request would mean that "pack" and "\x" string literals could
>> be left as ASCII-8BIT, and be "forced" to another encoding transparently
>> depending on how the programmer uses it.
>
> I think this is totally the wrong way. The problems are with
> pack and \x in string literals, and it would be a bad idea to
> try and solve them by introducing a general "bytes become characters"
> feature.

"default_internal" has gone a long way to help solve M17N issues, but  
there still remains "encoding compatibility" issues even in simple, single  
encoding scripts, ie: between the locale's encoding and ASCII-8BIT. The  
motivation behind this feature request was to address this latter point.

I agree with you that there is a problem with "\x" in string literals.
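At least with the 1.9 build I am testing against, a literal built purely  
from "\x" escapes comes out as ASCII-8BIT even though it sits in source  
code whose encoding is known:

ruby -e 'p "ab\xE0".encoding'
=> #<Encoding:ASCII-8BIT>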

However I am not sure I agree that the problem is in pack. The root of the  
problem is this ambiguity of ASCII-8BIT between bytes and characters - the  
way I think it should work is really as a "wild card" encoding.
Pack is just one example of a whole family of methods that return strings  
but cannot easily determine what encoding to return them in.
Other examples are decryption and decompression methods, where the original  
encoding is often not known. In many cases there is no alternative but to  
return the result as ASCII-8BIT and let the application worry about  
interpreting the contents.
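
For example (the byte values here are arbitrary, just to illustrate):

ruby -e 'p [104, 105, 0xE0].pack("C*").encoding'
=> #<Encoding:ASCII-8BIT>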

This is *forcing* the programmer to use "force_encoding()" where in 1.8 it  
was not necessary, and in 1.9 it can seem rather annoying.
There is even a weird exception to this - if the ASCII-8BIT string happens  
to be all 7-bit chars, then it CAN be combined with other ASCII-compatible  
encodings.
This probably allows some 1.8 legacy scripts to work, but only ones  
working in ASCII.
I do not think this sort of thing - one that works in some cases, but not  
in others - is desirable at all.
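
Just to spell out the force_encoding() dance I mean: take two bytes that  
are really the UTF-8 encoding of U+0635. Concatenating them with a UTF-8  
string raises the same CompatibilityError as in example (c) below, unless  
I first write:

ruby -e 'p ([0xD8, 0xB5].pack("C*").force_encoding("UTF-8") +  
"\u0635").encoding'
=> #<Encoding:UTF-8>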

So in fact Ruby already has what you describe as a "bytes become  
characters feature", but it only works in certain circumstances!


>> You can liken this feature to the transparent conversion of an integer  
>> to
>> a float when doing arithmetic.
>
> Well, it's not very similar. The conversion of an integer to a float
> is very predictable, but the 'conversion' of ASCII-8BIT to some
> real encoding is just a wild guess.

A "wild guess" is overstating it. If a program attempts to combine an  
ASCII-8BIT string with another encoded string, AND it happens to be a  
valid encoding, I think that the chances are very high that the program is  
expecting the byte string to be in the other encoding. I think that a  
heuristic like this is reasonable as it keeps the language backward  
compatible & neat.
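
To make the heuristic concrete, here is a rough Ruby sketch of the rule I  
have in mind ("combine" is just an illustrative name, not proposed API,  
and I have only shown one direction):

def combine(a, b)
  # Promote an ASCII-8BIT operand to the other operand's encoding if its
  # bytes happen to be valid in that encoding, then concatenate.
  if a.encoding == Encoding::ASCII_8BIT && b.encoding != Encoding::ASCII_8BIT
    promoted = a.dup.force_encoding(b.encoding)
    return promoted + b if promoted.valid_encoding?
  end
  a + b  # otherwise fall back to today's behaviour (may raise CompatibilityError)
end

Obviously the real check would have to live inside String#+ itself rather  
than a helper like this; the sketch is only meant to show the intent.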

Furthermore, as I said, this conversion already happens for ASCII-8BIT  
character strings consisting only of 7-bit chars, so extending it to all  
encodings seems an obvious thing to do. Look at:

a) 7-bit char strings work, irrespective of encoding:
ruby -e 'p ("abc".force_encoding("ASCII-8BIT") +  
"abc".force_encoding("UTF-8")).encoding'
=> #<Encoding:UTF-8>

but:
b) Legal 8-bit encoding string:
ruby -e 'p ("ab\xE0".force_encoding("ASCII-8BIT") +  
"ab\xE0".force_encoding("ISO-8859-8")).encoding'
=> -e:1:in `<main>': incompatible character encodings: ASCII-8BIT and  
ISO-8859-8 (Encoding::CompatibilityError)

c) Legal multibyte encoding string:
ruby -e 'p ("ab\u0635".force_encoding("ASCII-8BIT") +  
"ab\u0635".force_encoding("UTF-8")).encoding'
=> -e:1:in `<main>': incompatible character encodings: ASCII-8BIT and  
UTF-8 (Encoding::CompatibilityError)

Certainly I don't see the downside in the conversion to a single-byte  
encoding (e.g. example (b) above). Even if it converted when it shouldn't  
have, the indexing and "codepoint values" are the same as if the result  
were ASCII-8BIT.
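
A quick sanity check of that claim (byte values arbitrary, on my build):

ruby -e 's = "ab\xE0"; p [s.force_encoding("ASCII-8BIT")[2].ord,  
s.force_encoding("ISO-8859-8")[2].ord]'
=> [224, 224]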

One other idea: maybe we should distinguish between two encodings,  
"BINARY" and "ASCII-8BIT", which are currently aliases. Essentially they  
are the same, but "BINARY" would mean "byte string" and would raise an  
error if you tried to combine it with any other encoding, while  
"ASCII-8BIT" would mean "unknown encoding", which could be combined  
transparently with other encodings.
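For reference, the two names really do map to one and the same encoding  
today:

ruby -e 'p Encoding.find("BINARY")'
=> #<Encoding:ASCII-8BIT>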
Maybe there is a better solution - any ideas?

Cheers
Mike