At 13:57 08/10/31, Michael Selig wrote:
>Hi
>
>On Fri, 31 Oct 2008 13:51:53 +1100, Martin Duerst <duerst / it.aoyama.ac.jp>  
>wrote:
>
>>> Feature #695 was closed & marked done, but unfortunately it does not  
>>> seem to have been implemented :-(
>>
>> I think it should have been marked part done, part rejected,
>> I guess.
>
>Some sort of explanation would also have been nice.

Sometimes things just happen. Often, that's enough, and
if not, it's always possible to ask (as you did).

Bug tracking systems give the impression of perfection,
but one always has to remember that they are only an
attempt.

>But at least we are now discussing it - I was expecting this to happen  
>before implementation :-)
>
>> I don't think it is by chance that most programming languages I
>> know, even if they have a somewhat different internationalization
>> model, more focused on Unicode than Ruby, make a clear distinction
>> between characters and bytes. It also isn't by chance that one
>> of the first things people have to learn when they learn about
>> internationalization is "bytes are not characters".
>
>Yes, I agree with you, and I have raised this "ambiguity" before - in Ruby  
>ASCII-8BIT can either be a byte string or a character string of uncertain  
>encoding.

>The problem I am trying to address here is for simple scripts which don't  
>care about internationalisation.

Well, we could make some simple scripts simpler, but only at the
expense of making bigger scripts much more brittle. In my opinion,
once you use \x string escapes or pack, you have to know about the
distinction between bytes and characters, and should be able to
add the necessary force_encoding (or whatever else is needed).
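To make this concrete, here is a minimal sketch (byte values chosen
just for illustration): pack hands you raw bytes, and if you happen
to know those bytes are UTF-8 text, you say so before mixing them
with character strings:

   bytes = [0xC3, 0xA9].pack("C*")        # two raw bytes
   text  = bytes.force_encoding("UTF-8")  # now one character: "é"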


>>> My feature request would mean that "pack" and "\x" string literals could
>>> be left as ASCII-8BIT, and be "forced" to another encoding transparently
>>> depending on how the programmer uses it.
>>
>> I think this is totally the wrong way. The problems are with
>> pack and \x in string literals, and it would be a bad idea to
>> try and solve them by introducing a general "bytes become characters"
>> feature.
>
>"default_internal" has gone a long way to help solve M17N issues, but  
>there still remains "encoding compatibility" issues even in simple, single  
>encoding scripts, ie: between the locale's encoding and ASCII-8BIT. The  
>motivation behind this feature request was to address this latter point.
>
>I agree with you that there is a problem with "\x" in string literals.
>
>However I am not sure I agree that the problem is in pack. The root of the  
>problem is this ambiguity with ASCII-8BIT between bytes and characters -  
>the way I think it should work is really like a "wild card" encoding.

Well, I think there is a problem in pack. It has so many different
template characters that it's impossible in general to say what
encoding the result should be. Matz did some follow-up work on
your proposal at revision 20057
(see http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/pack.c?view=log),
which tries to get the best result possible for simple cases.
For cases that mix many different template characters at the
same time, it's simply impossible to figure out what the
programmer intended, so the programmer will have to say so explicitly.
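For the simple cases, my understanding of the behaviour after r20057
is roughly the following (the exact output may of course differ
between revisions):

ruby -e 'p [0x263A].pack("U").encoding'    # => #<Encoding:UTF-8>
ruby -e 'p [65, 66].pack("C*").encoding'   # => #<Encoding:ASCII-8BIT>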


>Pack is one simple example of a bunch of methods that return strings, but  
>cannot easily determine what encoding to return them in.

I'd guess pack is one of the more complex ones. If you know others,
please tell us; I don't think anybody is claiming that all the i's
are dotted and all the t's crossed in this area.

>Other examples are decryption and uncompression methods where often the  
>original encoding is not known. In many cases there is no alternative  
>other than to return them as ASCII-8BIT and let the application worry  
>about interpreting the contents.
>
>This is *forcing* the programmer to use "force_encoding()"

Or whatever else is appropriate.
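For example, when the bytes come from a file and the encoding is
known up front, it can be declared at the point of input, and no
force_encoding is needed at all (the file name here is made up):

   File.open("data.txt", "r:ISO-8859-1") do |f|
     text = f.read    # already tagged as ISO-8859-1
   end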

>where in 1.8 it  
>was not necessary, and in 1.9 it can seem rather annoying.

It can seem annoying until you realize that it's necessary.

>There is even a weird exception to this - if the ASCII-8BIT string happens  
>to be all 7-bit chars, then it CAN be combined with other ASCII-compatible  
>encodings.

Yes, that's one point where it may make sense to split ASCII-8BIT
and BINARY.

>This probably allows some 1.8 legacy scripts to work, but only ones  
>working in ASCII.
>I do not think this sort of thing - one that works in some cases, but not  
>in others - is desirable at all.

Yes, but in my view, you are just proposing to go a bit further
down the slippery slope. The chances that ASCII really is ASCII
(and that otherwise, you'll find out pretty quickly when looking
at the data) are much higher than the chances that any more
specific encoding will be 'guessed' right.

>So in fact Ruby already has what you describe as a "bytes become  
>characters feature", but it only works in certain circumstances!
>
>
>>> You can liken this feature to the transparent conversion of an integer  
>>> to
>>> a float when doing arithmetic.
>>
>> Well, it's not very similar. The conversion of an integer to a float
>> is very predictable, but the 'conversion' of ASCII-8BIT to some
>> real encoding is just a wild guess.
>
>A "wild guess" is overstating it. If a program attempts to combine an  
>ASCII-8BIT string with another encoded string, AND it happens to be a  
>valid encoding, I think that the chances are very high that the program is  
>expecting the byte string to be in the other encoding. I think that a  
>heuristic like this is reasonable as it keeps the language backward  
>compatible & neat.
>
>Furthermore as I said, this conversion already happens with ASCII-8BIT  
>character strings consisting only of 7 bit chars,

Well, yes, but then that's clearly reflected in the name "ASCII-8BIT".

>so extending it to all  
>encodings seems an obvious thing to do. Look at:
>
>a) 7-bit char strings work, irrespective of encoding:
>ruby -e 'p ("abc".force_encoding("ASCII-8BIT") +  
>"abc".force_encoding("UTF-8")).encoding'
>=> #<Encoding:UTF-8>
>
>but:
>b) Legal 8-bit encoding string:
>ruby -e 'p ("ab\xE0".force_encoding("ASCII-8BIT") +  
>"ab\xE0".force_encoding("ISO-8859-8")).encoding'
>=> -e:1:in `<main>': incompatible character encodings: ASCII-8BIT and  
>ISO-8859-8 (Encoding::CompatibilityError)
>
>c) Legal multibyte encoding string:
>ruby -ve 'p ("ab\u0635".force_encoding("ASCII-8BIT") +  
>"ab\u0635".force_encoding("UTF-8")).encoding'
>=> -e:1:in `<main>': incompatible character encodings: ASCII-8BIT and  
>UTF-8 (Encoding::CompatibilityError)

I think you have to come up with much more realistic examples
than these.


>Certainly I don't see the downside in the conversion to a single-byte  
>encoding (eg: example (b)) above. Even if it converted when it shouldn't  
>have, the indexing and "codepoint values" are the same as if the result  
>were ASCII-8BIT.

The bytes are of course the same. But what counts is whether we have
the right characters.
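A one-byte illustration (0xE0 is a perfectly legal byte in both
encodings, but stands for different characters):

   s = "\xE0".force_encoding("ISO-8859-1")  # "à" (LATIN SMALL LETTER A WITH GRAVE)
   t = "\xE0".force_encoding("ISO-8859-8")  # "א" (HEBREW LETTER ALEF)
   s.encode("UTF-8") == t.encode("UTF-8")   # => false: same byte, different characters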

>One other idea: maybe we should distinguish between 2 encodings "BINARY"  
>and "ASCII-8BIT", which are currently aliases. Essentially they are the  
>same, but "BINARY" would mean "bytestring" and will generate an error if  
>you try to combine it with any other encoding, while "ASCII-8BIT" would  
>mean "unknown encoding", which can be combined transparently with other  
>encodings.

See separate mail on this topic.
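Restating your proposed distinction as code may help; this is purely
hypothetical, since today BINARY and ASCII-8BIT are aliases for the
same encoding:

   raw = "\xE0".force_encoding("BINARY")       # "these are bytes, period"
   unk = "\xE0".force_encoding("ASCII-8BIT")   # "encoding not yet known"
   u   = "abc".force_encoding("UTF-8")
   # Under the proposal (not in today's Ruby, where both operations
   # raise Encoding::CompatibilityError because \xE0 is not 7-bit):
   #   raw + u   # would still raise Encoding::CompatibilityError
   #   unk + u   # would succeed, with the result promoted to UTF-8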


Regards,   Martin.

>Maybe there is a better solution - any ideas?
>
>Cheers
>Mike
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst / it.aoyama.ac.jp