Hi,

Sorry, perhaps I have been giving a (bad) solution, rather than stating  
the problem clearly, so let me try again!
I certainly didn't mean to suggest there should be any transcoding of  
string literals by Ruby's parser.

So here are the problems as I see them. They are all to do with the  
default encoding of string literals, and they are all fairly minor, but I  
think addressing them has merit:

1) The encoding of string literals constructed with "\x..." is ambiguous.
Well, not strictly ambiguous, but it can certainly be confusing. The  
trouble is that a string literal like the example in bug #680  
"\x82\xA0,\x82\xA2" can either be used as a "binary" string (ASCII-8BIT)  
or an encoded character string (intended to be Shift_JIS in this case),  
but this depends on the source encoding. While technically these are the  
same data, they are used in quite different ways in practice. Also, as we  
see in the bug report, it can cause mysterious errors such as "Bad UTF-8  
string" because the source encoding was apparently UTF-8, not Shift_JIS  
(thank you to Martin for pointing this out).
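
To make the two readings concrete, here is a minimal sketch. It takes the bytes from the bug #680 example as a binary string and then tags the same bytes once as Shift_JIS and once as UTF-8. The method name bug680_bytes is made up for illustration, and String#b assumes Ruby 2.0 or later.

```ruby
# The bytes from the bug #680 example, as a binary (ASCII-8BIT) string.
def bug680_bytes
  "\x82\xA0,\x82\xA2".b
end

# Tag fresh copies of the same bytes with two different encodings.
as_sjis = bug680_bytes.force_encoding(Encoding::Shift_JIS)
as_utf8 = bug680_bytes.force_encoding(Encoding::UTF_8)

p as_sjis.bytes == as_utf8.bytes  # the underlying data is identical
p as_sjis.valid_encoding?         # the bytes form valid Shift_JIS
p as_utf8.valid_encoding?         # the same bytes are not valid UTF-8
```

The data never changes; only the encoding tag decides whether the string is usable as text.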

Ruby treats strings constructed with "\u..." differently: they are set to  
UTF-8 no matter what the source encoding is. I think this is the correct  
behaviour - there is no ambiguity. But "\x..." is not treated like this.  
When the source encoding is not specified (or is US-ASCII), a "\x.."  
string is set to ASCII-8BIT. Again I think this is the correct behaviour.  
However if the source encoding is set to anything else, the encoding of  
the string is set to the source encoding. I think this is the part that is  
wrong, especially as the resultant string can be "broken", and no warning  
is given about this by the parser.
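
A quick way to see this for a given file is to ask the literals themselves. This is only a sketch (the method name literal_encodings is hypothetical); what it reports for the "\x.." literal depends on the file's magic comment and on the Ruby version, so I give no expected output here.

```ruby
# Report the encodings the parser assigned to two kinds of escaped literal.
def literal_encodings
  {
    source:   __ENCODING__,        # source encoding of this file
    u_escape: "\u3042".encoding,   # a "\u..." literal
    x_escape: "\x82\xA0".encoding  # a "\x.." literal
  }
end

literal_encodings.each { |kind, enc| puts "#{kind}: #{enc}" }
```

Changing the magic comment at the top of the file and re-running shows which of the two literals follows the source encoding.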

My preference would be to *always* encode string literals constructed with  
"\x.." as ASCII-8BIT, ignoring the source encoding. This means that if you  
really want to use such a literal as an encoded string, you must use  
"force_encoding". I think this would be much clearer and get rid of the  
"ambiguity".
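
As a rough sketch of how that would look in a script, assuming the literal always arrives as binary: the helper below (tag_bytes is a made-up name, not an existing Ruby method) tags the bytes explicitly and fails loudly if they are not valid in the requested encoding, which is the warning the parser does not give today.

```ruby
# Hypothetical helper: tag binary bytes with an encoding, and raise
# if the bytes are not valid in that encoding.
def tag_bytes(bytes, encoding)
  text = bytes.b.force_encoding(encoding)  # work on a binary copy
  raise ArgumentError, "bytes are not valid #{text.encoding}" unless text.valid_encoding?
  text
end

sjis = tag_bytes("\x82\xA0,\x82\xA2", Encoding::Shift_JIS)
puts sjis.encoding

begin
  tag_bytes("\x82\xA0,\x82\xA2", Encoding::UTF_8)
rescue ArgumentError => e
  puts e.message
end
```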

2) I find it slightly redundant to have to specify BOTH the  
default_internal, and the source encoding at the top of an m17n script  
which contains multibyte string literals, when in all practical cases they  
should be the same, e.g.:

	#! /usr/bin/ruby -E:UTF-8
	# encoding: UTF-8

My suggestion for "defaulting" the source encoding was an attempt to avoid  
having to do this (but probably not a good way!). It isn't a big deal, and  
I understand the argument that the source encoding is a property of the  
script. My original suggestion (last month) of a special magic comment was  
to have a way of specifying BOTH the default_internal and source encoding  
once, but this idea was rejected.
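
For reference, the two settings can be inspected from inside a script. This is a minimal sketch (encoding_settings is a hypothetical name); it only reads the values and does not change anything.

```ruby
# Collect the settings that point 2 says normally have to be given twice.
def encoding_settings
  {
    source:           __ENCODING__,               # magic comment, or the default
    default_internal: Encoding.default_internal,  # set by -E or -U, otherwise nil
    default_external: Encoding.default_external   # from the locale, or set by -E
  }
end

encoding_settings.each { |name, enc| puts "#{name}: #{enc.inspect}" }
```

Run with the two header lines above, source and default_internal should both report UTF-8; drop either line and they can diverge.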

3) I think there should be some check (warning message?) that the  
(non-ASCII-8BIT) string literals in a library file are compatible with the  
"default_internal" of the calling program (if it is set). Ideally this  
check would be done when the "require" is called to flag possible  
incompatibilities early.
Perhaps this check could be based on the library's source encoding? If  
this were done, most libraries would have to use a source encoding of  
US-ASCII (or just have no encoding magic comment) *not* UTF-8, so that  
non-Unicode default_internal settings will work. Perhaps Ruby could be  
smarter, and only flag an error if there actually is an incompatible  
string literal in the library?
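
To show what I mean by "compatible", here is a sketch of the per-literal test such a check might apply. The method compatible_with_internal? is hypothetical (nothing like it exists in Ruby); it simply tries to transcode the literal to the target encoding and treats any EncodingError as incompatible.

```ruby
# Hypothetical check: can +literal+ be represented in +internal+?
# Binary (ASCII-8BIT) strings are skipped, as suggested above.
def compatible_with_internal?(literal, internal = Encoding.default_internal)
  return true if internal.nil?                             # nothing to check against
  return true if literal.encoding == Encoding::ASCII_8BIT  # binary data is exempt
  literal.encode(internal)                                 # raises if not representable
  true
rescue EncodingError
  false
end

p compatible_with_internal?("plain ASCII", Encoding::ISO_8859_1)
p compatible_with_internal?("caf\u00E9", Encoding::ISO_8859_1)
p compatible_with_internal?("\u3042", Encoding::ISO_8859_1)
```

Only the last literal fails here, because hiragana has no ISO-8859-1 representation; a library containing only the first two would pass even with a UTF-8 source encoding.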

4) I was surprised at the different source encoding behaviour when using  
"-e" compared to a script in a file. (Again, thank you to Martin for  
telling me about it.)
Matz wrote:

> -e takes programs from command line shell, which probably yields
> strings in locale encoding anyway.  But we cannot assume that for
> scripts contained in files.

Again, I understand the sentiment, but for a simple non-m17n, non-ASCII  
Ruby script that was likely written with an editor on the same machine or  
in the same locale, why force it to have an "encoding" magic comment?

Also it means that:
	ruby test.rb
may behave differently from:
	ruby -e "`cat test.rb`"

Again potentially confusing, but not a big deal.
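
The difference can be observed with a small sketch that runs the same one-line program both ways and reports the source encoding each run sees. The method name source_encodings and the temporary file name are made up for illustration, and the result depends on the Ruby version and the locale, so I give no expected output.

```ruby
require "rbconfig"
require "tempfile"

# Run the same program via -e and via a file, and return the source
# encoding name that each run reports.
def source_encodings
  program = "print __ENCODING__"
  via_e = IO.popen([RbConfig.ruby, "-e", program], &:read)
  via_file = Tempfile.create(["enc_test", ".rb"]) do |file|
    file.write(program)
    file.flush
    IO.popen([RbConfig.ruby, file.path], &:read)
  end
  { dash_e: via_e, file: via_file }
end

p source_encodings
```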

I hope I have made myself clearer this time!

Thanks,
Mike.