On Aug 7, 2009, at 10:41 AM, Vít Ondruch wrote:

>> You are not allowed to set the source encoding to a non-ASCII
>> compatible encoding, if memory serves.
>
> Where is it documented please?

I'm not sure it's officially documented yet.

Ruby does throw an error in this scenario though:

$ ruby_dev
# encoding: UTF-16BE
ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)

and:

$ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
-:1: invalid multibyte char (UTF-8)

I believe this is the relevant code from Ruby's parser:

static void
parser_set_encode(struct parser_params *parser, const char *name)
{
    int idx = rb_enc_find_index(name);
    rb_encoding *enc;

    if (idx < 0) {
	rb_raise(rb_eArgError, "unknown encoding name: %s", name);
    }
    enc = rb_enc_from_index(idx);
    if (!rb_enc_asciicompat(enc)) {
	rb_raise(rb_eArgError, "%s is not ASCII compatible", rb_enc_name(enc));
    }
    parser->enc = enc;
}
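
Incidentally, that same check is exposed at the Ruby level (in 1.9) as
Encoding#ascii_compatible?, which makes it easy to see which encodings
the parser would reject.  A quick illustration of my own, not taken
from the parser:

  Encoding::UTF_8.ascii_compatible?      # => true
  Encoding::Shift_JIS.ascii_compatible?  # => true
  Encoding::UTF_16BE.ascii_compatible?   # => false  (hence the ArgumentError above)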

>> That eliminates any issues
>> with encodings like UTF-16.  This makes perfect sense as there's no
>> way to reliably support the magic encoding comment unless we can count
>> on being able to read at least that far.
>
> Needed to say that XML parsers can handle such cases, i.e. when xml
> header is in different encoding than the rest of document.

I doubt we can say that universally.  :)

Also, what you said isn't very accurate.  For example, "in different
encoding than the rest of document" is not a possible occurrence
according to the XML 1.1 specification
(http://www.w3.org/TR/2006/REC-xml11-20060816/), which states:

"It is a fatal error when an XML processor encounters an entity with =20
an encoding that it is unable to process. It is a fatal error if an =20
XML entity is determined (via default, encoding declaration, or higher-=20=

level protocol) to be in a certain encoding but contains byte =20
sequences that are not legal in that encoding."

All XML parsers are required to assume UTF-8 unless told otherwise and
to be able to recognize UTF-16 by a required BOM.  Beyond that, they
are not required to recognize any other encodings, though they may of
course.  The encoding declaration itself can be expressed in ASCII and,
since parsers assume UTF-8 by default, this is similar to what Ruby
does.  It allows a switch to an ASCII-compatible encoding.
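
To make the parallel concrete, here's a minimal sketch of my own (not
taken from either spec) showing Ruby's form of that declaration.  The
magic comment itself is plain ASCII, so the parser can read it under
its default assumption and then retag the rest of the source:

  # encoding: ISO-8859-1

  # The comment above is pure ASCII.  Once the parser has seen it,
  # string literals in this file are tagged as ISO-8859-1.
  puts "caf\xE9".encoding   # => ISO-8859-1

That only works because ISO-8859-1 is ASCII compatible; swap in
UTF-16BE and you get the ArgumentError shown earlier.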

XML processors may do more.  For example, they can accept a different
encoding from an external source to support things like HTTP headers
and MIME types.  Ruby doesn't really have access to such sources at
execution time, so that option doesn't apply to the case we are
discussing.  However, XML processors may also recognize other BOMs,
and Ruby could do this.

>> A BOM could be handled similarly to what I showed.  You need to open
>> the file in ASCII-8BIT and check the beginning bytes, then you could
>> switch to US-ASCII and finish reading the first line (or to the second
>> if a shebang line is included), then switch encodings again if needed
>> and finish processing.
>
> May be this technique could be used for reading UTF-16 encoded
> files, if needed?

Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
support them.  A BOM would be required in this case though, just as it
is in an XML processor that doesn't have external information.

Ruby doesn't currently do this, as near as I can tell.
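
For the curious, here's a rough sketch of the kind of check I was
describing above.  It's purely illustrative -- detect_source_encoding()
and its BOM table are names and choices of my own, and real support
would live in the parser itself, not in Ruby code:

  def detect_source_encoding(path)
    head  = File.open(path, "rb") { |f| f.read(4).to_s }  # raw ASCII-8BIT bytes
    bytes = head.unpack("C*")
    boms  = [ ["UTF-32BE", [0x00, 0x00, 0xFE, 0xFF]],
              ["UTF-32LE", [0xFF, 0xFE, 0x00, 0x00]],  # before UTF-16LE, whose
              ["UTF-16BE", [0xFE, 0xFF]],              # BOM is a prefix of this one
              ["UTF-16LE", [0xFF, 0xFE]],
              ["UTF-8",    [0xEF, 0xBB, 0xBF]] ]
    boms.each do |name, bom|
      return Encoding.find(name) if bytes[0, bom.size] == bom
    end
    Encoding.find("US-ASCII")  # no BOM:  scan the first line (or second, after
                               # a shebang) for a magic comment as Ruby does now
  end

After a BOM match, the parser would still have to read the rest of the
file in that encoding, which is exactly the part Ruby doesn't do today.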

Note that this would not give what you proposed in your initial
message:  multiple encodings in the same file.  Ruby doesn't support
that and isn't ever likely to.  An XML processor that supports such
things is in violation of its specification, as I understand it.

Besides, not many text editors that I'm aware of make it super easy to
edit in multiple encodings.  :)

James Edward Gray II