Wolfgang Nádasi-Donner wrote:
> I think it's solved.

Unfortunately not completely, but let me show details and
a possible solution.

1) \C-x and \cx applied to a character outside 00..7F
    now result in an error - that's fine...

    >ruby19 -v
    ruby 1.9.0 (2007-11-19 patchlevel 0) [i386-mingw32]

    t = "ãâ¥ã\C-ãâ¥ã"

    >ruby19 utf8manipul2.rb
    utf8manipul2.rb:1: Invalid escape character syntax
    ¥Ä¥¨ãõÎäÇðt = "¥ÆÈÅ¥¥¥Ä¥·\C-¥ÆÈÅ¥¥¥Ä¥·"
                ^

    The output looks strange, but it's a Windows console.
    I'm a little surprised that the BOM appears inside
    the error message, but it doesn't matter.

=======================================================

2) \M-c applied to a character in range 00..7F is still
    allowed and produces the same problem as before...

    >ruby19 -v
    ruby 1.9.0 (2007-11-19 patchlevel 0) [i386-mingw32]

    t = "a\M-aa"
    puts t.encoding                     # => <Encoding:UTF-8>
    puts t.length                       # => 2
    puts t.bytesize                     # => 3
    t.each_byte{|b|print("%X " %b)}     # => 61 E1 61
    puts
    t.each_char{|c|print("%X" % c.ord)}
    # => utf8manipul.rb:7:in `each_char': index out of range (IndexError)
    # =>from utf8manipul.rb:7:in `<main>'
    # => 61

=======================================================

I'm not surprised by this, because "a" is in range 00..7F,
but "\M-a" generates an ill-formed utf-8 encoding.
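
As a small sketch of why the byte sequence is ill-formed, the bytes
61 E1 61 from the each_byte output can be rebuilt by hand and checked.
This assumes a current Ruby where String#valid_encoding? is available
(it may not exist in the snapshot used above); the constant names are
only for illustration.

```ruby
# Bytes as printed by each_byte above, tagged as UTF-8.
ILL_FORMED = [0x61, 0xE1, 0x61].pack("C*").force_encoding("UTF-8")

# The same text with 0xE1 encoded as the two bytes C3 A1.
WELL_FORMED = [0x61, 0xC3, 0xA1, 0x61].pack("C*").force_encoding("UTF-8")

# 0xE1 announces a three-byte sequence, but 0x61 is not a continuation byte.
puts ILL_FORMED.valid_encoding?    # false
puts WELL_FORMED.valid_encoding?   # true
puts WELL_FORMED.length            # 3
```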

If the string is to have the encoding utf-8 (the file starts
with a BOM, so Ruby expects utf-8), the resulting code point
0xE1 must be encoded as the two bytes 0xC3 0xA1, which would
be the well-formed utf-8 encoding for "\M-a".

In general this would lead to the calculation...

first_byte  = ((char.ord & 0x40) / 64) | 0xC2
second_byte = (char.ord & 0x3F) | 0x80

..., if I haven't made an error with the bits.
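
The calculation can be tried out with a small sketch. The helper name
meta_utf8_bytes is hypothetical (it is not part of Ruby), and it
assumes "\M-c" means the code point c.ord | 0x80 for a character c in
00..7F. The result is cross-checked against pack("U").

```ruby
# Hypothetical helper, not part of Ruby: the two UTF-8 bytes for "\M-c",
# assuming the intended code point is c.ord | 0x80.
def meta_utf8_bytes(char)
  code = char.ord
  raise ArgumentError, "expected a character in 00..7F" unless code < 0x80

  first_byte  = ((code & 0x40) >> 6) | 0xC2   # same as (code & 0x40) / 64
  second_byte = (code & 0x3F) | 0x80
  [first_byte, second_byte]
end

bytes = meta_utf8_bytes("a")
puts bytes.map { |b| "%X" % b }.join(" ")         # C3 A1

# Cross-check against Ruby's own UTF-8 packing of code point 0xE1.
puts bytes == [0xE1].pack("U").unpack("C*")       # true
```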

BUT - this should only occur if the encoding of the generated
String object is "<Encoding:UTF-8>", otherwise it doesn't make
any sense.

I think the best way to solve this is to disallow "\M-."
constructs unless the resulting String object has the
encoding "<Encoding:ASCII-8BIT>", "<Encoding:ISO-8859-1>", or
"Binary", which isn't in the list of encodings in the current
snapshot.

I hope to bring better news tomorrow ;-)

Wolfgang Nádasi-Donner