Wolfgang Nádasi-Donner wrote:
> I think it's solved.
Unfortunately not completely, but let me show details and
a possible solution.
1) \C-x and \cx applied to a character outside 00..7F
now result in an error, which is fine...
>ruby19 -v
ruby 1.9.0 (2007-11-19 patchlevel 0) [i386-mingw32]
t = "ãâ¥ã\C-ãâ¥ã"
>ruby19 utf8manipul2.rb
utf8manipul2.rb:1: Invalid escape character syntax
¥Ä¥¨ãõÎäÇðt = "¥ÆÈÅ¥¥¥Ä¥·\C-¥ÆÈÅ¥¥¥Ä¥·"
^
The output looks strange, but it's a Windows console.
I'm a little surprised that the BOM shows up inside
the error message, but it doesn't matter.
=======================================================
2) \M-c applied to a character in range 00..7F is still
allowed and produces the same problem as before...
>ruby19 -v
ruby 1.9.0 (2007-11-19 patchlevel 0) [i386-mingw32]
t = "a\M-aa"
puts t.encoding # => <Encoding:UTF-8>
puts t.length # => 2
puts t.bytesize # => 3
t.each_byte{|b|print("%X " %b)} # => 61 E1 61
puts
t.each_char{|c|print("%X" % c.ord)}
# => utf8manipul.rb:7:in `each_char': index out of range (IndexError)
# =>from utf8manipul.rb:7:in `<main>'
# => 61
=======================================================
I'm not surprised by this, because "a" is in range 00..7F,
but "\M-a" generates an ill-formed UTF-8 encoding.
If the string is to have the encoding UTF-8 (the file starts
with a BOM, so Ruby expects UTF-8), the resulting code point
0xE1 must be encoded as the two bytes 0xC3 0xA1, which would
be the well-formed UTF-8 encoding for "\M-a".
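A minimal sketch of the difference, assuming a Ruby that provides
String#valid_encoding?. The byte values are the ones reported above;
the variable names are only illustrative, and the strings are built
from bytes rather than from a "\M-a" literal, so this mirrors the
reported result without depending on how the parser handles it.

```ruby
# Bytes 61 E1 61, as printed by each_byte above, tagged as UTF-8.
ill_formed = [0x61, 0xE1, 0x61].pack("C*").force_encoding("UTF-8")

# The same text with code point 0xE1 encoded as the two bytes C3 A1.
well_formed = [0x61, 0xC3, 0xA1, 0x61].pack("C*").force_encoding("UTF-8")

puts ill_formed.valid_encoding?   # false: E1 starts a multi-byte sequence that never completes
puts well_formed.valid_encoding?  # true
puts well_formed.length           # 3 characters
puts well_formed.bytesize         # 4 bytes
```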
In general this would lead to the calculation...
first_byte = ((char.ord & 0x40) / 64) | 0xC2
second_byte = (char.ord & 0x3F) | 0x80
..., if I haven't made an error with the bits.
BUT - this should only occur if the encoding of the generated
String object is "<Encoding:UTF-8>", otherwise it doesn't make
any sense.
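The bit calculation above can be cross-checked against Ruby's own
UTF-8 packing. This is only a sketch: meta_utf8_bytes is a
hypothetical helper name, and it assumes that "\M-c" means the code
point (c.ord | 0x80) for a character c in range 00..7F.

```ruby
# Hypothetical helper: the two UTF-8 bytes for "\M-c", using the
# calculation from the text. Assumes char is in range 00..7F.
def meta_utf8_bytes(char)
  first_byte  = ((char.ord & 0x40) / 64) | 0xC2
  second_byte = (char.ord & 0x3F) | 0x80
  [first_byte, second_byte]
end

# Compare with Array#pack("U"), which encodes a code point as UTF-8.
code_point = "a".ord | 0x80
expected   = [code_point].pack("U").unpack("C*")

puts meta_utf8_bytes("a").map { |b| "%X" % b }.join(" ")  # C3 A1
puts expected.map { |b| "%X" % b }.join(" ")              # C3 A1
```

Both lines print the same bytes for "a", and the same comparison
holds for every character in range 00..7F.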
I think the best way to solve this is to disallow "\M-."
constructs unless the resulting String object has the
encoding "<Encoding:ASCII-8BIT>", "<Encoding:ISO-8859-1>", or
"Binary", which isn't in the list of encodings in the current
snapshot.
I hope to bring better news tomorrow ;-)
Wolfgang Nádasi-Donner