Hello Charles,

On 2012/11/06 11:51, headius (Charles Nutter) wrote:
>
> Issue #7282 has been reported by headius (Charles Nutter).
>
> ----------------------------------------
> Bug #7282: Invalid UTF-8 from emoji allowed through silently
> https://bugs.ruby-lang.org/issues/7282
>
> Author: headius (Charles Nutter)
> Status: Open
> Priority: Normal
> Assignee:
> Category:
> Target version:
> ruby -v: 2.0.0
>
>
> On my system, where the default encoding is UTF-8, the following should not parse:
>
> ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

It doesn't, because the quotes in your example are unbalanced. It should be
ruby-2.0.0 -e 'p "Hello, \x96 world!"}'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!\"}"'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!"'
or some such. But apart from that, you are right.

I'm no longer sure, but I think at some point there was an argument for 
allowing \x in UTF-8 literals, and a reason not to check. But I can't 
remember what it was, and if we can't remember, then we'd better make it check.

> But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
> "{\"sample\": \"Hello, \x96 world!\"}"

Encoding to the encoding you're already in is a no-op. See also 
https://bugs.ruby-lang.org/issues/6321.
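
To make the no-op visible, here is a minimal sketch. It assumes the string is tagged as UTF-8 (done explicitly with force_encoding, so that it doesn't depend on the source or locale encoding); the variable names are only for illustration. String#valid_encoding? is the explicit check that does notice the stray byte.

```ruby
# Build the string from the report with an explicit UTF-8 tag.
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# Same source and target encoding: nothing is converted, nothing is checked.
same = s.encode("UTF-8")
puts same.bytes == s.bytes   # the bytes come back unchanged

# Asking explicitly is what reveals the problem.
puts s.valid_encoding?       # false, because 0x96 is not valid UTF-8 here
```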

> Nor does character-walking:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
> Hello, ? world!
>
> Nor does []:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
> "\x96"

The underlying machinery is the same.

> But the malformed String does get caught by transcoding to UTF-16:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
> -e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
> 	from -e:1:in `<main>'

Yes, here you're actually transcoding, so this is checked.
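
For comparison, a small sketch of the same string going through a real conversion. Again the explicit force_encoding and the `result` variable are my assumptions for the example, not something from your report.

```ruby
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

begin
  # UTF-8 to UTF-16 has to decode every byte, so the bad one is found.
  s.encode("UTF-16")
  result = :converted
rescue Encoding::InvalidByteSequenceError => e
  result = :rejected
  puts e.message
end

p result
```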


> Or by doing a simple regexp match:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> 	from -e:1:in `match'
> 	from -e:1:in `<main>'

We'd need to dig in the code to figure out why it happens here.
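
Until then, a caller who wants to avoid the exception can check first. This is only a sketch: safe_match is a hypothetical helper name, not anything in Ruby itself, and returning nil for a malformed string is one possible choice among several.

```ruby
# safe_match is a hypothetical helper, not part of Ruby's API.
# It returns nil for strings that are not valid in their own encoding,
# and otherwise behaves like String#match.
def safe_match(str, pattern)
  return nil unless str.valid_encoding?
  str.match(pattern)
end

bad  = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)
good = "Hello, world!".dup.force_encoding(Encoding::UTF_8)

p safe_match(bad, /.+/)
p safe_match(good, /.+/)
```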

> And of course I am ignoring the fact that it should never have parsed to begin with.
>
> This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
>
> JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.

Overall, the idea (I think) is to strike a balance between efficiency and 
correctness. But checking at parsing time would probably be a rather 
efficient way of avoiding errors.

Regards,    Martin.