Issue #7282 has been updated by headius (Charles Nutter).


duerst (Martin Dürst) wrote:
>  > On my system, where the default encoding is UTF-8, the following should not parse:
>  >
>  > ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'
>  
>  It doesn't. It should be
>  ruby-2.0.0 -e 'p "Hello, \x96 world!"}'
>  or
>  ruby-2.0.0 -e 'p "Hello, \x96 world!\"}"'
>  or
>  ruby-2.0.0 -e 'p "Hello, \x96 world!"'
>  or some such. But apart from that, you are right.

Yeah, sorry... I guess I was rushed when filing this issue. The last one is what I was going for.

>  I'm no longer sure, but I think at some point, there was an argument to 
>  allow \x in UTF-8 literals, and a reason to not check. But I can't 
>  remember what, and if we can't remember, then we'd better make it check.

Yes, it seems like either this string should be forced to ASCII-8BIT, or else it shouldn't be allowed to parse in the first place. It definitely should not parse *and* be marked as valid UTF-8.
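
For reference, a minimal sketch of the two labelings being discussed. It is only an illustration, not parser code from MRI or JRuby. It builds the bytes explicitly (String#b, Ruby 2.0+) so the result does not depend on the source or locale encoding; the variable names are mine.

```ruby
# Same bytes, two different encoding labels.
bytes = "Hello, \x96 world!".b

as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)
as_binary = bytes.dup.force_encoding(Encoding::ASCII_8BIT)

# \x96 is a UTF-8 continuation byte with no lead byte before it,
# so the UTF-8 labeling is not a valid string.
p as_utf8.valid_encoding?    # false

# Labeled as ASCII-8BIT, every byte is acceptable.
p as_binary.valid_encoding?  # true
```

Either rejecting the literal or giving it the second labeling would avoid a UTF-8 string that reports valid_encoding? as false.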

>  > But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:
>  
>  > system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
>  > "{\"sample\": \"Hello, \x96 world!\"}"
>  
>  Encoding to the encoding you're already in is a no-op. See also 
>  https://bugs.ruby-lang.org/issues/6321.

Thank you. I suspected as much and will make changes to JRuby (and RubySpec if needed). JRuby was always doing the transcoding, so it blew up here attempting to walk UTF-8 characters.
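
To illustrate the no-op point, here is a small sketch of the same-encoding versus cross-encoding behavior. It assumes nothing beyond String#encode as shown in the report; the variable names are mine.

```ruby
s = "Hello, \x96 world!".b.force_encoding(Encoding::UTF_8)

# UTF-8 to UTF-8: nothing is transcoded, so nothing is validated.
same = s.encode("UTF-8")
p same.bytes == s.bytes   # true

# UTF-8 to UTF-16: the bytes have to be decoded, so the stray \x96 is caught.
error = begin
  s.encode("UTF-16")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end
p error.class
```

So encode is only a reliable validity check when the target encoding differs from the source encoding.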

>  > Nor does character-walking:
>  
>  > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
>  > Hello, ? world!
>  >
>  > Nor does []:
>  
>  > system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
>  > "\x96"
>  
>  The underlying machinery is the same.

Makes sense. JRuby also allows these cases through. Perhaps both cases should fail once they encounter a non-7-bit byte that is not part of a valid multibyte sequence, like \x96?
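
A sketch of what the current behavior looks like next to a stricter walk. each_char_strict is a hypothetical helper written for this illustration; it is not an existing or proposed core method.

```ruby
s = "Hello, \x96 world!".b.force_encoding(Encoding::UTF_8)

# Current behavior: the stray byte comes back as a one-byte "character".
bad = s.each_char.reject(&:valid_encoding?)
p bad.map(&:bytes)   # [[150]]
p s[7].bytes         # [150]
p s[8]               # " "

# Hypothetical strict variant: raise instead of yielding the stray byte.
def each_char_strict(str)
  str.each_char do |c|
    unless c.valid_encoding?
      raise ArgumentError, "invalid byte sequence in #{str.encoding}"
    end
    yield c
  end
end

strict_error = begin
  each_char_strict(s) { |c| c }
  nil
rescue ArgumentError => e
  e
end
p strict_error.message
```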

>  > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
>  > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
>  > 	from -e:1:in `match'
>  > 	from -e:1:in `<main>'
>  
>  We'd need to dig in the code to figure out why it happens here.

Well, at the very least it would have to be using the encoding subsystem for Oniguruma/Onigmo to walk characters; that logic almost certainly rejects \x96.
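
For completeness, a sketch that captures the regexp failure instead of letting it abort the script. It only shows the observable behavior from the report and makes no claim about where in the regexp code the check lives.

```ruby
s = "Hello, \x96 world!".b.force_encoding(Encoding::UTF_8)

match_error = begin
  s.match(/.+/)
  nil
rescue ArgumentError => e
  e
end

# Unlike each_char and [], matching refuses the malformed string.
p match_error.class
puts match_error.message
```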
----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-32503

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.0.0
ruby -v: 2.0.0


On my system, where the default encoding is UTF-8, the following should not parse:

ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

Nor does character-walking:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!
system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

Nor does []:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[8]'
" "

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[8]'
" "

But the malformed String does get caught by transcoding to UTF-16:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
	from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
	from -e:1:in `<main>'

Or by doing a simple regexp match:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `match'
	from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
	from -e:1:in `match'
	from -e:1:in `<main>'

And of course I am ignoring the fact that it should never have parsed to begin with.

This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.

JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.


-- 
http://bugs.ruby-lang.org/