Issue #7282 has been reported by headius (Charles Nutter).

----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282

Author: headius (Charles Nutter)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: 2.0.0

On my system, where the default encoding is UTF-8, the following should not parse:

ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

Nor does character-walking:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

Nor does []:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[8]'
" "

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[8]'
" "

But the malformed String does get caught by transcoding to UTF-16:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
from -e:1:in `<main>'

Or by doing a simple regexp match:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `match'
from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
from -e:1:in `match'
from -e:1:in `<main>'

And of course I am ignoring the fact that it should never have parsed to begin with.

This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.

JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.
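For readers who want to inspect such a string directly, the following is a minimal sketch rather than anything taken from the report. It assumes the literal should be treated as UTF-8 and forces that encoding explicitly so the result does not depend on the source-file encoding; the variable names are illustrative only.

```ruby
# Build the string from the report and tag it as UTF-8 explicitly
# (assumption: this mirrors a UTF-8 default source encoding).
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

encoding_name = s.encoding.name   # the encoding tag carried by the string
valid = s.valid_encoding?         # scans the bytes; 0x96 alone is not valid UTF-8
byte_at_7 = s.bytes[7]            # raw byte at offset 7, after "Hello, "

puts encoding_name        # UTF-8
puts valid                # false
puts byte_at_7.to_s(16)   # 96
```

`String#valid_encoding?` is the explicit check; the operations shown in the report only notice the bad byte when they actually have to decode it.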
on 2012-11-06 03:52
on 2012-11-06 06:58
Issue #7282 has been updated by usa (Usaku NAKAMURA).

Category set to M17N
Status changed from Open to Assigned
Assignee set to naruse (Yui NARUSE)
Target version set to 2.0.0

----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-32469
on 2012-11-06 07:07
Hello Charles,

On 2012/11/06 11:51, headius (Charles Nutter) wrote:

> Assignee:
> Category:
> Target version:
> ruby -v: 2.0.0
>
>
> On my system, where the default encoding is UTF-8, the following should not parse:
>
> ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

It doesn't. It should be
ruby-2.0.0 -e 'p "Hello, \x96 world!"}'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!\"}"'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!"'
or some such. But apart from that, you are right.

I'm no longer sure, but I think at some point, there was an argument to allow \x in UTF-8 literals, and a reason to not check. But I can't remember what, and if we can't remember, then we'd better make it check.

> But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
> "{\"sample\": \"Hello, \x96 world!\"}"

Encoding to the encoding you're already in is a no-op. See also https://bugs.ruby-lang.org/issues/6321.

> Nor does character-walking:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
> Hello, ? world!
>
> Nor does []:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
> "\x96"

The underlying machinery is the same.

> But the malformed String does get caught by transcoding to UTF-16:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
> -e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
> from -e:1:in `<main>'

Yes, here you're actually transcoding, so this is checked.

> Or by doing a simple regexp match:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> from -e:1:in `match'
> from -e:1:in `<main>'

We'd need to dig in the code to figure out why it happens here.

> And of course I am ignoring the fact that it should never have parsed to begin with.
>
> This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
>
> JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.

Overall, the idea (I think) is to strike a balance between efficiency and correctness. But checking at parsing time would probably be rather efficient at avoiding errors.

Regards, Martin.
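The difference between the no-op and the real transcoding described above can be sketched as follows. This is an illustration under the assumption that the string is tagged UTF-8 (forced explicitly here); the variable names are hypothetical and not from the thread.

```ruby
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# Same source and destination encoding, no options: no transcoder runs,
# so the bytes are handed back without being checked.
same = s.encode("UTF-8")
unchanged = (same.bytes == s.bytes)

# A real conversion has to decode each character, so the stray 0x96 is noticed.
transcode_error = begin
  s.encode("UTF-16")
  nil
rescue Encoding::InvalidByteSequenceError => e
  e
end

puts unchanged
puts transcode_error.class
```

Code that needs a guaranteed check should therefore not rely on a same-encoding `encode` call; asking the string directly with `valid_encoding?` is the more direct route.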
on 2012-11-06 17:40
Issue #7282 has been updated by headius (Charles Nutter).

duerst (Martin Dürst) wrote:
> or some such. But apart from that, you are right.

Yeah, sorry... I guess I was rushed filing this issue. The last one is what I was going for.

> I'm no longer sure, but I think at some point, there was an argument to
> allow \x in UTF-8 literals, and a reason to not check. But I can't
> remember what, and if we can't remember, then we'd better make it check.

Yes, it seems like either this string should be forced to ASCII-8BIT, or else it shouldn't be allowed to parse in the first place. It definitely should not parse *and* be marked as valid UTF-8.

> > But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:
> >
> > system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
> > "{\"sample\": \"Hello, \x96 world!\"}"
>
> Encoding to the encoding you're already in is a no-op. See also
> https://bugs.ruby-lang.org/issues/6321.

Thank you. I suspected as much and will make changes to JRuby (and RubySpec if needed). JRuby was always doing the transcoding, so it blew up here attempting to walk UTF-8 characters.

> The underlying machinery is the same.

Makes sense. JRuby also allows these cases through. Perhaps both cases should fail once they encounter a non-7bit, non-surrogate byte like \x96?

> > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> > from -e:1:in `match'
> > from -e:1:in `<main>'
>
> We'd need to dig in the code to figure out why it happens here.

Well, at the very least it would have to be using the encoding subsystem for Oniguruma/Onigmo to walk characters; that logic almost certainly rejects \x96.

----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-32503
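The "forced to ASCII-8BIT" alternative mentioned above can be illustrated with a small sketch. It only shows how the same bytes are judged under the two encoding tags; it is not a statement about what the parser does, and the variable names are illustrative.

```ruby
# The same bytes under two different encoding tags.
utf8   = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)
binary = utf8.dup.force_encoding(Encoding::ASCII_8BIT)

utf8_valid   = utf8.valid_encoding?     # 0x96 is a lone continuation byte in UTF-8
binary_valid = binary.valid_encoding?   # any byte sequence is valid as binary
same_bytes   = (utf8.bytes == binary.bytes)

puts utf8_valid
puts binary_valid
puts same_bytes
```

Retagging changes no bytes; it only changes how later character-level operations interpret them.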
on 2013-02-18 12:54
Issue #7282 has been updated by naruse (Yui NARUSE).

Tracker changed from Bug to RubySpec

headius (Charles Nutter) wrote:
> > The underlying machinery is the same.
>
> Makes sense. JRuby also allows these cases through. Perhaps both cases should fail once they encounter a non-7bit, non-surrogate byte like \x96?

On string index access, Ruby doesn't raise an error even if the string contains an invalid byte sequence.

> > > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> > > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> > > from -e:1:in `match'
> > > from -e:1:in `<main>'
> >
> > We'd need to dig in the code to figure out why it happens here.
>
> Well, at the very least it would have to be using the encoding subsystem for Oniguruma/Onigmo to walk characters; that logic almost certainly rejects \x96.

On regexp match, Ruby raises an error.

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36498
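The two behaviours described in this update can be seen side by side in a short sketch. It assumes a UTF-8-tagged string (forced explicitly here) and uses illustrative variable names; the exact error message is deliberately not checked.

```ruby
s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# Index access hands back the stray byte as a one-byte string, without raising.
indexed = s[7]
indexed_bytes = indexed.bytes

# A regexp match refuses to run against a string with a broken byte sequence.
match_error = begin
  s.match(/.+/)
  nil
rescue ArgumentError => e
  e
end

p indexed_bytes
puts match_error.class
```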
on 2013-02-24 13:23
Issue #7282 has been updated by ko1 (Koichi Sasada).

Target version changed from 2.0.0 to 2.1.0

naruse-san, what is the status of this ticket?

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36907
on 2013-02-24 13:28
Issue #7282 has been updated by naruse (Yui NARUSE).

ko1 (Koichi Sasada) wrote:
> naruse-san, what is the status of this ticket?

I don't understand what the current problem in this ticket is. If headius still has an issue, could you summarize it? If there is none, this can be closed.

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36916
on 2013-03-09 09:11
Issue #7282 has been updated by headius (Charles Nutter).

A couple of quick tests seem to work OK in 2.0.0. If all my original cases from the report work properly (i.e. fail properly), then this one is fixed. I have not confirmed all scenarios yet.

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-37415
on 2013-03-13 03:16
Issue #7282 has been updated by naruse (Yui NARUSE).

Status changed from Assigned to Closed

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-37554