Forum: Ruby-core [ruby-trunk - Bug #7282][Open] Invalid UTF-8 from emoji allowed through silently

Posted by Charles Nutter (headius)
on 2012-11-06 03:52
(Received via mailing list)
Issue #7282 has been reported by headius (Charles Nutter).

----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282

Author: headius (Charles Nutter)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: 2.0.0


On my system, where the default encoding is UTF-8, the following should 
not parse:

ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

But it does. And it is apparently marked as "ok" as far as code range 
goes, because encoding to UTF-8 does not catch the problem:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

Nor does character-walking:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
Hello, ? world!

Nor does []:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-1.9.3 -e 'p "Hello, \x96 world!"[8]'
" "

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
"\x96"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[8]'
" "

But the malformed String does get caught by transcoding to UTF-16:

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
  from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
  from -e:1:in `<main>'

Or by doing a simple regexp match:

system ~/projects/jruby $ ruby-1.9.3 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
  from -e:1:in `match'
  from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
-e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
  from -e:1:in `match'
  from -e:1:in `<main>'

And of course I am ignoring the fact that it should never have parsed to 
begin with.

This kind of inconsistency in rejecting malformed UTF-8 does not inspire 
a lot of confidence.

JRuby allows it through the parser (this is a bug) but does fail in 
other places because the string is malformed.
Posted by usa (Usaku NAKAMURA) (Guest)
on 2012-11-06 06:58
(Received via mailing list)
Issue #7282 has been updated by usa (Usaku NAKAMURA).

Category set to M17N
Status changed from Open to Assigned
Assignee set to naruse (Yui NARUSE)
Target version set to 2.0.0


----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-32469

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.0.0
ruby -v: 2.0.0


Posted by "Martin J. Dürst" <duerst@it.aoyama.ac.jp> (Guest)
on 2012-11-06 07:07
(Received via mailing list)
Hello Charles,

On 2012/11/06 11:51, headius (Charles Nutter) wrote:
> Assignee:
> Category:
> Target version:
> ruby -v: 2.0.0
>
>
> On my system, where the default encoding is UTF-8, the following should not parse:
>
> ruby-2.0.0 -e 'p "Hello, \x96 world!\"}'

It doesn't. It should be
ruby-2.0.0 -e 'p "Hello, \x96 world!"}'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!\"}"'
or
ruby-2.0.0 -e 'p "Hello, \x96 world!"'
or some such. But apart from that, you are right.

I'm no longer sure, but I think at some point, there was an argument to
allow \x in UTF-8 literals, and a reason to not check. But I can't
remember what, and if we can't remember, then we'd better make it check.

> But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
> "{\"sample\": \"Hello, \x96 world!\"}"

Encoding to the encoding you're already in is a no-op. See also
https://bugs.ruby-lang.org/issues/6321.
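
To make the no-op concrete, here is a minimal sketch (not from the thread) that contrasts a same-encoding `encode` with a real transcode. The helper name `transcode_check` is made up for illustration, and the string is explicitly tagged as UTF-8 so the example does not depend on the source encoding. `String#valid_encoding?` is the direct way to ask whether the bytes are well-formed.

```ruby
# Tag the bytes as UTF-8 explicitly (assumption: this mirrors a UTF-8 source file).
BROKEN_SAMPLE = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

# Hypothetical helper: returns :ok, or the exception class raised
# while encoding +str+ to +target+.
def transcode_check(str, target)
  str.encode(target)
  :ok
rescue EncodingError => e
  e.class
end

p BROKEN_SAMPLE.valid_encoding?             # the direct validity check
p transcode_check(BROKEN_SAMPLE, "UTF-8")   # same encoding: nothing is converted
p transcode_check(BROKEN_SAMPLE, "UTF-16")  # real transcoding: bytes get decoded
```

So code that wants to reject malformed input should not rely on `encode` to the current encoding; asking `valid_encoding?` is the cheaper and more explicit test.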

> Nor does character-walking:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".each_char {|x| print x}'
> Hello, ? world!
>
> Nor does []:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "Hello, \x96 world!"[7]'
> "\x96"

The underlying machinery is the same.
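
A small sketch of what this looks like from user code: when Ruby walks a UTF-8 string and hits a byte that cannot start a character, it hands that byte back as a one-byte chunk and moves on instead of raising. The helper name `invalid_chunks` is hypothetical.

```ruby
# Hypothetical helper: lists [index, chunk] for every chunk that each_char
# yields but that is not a well-formed character in the string's encoding.
def invalid_chunks(str)
  str.each_char.with_index
     .reject { |ch, _i| ch.valid_encoding? }
     .map { |ch, i| [i, ch] }
end

s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)
p invalid_chunks(s)  # character index and raw byte of each bad chunk
p s[7].bytes         # String#[] returns the same single byte
```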

> But the malformed String does get caught by transcoding to UTF-16:

> system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
> -e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
>   from -e:1:in `<main>'

Yes, here you're actually transcoding, so this is checked.


> Or by doing a simple regexp match:

> system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
>   from -e:1:in `match'
>   from -e:1:in `<main>'

We'd need to dig in the code to figure out why it happens here.
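
Whatever the internal reason, callers can avoid the ArgumentError by checking validity before matching. The sketch below is an illustration under stated assumptions, not something proposed in this thread: `safe_match` is a made-up name, and `String#scrub` only exists in Ruby 2.1 and later, so it would not run on the 1.9.3 and 2.0.0 builds discussed here.

```ruby
# Hypothetical helper: match only after replacing malformed bytes.
# String#scrub (Ruby 2.1+) swaps each invalid sequence for U+FFFD.
def safe_match(str, pattern)
  str = str.scrub unless str.valid_encoding?
  str.match(pattern)
end

s = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)

begin
  s.match(/.+/)
rescue ArgumentError => e
  puts "raw match failed: #{e.message}"
end

m = safe_match(s, /.+/)
puts m[0].valid_encoding?  # the scrubbed text is well-formed
```

Scrubbing changes the data, so it suits display or logging paths; code that must preserve the original bytes would more likely reject the input outright.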

> And of course I am ignoring the fact that it should never have parsed to begin with.
>
> This kind of inconsistency in rejecting malformed UTF-8 does not inspire a lot of confidence.
>
> JRuby allows it through the parser (this is a bug) but does fail in other places because the string is malformed.

Overall, the idea (I think) is to strike a balance between efficiency and
correctness. But checking at parsing time would probably be rather
efficient at avoiding errors.

Regards,    Martin.
Posted by Charles Nutter (headius)
on 2012-11-06 17:40
(Received via mailing list)
Issue #7282 has been updated by headius (Charles Nutter).


duerst (Martin Dürst) wrote:
>  or some such. But apart from that, you are right.
Yeah sorry...I guess I was rushed filing this issue. The last one is 
what I was going for.

>  I'm no longer sure, but I think at some point, there was an argument to
>  allow \x in UTF-8 literals, and a reason to not check. But I can't
>  remember what, and if we can't remember, then we'd better make it check.

Yes, it seems like either this string should be forced to ASCII-8BIT, or 
else it shouldn't be allowed to parse in the first place. It definitely 
should not parse *and* be marked as valid UTF-8.
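
The first option can be illustrated from user code: re-tagging the same bytes as binary makes the string trivially valid, since every byte is a character in ASCII-8BIT. This sketch only shows the effect of the encoding tag; it says nothing about what the parser does or should do. `tag_and_validity` is a made-up helper name, and `String#b` is available from Ruby 2.0.

```ruby
# Hypothetical helper: the encoding tag and validity of a string.
def tag_and_validity(str)
  [str.encoding, str.valid_encoding?]
end

s   = "Hello, \x96 world!".dup.force_encoding(Encoding::UTF_8)
bin = s.b  # copy of the same bytes, tagged ASCII-8BIT (Encoding::BINARY)

p tag_and_validity(s)    # UTF-8 view: malformed
p tag_and_validity(bin)  # binary view: every byte is valid
p bin.bytes == s.bytes   # only the tag differs, not the bytes
```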

>  > But it does. And it is apparently marked as "ok" as far as code range goes, because encoding to UTF-8 does not catch the problem:
>
>  > system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
>  > "{\"sample\": \"Hello, \x96 world!\"}"
>
>  Encoding to the encoding you're already in is a no-op. See also
>  https://bugs.ruby-lang.org/issues/6321.

Thank you. I suspected as much and will make changes to JRuby (and 
RubySpec if needed). JRuby was always doing the transcoding, so it blew 
up here attempting to walk UTF-8 characters.

>  The underlying machinery is the same.
Makes sense. JRuby also allows these cases through. Perhaps both cases 
should fail once they encounter a non-7bit, non-surrogate byte like 
\x96?
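
For reference, \x96 is a UTF-8 continuation byte (bit pattern 10xxxxxx) that appears here with no lead byte in front of it, which is why the sequence is malformed. A quick check, with `utf8_byte_kind` as a made-up helper name; it classifies by the high bits only and does not exclude lead values that are never valid, such as 0xC0, 0xC1, or 0xF5 and above:

```ruby
# Hypothetical helper: classify a single byte by its UTF-8 bit pattern.
def utf8_byte_kind(byte)
  if    byte < 0x80           then :ascii         # 0xxxxxxx
  elsif (byte & 0xC0) == 0x80 then :continuation  # 10xxxxxx
  elsif (byte & 0xE0) == 0xC0 then :lead_2        # 110xxxxx
  elsif (byte & 0xF0) == 0xE0 then :lead_3        # 1110xxxx
  elsif (byte & 0xF8) == 0xF0 then :lead_4        # 11110xxx
  else                             :invalid
  end
end

p utf8_byte_kind(0x96)                             # the byte from the report
p "\u00E9".bytes.map { |b| utf8_byte_kind(b) }     # a well-formed two-byte character
```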

>  > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
>  > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
>  >   from -e:1:in `match'
>  >   from -e:1:in `<main>'
>
>  We'd need to dig in the code to figure out why it happens here.

Well, at the very least it would have to be using the encoding subsystem 
for Oniguruma/Onigmo to walk characters; that logic almost certainly 
rejects \x96.
----------------------------------------
Bug #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-32503

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.0.0
ruby -v: 2.0.0


Posted by naruse (Yui NARUSE) (Guest)
on 2013-02-18 12:54
(Received via mailing list)
Issue #7282 has been updated by naruse (Yui NARUSE).

Tracker changed from Bug to RubySpec

headius (Charles Nutter) wrote:
> >
> >  The underlying machinery is the same.
>
> Makes sense. JRuby also allows these cases through. Perhaps both cases should fail once they encounter a non-7bit, non-surrogate byte like \x96?

On string index access, Ruby doesn't raise an error even if the string 
contains an invalid byte sequence.

> >  > system ~/projects/jruby $ ruby-2.0.0 -e '"Hello, \x96 world!".match /.+/'
> >  > -e:1:in `match': invalid byte sequence in UTF-8 (ArgumentError)
> >  >   from -e:1:in `match'
> >  >   from -e:1:in `<main>'
> >
> >  We'd need to dig in the code to figure out why it happens here.
>
> Well, at the very least it would have to be using the encoding subsystem for Oniguruma/Onigmo to walk characters; that logic almost certainly rejects \x96.

On regexp match, Ruby raises an error.
----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36498

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.0.0


Posted by ko1 (Koichi Sasada) (Guest)
on 2013-02-24 13:23
(Received via mailing list)
Issue #7282 has been updated by ko1 (Koichi Sasada).

Target version changed from 2.0.0 to 2.1.0

naruse-san, what is the status of this ticket?

----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36907

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.1.0


Posted by naruse (Yui NARUSE) (Guest)
on 2013-02-24 13:28
(Received via mailing list)
Issue #7282 has been updated by naruse (Yui NARUSE).


ko1 (Koichi Sasada) wrote:
> naruse-san, what is the status of this ticket?

I don't understand what the current problem with this ticket is.
If headius still has an issue, could you summarize it?
If not, please close this.
----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-36916

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: 2.1.0


Posted by Charles Nutter (headius)
on 2013-03-09 09:11
(Received via mailing list)
Issue #7282 has been updated by headius (Charles Nutter).


A couple quick tests seem to work ok in 2.0.0. If all my original cases 
from the report work properly (i.e. fail properly) then this one is 
fixed. I have not confirmed all scenarios yet.
----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-37415

Author: headius (Charles Nutter)
Status: Assigned
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: current: 2.1.0


Posted by naruse (Yui NARUSE) (Guest)
on 2013-03-13 03:16
(Received via mailing list)
Issue #7282 has been updated by naruse (Yui NARUSE).

Status changed from Assigned to Closed


----------------------------------------
RubySpec #7282: Invalid UTF-8 from emoji allowed through silently
https://bugs.ruby-lang.org/issues/7282#change-37554

Author: headius (Charles Nutter)
Status: Closed
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: M17N
Target version: current: 2.1.0

