Forum: Ruby-core [ruby-trunk - Bug #7566][Open] Escape (\u{}) forms in Regexp literals

Posted by Brian Ford (brixen)
on 2012-12-15 02:11
(Received via mailing list)
Issue #7566 has been reported by brixen (Brian Ford).

----------------------------------------
Bug #7566: Escape (\u{}) forms in Regexp literals
https://bugs.ruby-lang.org/issues/7566

Author: brixen (Brian Ford)
Status: Open
Priority: Normal
Assignee:
Category:
Target version:
ruby -v: ruby 1.9.3p327 (2012-11-10 revision 37606) [x86_64-darwin10.8.0]


Why are \u{} escape sequences in Regexp literals not converted to bytes 
like they are in String literals?

https://gist.github.com/4290155
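
The gist itself is not reproduced in this archive. A minimal sketch of the behaviour being asked about, assuming Ruby 1.9 or later (the constant names are illustrative only):

```ruby
# A String literal converts \u{5d} to the character "]" (byte 0x5D).
STRING_LITERAL = "\u{5d}"

# A Regexp literal keeps the escape text in its source.
REGEXP_LITERAL = /\u{5d}/

p STRING_LITERAL              # "]"
p STRING_LITERAL.bytes.to_a   # [93]
p REGEXP_LITERAL.source       # "\\u{5d}" (six characters: \ u { 5 d })
p REGEXP_LITERAL =~ "]"       # 0, the pattern still matches the character
```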

Thanks,
Brian
Posted by drbrain (Eric Hodel) (Guest)
on 2012-12-15 02:54
(Received via mailing list)
Issue #7566 has been updated by drbrain (Eric Hodel).

Category set to core
Target version set to 2.0.0

Converting any of the regexp special characters could cause a syntax 
error or warning if the user tries to round-trip the regexp, so I think 
this is not a bug:

  $ ruby20 -ve 'p("\u{5d}", /[\u{5d}]/)'
  ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]
  "]"
  /[\u{5d}]/

  $ ruby20 -ve 'p(/[]]/)'
  ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]
  -e:1: warning: character class has ']' without escape: /[]]/
  /[]]/
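
The round-trip concern can be sketched roughly as follows. The constant names are made up for illustration, and CONVERTED_SOURCE only models what the source would look like if the escape were converted; it is not something Ruby produces.

```ruby
# Escaped form: the source text survives a round trip unchanged.
ESCAPED = /[\u{5d}]/
ROUND_TRIPPED = Regexp.new(ESCAPED.source)

p ESCAPED.source                          # "[\\u{5d}]"
p ROUND_TRIPPED.source == ESCAPED.source  # true
p "ab]c" =~ ROUND_TRIPPED                 # 2

# Hypothetical converted form: if the literal stored "]" instead of the
# escape, its source would be "[]]", the pattern that produces the
# "character class has ']' without escape" warning in the output above.
CONVERTED_SOURCE = "[" + "\u{5d}" + "]"
p CONVERTED_SOURCE                        # "[]]"
```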
Posted by Brian Ford (brixen)
on 2012-12-15 19:14
(Received via mailing list)
Issue #7566 has been updated by brixen (Brian Ford).


I'd argue that's a malformed Regexp and "round-tripping" shouldn't be 
expected to work.

sasha:rubinius brian$ irb
1.9.3p327 :001 > re = /[\\\u{5d}]/
 => /[\\\u{5d}]/
1.9.3p327 :002 > re2 = Regexp.new re
 => /[\\\u{5d}]/
1.9.3p327 :003 > re3 = Regexp.new re.source
 => /[\\\u{5d}]/
1.9.3p327 :004 > "ab]c" =~ re
 => 2
1.9.3p327 :005 > "ab]c" =~ re2
 => 2
1.9.3p327 :006 > "ab]c" =~ re3
 => 2

The source is stored with its escape sequences intact, and 7-bit clean 
source is encoded as US-ASCII even when it uses UTF escapes. The 
consequence is that the underlying Oniguruma data must be maintained 
separately and the string potentially unescaped on every match. At least, 
that is my best understanding of the MRI source code. AFAIK, this is not 
defined anywhere.
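
A small sketch of the encoding point, for anyone who wants to check it on their own interpreter. The constant names are illustrative, and the US-ASCII result for the 7-bit case is what the description above implies; it is an assumption about MRI rather than documented behaviour.

```ruby
ASCII_ESCAPE = /[\u{5d}]/     # the escape names a 7-bit code point
WIDE_ESCAPE  = /[\u{3042}]/   # the escape names a non-ASCII code point

p ASCII_ESCAPE.encoding         # US-ASCII on MRI, per the description above
p ASCII_ESCAPE.fixed_encoding?  # false
p WIDE_ESCAPE.encoding          # UTF-8
p WIDE_ESCAPE.fixed_encoding?   # true

# In both cases the stored source is the escape text, not the character.
p ASCII_ESCAPE.source           # "[\\u{5d}]"
p WIDE_ESCAPE.source            # "[\\u{3042}]"
```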

Thanks,
Brian
Posted by naruse (Yui NARUSE) (Guest)
on 2012-12-17 03:15
(Received via mailing list)
Issue #7566 has been updated by naruse (Yui NARUSE).

Status changed from Open to Rejected

Because Regexp literals are not String literals, and escapes in them 
have different meanings.
For example, \b is a word boundary in a Regexp but a backspace (0x08) in a String.
People need to distinguish a word boundary from a backspace, so \b must be 
shown as \b.
\uXXXX follows the same style.
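
A short sketch of the \b difference, with illustrative constant names:

```ruby
# In a String literal, \b is a single character: byte 0x08 (backspace).
BACKSPACE = "\b"
p BACKSPACE.bytes.to_a   # [8]

# In a Regexp literal, \b outside a character class is a zero-width
# word-boundary assertion, so there is no byte it could be converted to.
WORD = /\bcat\b/
p "a cat sat" =~ WORD    # 2
p "concat" =~ WORD       # nil
p WORD.source            # "\\bcat\\b"
```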
Posted by Brian Ford (brixen)
on 2012-12-17 03:40
(Received via mailing list)
Issue #7566 has been updated by brixen (Brian Ford).


Are you saying you can represent \b as a \u{} escape sequence in a 
Regexp?
Posted by naruse (Yui NARUSE) (Guest)
on 2012-12-17 03:50
(Received via mailing list)
Issue #7566 has been updated by naruse (Yui NARUSE).


brixen (Brian Ford) wrote:
> Are you saying you can represent \b as a \u{} escape sequence in a Regexp?

No.
(1) \b (word boundary), \s (whitespace) and so on cannot be expressed 
as bytes
(2) so escapes are not converted to bytes; they are kept as is
(3) \u{} is also an escape, so it is kept as is
Posted by Brian Ford (brixen)
on 2013-01-02 19:38
(Received via mailing list)
Issue #7566 has been updated by brixen (Brian Ford).


But as my example shows, if the bytes were in a literal String used to 
create the Regexp, they are already converted. And everything works just 
fine.

What's the rationale for not converting \u{}? Just because it is *an* 
escape sequence doesn't mean it is a *Regexp* escape sequence. Why are 
they treated the same? It creates an inconsistency between two Regexps 
that are otherwise identical, except that one came from a String or from 
a Regexp literal with interpolation.
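
A sketch of the inconsistency described here, using a non-special character (U+00E9) so that no character-class warning gets in the way. The constant names are illustrative only.

```ruby
# Literal with the escape written directly in the pattern.
FROM_LITERAL = /caf\u{e9}/

# Literal built by interpolating a String, where the escape has already
# been converted to the character itself.
FROM_INTERPOLATION = /caf#{"\u{e9}"}/

text = "un caf\u{e9}"
p text =~ FROM_LITERAL                # 3
p text =~ FROM_INTERPOLATION          # 3

# Same matches, different stored source, so the two are not ==.
p FROM_LITERAL.source                 # "caf\\u{e9}"
p FROM_INTERPOLATION.source.length    # 4 ("caf" plus the character U+00E9)
p FROM_LITERAL == FROM_INTERPOLATION  # false
```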
Posted by Matthew Kerwin (mattyk)
on 2013-01-02 21:42
(Received via mailing list)
Issue #7566 has been updated by phluid61 (Matthew Kerwin).


brixen (Brian Ford) wrote:
> But as my example shows, if the bytes were in a literal String used to create
> the Regexp, they are already converted. And everything works just fine.

No it doesn't.  There are no literal strings in your example.  The 
closest I can see is you extracting a source string from the Regexp, but 
I don't think that's doing what you think it is.

  irb(main):001:0> re = /[\\\u{5d}]/
  => /[\\\u{5d}]/
  irb(main):002:0> re.source
  => "[\\\\\\u{5d}]"

If you meant this:

  irb(main):003:0> s = "[\\\u{5d}]"
  => "[\\]]"
  irb(main):004:0> re2 = Regexp.new s
  => /[\]]/

You get an entirely different Regexp.  They will both match the string 
"ab]c" because they both include the ']' character in their character 
class. Incidentally:

  irb(main):005:0> re =~ "ab\\c"
  => 2
  irb(main):006:0> re2 =~ "ab\\c"
  => nil

> What's the rationale for not converting \u{}? Just because it is *an* escape
> sequence doesn't mean it is a *Regexp* escape sequence. Why are they treated the
> same?

They aren't.  If it helps, consider that _no_ Regexp escape sequences 
are treated the same as String escapes.

  \\ is a String literal escape sequence that is interpolated to the byte \x5C
  \\ is a Regexp literal escape sequence that instructs the engine to match the byte \x5C

  \u{} is a String literal escape sequence that is interpolated to a codepoint
  \u{} is a Regexp literal escape sequence that instructs the engine to match a codepoint

  \b is a String literal escape sequence that is interpolated to the byte \x08
  \b is a Regexp literal escape sequence that instructs the engine to match a word boundary
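
The \\ case is the one that explains the six backslashes in the inspected source above. A rough sketch, with illustrative constant names:

```ruby
# String literal: \\ is one backslash character (byte 0x5C).
p "\\".bytes.to_a            # [92]

# Regexp literal: \\ stays as two characters in the source and tells the
# engine to match one backslash.
BACKSLASH = /\\/
p BACKSLASH.source.length    # 2
p "ab\\c" =~ BACKSLASH       # 2

# Applied to the pattern discussed above:
CLASS_RE  = /[\\\u{5d}]/     # source is 10 characters: [ \ \ \ u { 5 d } ]
CLASS_STR = "[\\\u{5d}]"     # 4 characters: [ \ ] ]
p CLASS_RE.source.length     # 10
p CLASS_STR.length           # 4
```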

> It creates inconsistency between two identical Regexps except that one came from
> a String or Regexp literal with interpolation.

No, if the Regexps were identical they would be identical.  As you can 
see above, re and re2 are not identical, and no one should expect them 
to be.