Bug in gsub(?)

giuan · September 25, 2010, 8:29pm

I have found this bug(?) in gsub

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\ \1')
=> \ \\ :\ {\ }\ =\ #\ ~ OK

but

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\\\1')
=> \1\1\1\1\1\1\1

Any idea?

giuan · September 26, 2010, 12:08am

Tiziano M. wrote:

I have found this bug(?) in gsub

http://www.catb.org/~esr/faqs/smart-questions.html#id382249

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\ \1')
=> \ \\ :\ {\ }\ =\ #\ ~ OK

but

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\\\1')
=> \1\1\1\1\1\1\1

Any idea?

puts "a".gsub(/a/, '\\\\') # i.e. the replacement string holds two backslashes
=> \

That is, in a replacement string, if you backslash-escape a backslash
you get a single backslash. That allows you to have literally \1 if
that’s what you need.

So a literal backslash is \\, and the first capture is \1

So what you want is \\\1, to get a backslash followed by the first
capture. However, that is represented in a string literal as '\\\\\\1'
(which generates a 4 character string) because a string literal also has
backslash escaping.

'\\\\\\1'.size
=> 4
puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\\\\\\1')
\\\:\{\}\=\#\~
=> nil

Take a suggestion from me: save your sanity and use the block form
instead

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/) { "\\#{$1}" }
\\\:\{\}\=\#\~
=> nil
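
The two layers of escaping described above can be sketched as follows. This is only an illustration: it assumes the backslashes in the posted examples were halved by the page rendering, and the variable names are made up for the sketch.

```ruby
# Layer 1: the Ruby string literal. Layer 2: gsub's replacement-string syntax.
input   = "\\:{}=#~"           # 7 characters: \ : { } = # ~
pattern = /([\\:~=\#\{\}])/    # \# \{ \} keep "#{}" from being read as interpolation

attempt = '\\\1'       # the string is \\1 (3 chars); gsub reads "\\" then a plain "1"
wanted  = '\\\\\\1'    # the string is \\\1 (4 chars); gsub reads "\\" then "\1"

wrong   = input.gsub(pattern, attempt)
right   = input.gsub(pattern, wanted)
blocked = input.gsub(pattern) { "\\#{$1}" }   # a block result is inserted verbatim

puts attempt.size      # 3
puts wanted.size       # 4
puts wrong             # \1\1\1\1\1\1\1
puts right             # \\\:\{\}\=\#\~
puts right == blocked  # true
```

The block form only goes through the string-literal layer, which is why it needs a single escaped backslash.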

giuan · September 26, 2010, 8:54am

Chad P. wrote:

I’ve wondered for quite a while what was the rationale for having \1 in
the first place.

Ruby inherits a lot from Perl, and Perl from sed.

Some of the Perlisms are IMO superfluous - in particular the Kernel
methods which operate on $_, and the flip-flop conditional operators.

Objects would be much tidier if they didn’t inherit Kernel#gets,
Kernel#gsub etc; and you’d avoid some confusing error messages like

irb(main):001:0> 3.gsub(/a/,'b')
NoMethodError: private method `gsub' called for 3:Fixnum

giuan · September 26, 2010, 1:34am

On Sun, Sep 26, 2010 at 07:08:13AM +0900, Brian C. wrote:

Take a suggestion from me: save your sanity and use the block form
instead

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/) { "\\#{$1}" }
\\\:\{\}\=\#\~
=> nil

I’ve wondered for quite a while what was the rationale for having \1 in
the first place.

giuan · September 26, 2010, 8:57am

Brian C. wrote:

That is, in a replacement string, if you backslash-escape a backslash
you get a single backslash. That allows you to have literally \1 if
that’s what you need.

So a literal backslash is \\, and the first capture is \1

So what you want is \\\1, to get a backslash followed by the first
capture. However, that is represented in a string literal as '\\\\\\1'
(which generates a 4 character string) because a string literal also has
backslash escaping.

'\\\\\\1'.size
=> 4

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/, '\\\\\\1')
\\\:\{\}\=\#\~
=> nil

Take a suggestion from me: save your sanity and use the block form
instead

puts "\\:{}=#~".gsub(/([\\:~=\#\{\}])/) { "\\#{$1}" }
\\\:\{\}\=\#\~
=> nil

Thanks Brian!
I know the block form.
So the problem is the backslash escape in the string literal:
'\\\1' == '\\\\1' => true
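
A small sketch of why those two literals compare equal, assuming the comparison was between a three-backslash and a four-backslash literal (the variable names are illustrative only):

```ruby
# Single-quoted literals only treat \\ and \' as escapes; \1 is kept as typed.
three_typed = '\\\1'    # \\ -> \   then \1 stays \1
four_typed  = '\\\\1'   # \\ -> \   \\ -> \   then 1

p three_typed.chars          # ["\\", "\\", "1"]
p three_typed == four_typed  # true

# Double quotes differ: "\1" there is an octal escape for the character U+0001.
double_quoted = "\\\1"
p double_quoted.size         # 2
```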

giuan · September 27, 2010, 3:40am

On Sun, Sep 26, 2010 at 03:54:35PM +0900, Brian C. wrote:

Chad P. wrote:

I’ve wondered for quite a while what was the rationale for having \1 in
the first place.

Ruby inherits a lot from Perl, and Perl from sed.

Okay . . . I guess that sorta makes sense. Of course, I’ve never used
\1 in Perl, nor seen anyone else do so, so until you mentioned it I
had entirely forgotten that it was an option there.

Both languages would be better off without that syntax, and just stick
with $1 instead, I think.

Some of the Perlisms are IMO superfluous - in particular the Kernel
methods which operate on $_, and the flip-flop conditional operators.

I wouldn’t really call \1 a “Perlism”, given that the way I’ve always
seen it done is with $1 instead. If it’s a Perlism despite its lack of
general usage, I’d say it’s every bit as much a Rubyism.

giuan · September 27, 2010, 4:04am

On Sep 26, 2010, at 9:40 PM, Chad P. wrote:

If it’s a Perlism despite its lack of general usage, I’d say it’s every
bit as much a Rubyism.

There are times in Perl when you need to use \1 in the matching part of
a regular expression because you don’t want $1 to interpolate into the
match.

Consider trying to match a simple quoted string (i.e. no \ escaping):

my $s1 = "Hello there";
my $s2 = q{The cat said "Hello there, how's it going?"};

if ($s1 =~ m/(ell)/) {
    print "print s1 matched - \$1 is '$1'\n";
}

if ($s2 =~ m/(["'])(.*?)\1/) {
    print "print s2 matched - \$2 is '$2'\n";
}

This outputs:

print s1 matched - $1 is 'ell'
print s2 matched - $2 is 'Hello there, how's it going?'

If you try using $1 in place of \1 in the second regex then it will
output

print s1 matched - $1 is 'ell'
print s2 matched - $2 is 'H'
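
The same distinction exists in Ruby, where \1 inside a pattern is a backreference while an interpolated $1 is whatever the previous match left behind. A rough Ruby sketch of the Perl example above (the variable names are invented for the illustration):

```ruby
s1 = "Hello there"
s2 = %q{The cat said "Hello there, how's it going?"}

s1 =~ /(ell)/
stale = $1    # "ell", left over from the match on s1

# \1 refers to whatever group 1 captured in THIS match (the opening quote).
backref = s2[/(["'])(.*?)\1/, 2]

# Interpolation pastes the old capture into the pattern before matching starts.
interp = s2[/(["'])(.*?)#{Regexp.escape(stale)}/, 2]

puts backref   # Hello there, how's it going?
puts interp    # H
```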

Mike

–

Mike S.
http://www.stok.ca/~mike/

The “`Stok’ disclaimers” apply.

giuan · September 27, 2010, 11:09am

On Mon, Sep 27, 2010 at 10:30 AM, Brian C. wrote:

It’s odd that ruby strives to be so perl-compatible in areas like this,
but is different in far more important areas (e.g. ^ matching newlines
within a string, not just the start of string)

Absolutely, there are a few gotchas:

http://www.advogato.org/person/fxn/diary/498.html

I don’t know why it is that way, but I find them surprising.

giuan · September 27, 2010, 10:29am

Chad P. wrote:

I wouldn’t really call \1 a “Perlism”, given that the way I’ve always
seen it done is with $1 instead.

I called \1 a perlism mainly because it’s a sedism that perl inherited.
You’re right that in Perl you could instead write:

$str =~ s/(.)/$1$1/;

Of course, that doesn’t work in Ruby without using the block form:

str.sub(/(.)/, "$1$1")        # no!
str.sub(/(.)/, "#{$1}#{$1}")  # no!!
str.sub(/(.)/) {"#{$1}#{$1}"} # ok

in which case you could either argue that ruby needs sed’s \1 more than
perl does, or you could argue that ruby doesn’t need it at all.
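
To make the difference between those three forms concrete, here is a hedged sketch. The strings and variable names are made up; the point is only that a replacement argument is built before sub runs, while a block runs after the match.

```ruby
str = "abc"
"zzz" =~ /(z)/    # an unrelated earlier match leaves $1 == "z"

# The argument is evaluated BEFORE sub runs, so it sees the old $1.
stale   = str.sub(/(.)/, "#{$1}#{$1}")

# "$1" has no special meaning in a Ruby string or in a replacement string.
dollar  = str.sub(/(.)/, "$1$1")

# \1 is interpreted by sub itself; the block runs after the match sets $1.
backref = str.sub(/(.)/, '\1\1')
block   = str.sub(/(.)/) { "#{$1}#{$1}" }

puts stale     # zzbc
puts dollar    # $1$1bc
puts backref   # aabc
puts block     # aabc
```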

It’s odd that ruby strives to be so perl-compatible in areas like this,
but is different in far more important areas (e.g. ^ matching newlines
within a string, not just the start of string)
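
A short illustration of the anchor difference mentioned above, using an invented two-line string: in Ruby ^ matches at the start of every line, and \A is the anchor for the start of the string.

```ruby
text = "first\nsecond"

caret  = text.scan(/^\w+/)    # ^ matches after each newline as well
anchor = text.scan(/\A\w+/)   # \A matches only at the very start

p caret    # ["first", "second"]
p anchor   # ["first"]
```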

Regards,

Brian.