Gsub pattern substitution and #{...}

sackerson · May 11, 2009, 1:52am

I’m trying to escape a URI that is matched by a regular expression with
gsub.

In irb, here’s my string:

s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"

Now I want to match href="…" or href='…' and then URI.escape the
characters within the quotes.

require 'uri'
=> true

First I tried this:

s.gsub(/href=(['"])([^']*)/, 'href=\1#{URI.escape($2)}\3')
=> “”

Of course, that doesn’t work, since #{expr} is only evaluated
within a double quoted string.

But when it is double quoted, like this:

s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn’t evaluate to the first match anymore

I would guess there’s some basic string or regex syntax that I’m missing
here. I’ve looked at the gsub and string documentation, and either I
missed it or I should be looking elsewhere.

Can someone give me a clue and help me move forward with my mother’s day
hacking session?

Thanks in advance,
Sarah

sackerson · May 11, 2009, 2:13am

Sarah A. wrote:

Of course, that doesn’t work, since #{expr} is only evaluated
within a double quoted string.

But when it is double quoted, like this:

s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn’t evaluate to the first match anymore

I would guess there’s some basic string or regex syntax that I’m missing
here.

In double quoted strings escaped characters do not have literal
meanings. For instance, in a double quoted string “\n” is not two
characters–it is one character that represents a newline. Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something (but who knows what!).

On the other hand, with single quoted strings there are only a couple of
automatic substitutions that take place, and interpreting \1 is not one
of them. So with single quoted strings \1 means \1.

If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires an additional backslash
to escape the \ in "\1". So try "\\1". With ruby, if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
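
To make the difference concrete, here is a small sketch comparing the three spellings. The variable names, the sample text, and the `wrap_words` helper are made up for illustration and are not from the thread.

```ruby
single  = '\1'    # single quotes: a backslash and "1", two characters
double  = "\1"    # double quotes: an octal escape, one character
escaped = "\\1"   # escaped backslash: the same two characters as '\1'

puts single.length       # 2
puts double.length       # 1
puts escaped == single   # true

# Hypothetical helper: with the backslash preserved, gsub treats \1
# as a back-reference to the first capture group.
def wrap_words(text)
  text.gsub(/(\w+)/, "<\\1>")
end

puts wrap_words("a b")   # <a> <b>
```

Note that this only covers the back-reference; it does nothing about when an interpolated #{...} expression in the same string gets evaluated.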

sackerson · May 11, 2009, 2:18am

7stud – wrote:

If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires an additional backslash
to escape the \ in "\1". So try "\\1". With ruby, if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!

Eureka!

s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
=> “”

Thanks so much for your help.

Sarah

sackerson · May 11, 2009, 8:41am

On 11.05.2009 02:18, Sarah A. wrote:

=> “”

Thanks so much for your help.

That does not work as 7stud did not mention the most important point:
even with proper escaping this won’t work as the string interpolation
takes place before gsub is invoked and hence URI.escape will insert
something but not the matched portion. In your tests it has probably
worked because $2 was properly set from the previous match.
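
A minimal sketch of this pitfall, using made-up strings and a hypothetical helper name (`stale_replace`): the interpolated argument is built once, from whatever $2 held before the call, so every match receives the same stale text.

```ruby
def stale_replace(text)
  # An earlier, unrelated match leaves $2 set to "OLD".
  /(\w+)=(\w+)/ =~ "key=OLD"
  # "#{$2}" is evaluated right here, before gsub starts matching,
  # so gsub receives the finished replacement string \1=OLD.
  text.gsub(/(\w)=(\d)/, "\\1=#{$2}")
end

puts stale_replace("a=1 b=2")   # a=OLD b=OLD
```

The back-reference \1 still works, because gsub itself expands it for each match; only the interpolated part is frozen.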

In this case the block form of gsub is needed:

irb(main):007:0> s = “”
=> “”
irb(main):008:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> “”

irb(main):009:0> s = "<a href=\"http://foo.com/one=>two\"/>"
=> "<a href=\"http://foo.com/one=>two\"/>"
irb(main):010:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=%3Etwo\"/>"

And if quotes differ with my regexp no replacement takes place:

irb(main):011:0> s = "<a href=\"http://foo.com/one=>two'/>"
=> "<a href=\"http://foo.com/one=>two'/>"
irb(main):012:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=>two'/>"

Whether this is what you want is up to you, but AFAIK
mixing quote types is not allowed here, so we would probably rather not
do the replacement in that case.
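
The block form can be sketched as a self-contained script. URI.escape was removed in Ruby 3.0, so this sketch substitutes a hypothetical `escape_uri` helper with an illustrative safe-character set; the helper names and the safe set are assumptions, not part of the thread.

```ruby
# Hypothetical stand-in for URI.escape: percent-encodes every byte of
# any character outside a small, illustrative safe set.
def escape_uri(text)
  text.gsub(%r{[^A-Za-z0-9\-._~:/?&=]}) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

def escape_hrefs(html)
  # \1 in the regexp requires the closing quote to equal the opening one.
  html.gsub(/href=(["'])([^'"]+)\1/) do
    # The block runs after each match, so $1 and $2 are current here.
    quote, uri = $1, $2
    "href=#{quote}#{escape_uri(uri)}#{quote}"
  end
end

puts escape_hrefs(%q{<a href='http://foo.com/one=>two'/>})
# <a href='http://foo.com/one=%3Etwo'/>
puts escape_hrefs(%q{<a href="http://foo.com/one=>two'/>})
# printed unchanged, because the quotes do not match
```

Copying $1 and $2 into locals first is a defensive habit rather than a requirement here.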

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Kind regards

robert

sackerson · May 11, 2009, 10:23am

On Monday, 11 May 2009, 02:13:52, 7stud -- wrote:

Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something (but who knows what!).

The character with ASCII value 1.

sackerson · May 11, 2009, 2:38pm

Robert K. wrote:

even with proper escaping this won’t work as the string interpolation
takes place before gsub is invoked and hence URI.escape will insert
something but not the matched portion. In your tests it has probably
worked because $2 was properly set from the previous match.

Really? So the #{…} gets evaluated before the param is passed to gsub,
but the block is passed as code, so it is evaluated afterwards.

Using the previous attempt, with a fresh irb session, I can see the
issue:

$1
=> nil
$2
=> nil
$3
=> nil
require 'URI'
=> true
s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
NoMethodError: private method `gsub' called for nil:NilClass
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:289:in `escape'
from (irb):8

$1
=> nil
$2
=> nil
$3
=> nil
s = “”
=> “”
require 'URI'
=> true
s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> “”

Nice!

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn’t
deal with well, so I’ll not worry about it for this case, but I would be
curious how it might be handled.

Thanks so much,
Sarah

sackerson · May 11, 2009, 3:00pm

Robert K. wrote:

All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.

but the block is passed as code, so then it is evaluated after.

In the case of gsub the block is invoked once for each match.

These are really important details to understand. Thanks for pointing
them out.

Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"

Ah, of course. I knew that, but didn’t put it together.

I am so appreciative of the folks on this list.

Thank you 7stud, Sebastian & Robert!

Sarah

sackerson · May 11, 2009, 2:48pm

2009/5/11 Sarah A.:

Robert K. wrote:

even with proper escaping this won’t work as the string interpolation
takes place before gsub is invoked and hence URI.escape will insert
something but not the matched portion. In your tests it has probably
worked because $2 was properly set from the previous match.

Really? So the #{…} gets evaluated before the param is passed to gsub,

All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.

but the block is passed as code, so then it is evaluated after.

In the case of gsub the block is invoked once for each match.
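
A small sketch of that behaviour, with a hypothetical helper (`count_block_calls`) and made-up input: the counter shows the block running once per match, each time with the text that was just matched.

```ruby
def count_block_calls(text)
  calls = 0
  result = text.gsub(/\d/) do |digit|
    calls += 1              # executed once for every match
    (digit.to_i * 2).to_s   # the block's value replaces that match
  end
  [result, calls]
end

p count_block_calls("a1 b2 c3")   # ["a2 b4 c6", 3]
```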

s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
$3
=> nil
s = “”
=> “”
require 'URI'
=> true
s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> “”

Nice!

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn’t
deal with well, so I’ll not worry about it for this case, but I would be
curious how it might be handled.

Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"

Thanks so much,

You’re welcome.

Kind regards

robert
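
The alternation suggested above could be combined with the block form roughly as follows. This is only a sketch: `escape_uri` is a hypothetical stand-in for URI.escape (which was removed in Ruby 3.0) with an illustrative safe-character set, and `escape_hrefs_any_quote` is an invented name.

```ruby
# Hypothetical stand-in for URI.escape: percent-encodes every byte of
# any character outside a small, illustrative safe set.
def escape_uri(text)
  text.gsub(%r{[^A-Za-z0-9\-._~:/?&=]}) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

def escape_hrefs_any_quote(html)
  # One alternative per quote style; each one allows the other quote
  # character inside the URI. Only one of the two groups is set per match.
  html.gsub(/href=(?:'([^']+)'|"([^"]+)")/) do
    single, double = $1, $2
    if single
      "href='#{escape_uri(single)}'"
    else
      "href=\"#{escape_uri(double)}\""
    end
  end
end

puts escape_hrefs_any_quote(%q{<a href='http://foo.com/say="hi"'/>})
# <a href='http://foo.com/say=%22hi%22'/>
```

Because the two alternatives use separate capture groups, the block has to check which group matched before rebuilding the attribute.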