Forum: Ruby gsub pattern substitution and ${...}

Sarah Allen (ultrasaurus)
on 2009-05-11 01:52
I'm trying to escape a URI that is matched by a regular expression with
gsub.

In irb, here's my string:
>> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"

Now I want to match href="..." or href='...' and then URI.escape the
characters within the quotes
>> require 'uri'
=> true

First I tried this:
>> s.gsub(/href=(['"])([^']*)/, 'href=\1#{URL.escape($2)}\3')
=> "<a href='\#{URL.escape($2)}'/>"

Of course, that doesn't work since #{expr} will only eval the expression
within a double quoted string.

But when it is double quoted, like this:

>> s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn't evaluate to the first match anymore

I would guess there's some basic string or regex syntax that I'm missing
here. I've looked at the gsub and string documentation, and either I
missed it or I should be looking elsewhere.

Can someone give me a clue and help me move forward with my mother's day
hacking session?

Thanks in advance,
Sarah
7stud -- (7stud)
on 2009-05-11 02:13
Sarah Allen wrote:
>
> Of course, that doesn't work since #{expr} will only eval the expression
> within a double quoted string.
>
> But when it is double quoted, like this:
>
>>> s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
> => "<a href=\001http://foo.com/one=%3Etwo\003'/>"
>
> \1 doesn't evaluate to the first match anymore
>
> I would guess there's some basic string or regex syntax that I'm missing
> here.
>

In double quoted strings escaped characters do not have literal
meanings.  For instance, in a double quoted string "\n" is not two
characters--it is one character that represents a newline.  Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something (but who knows what!).

On the other hand, with single quoted strings there are only a couple of
automatic substitutions that take place, and interpreting \1 is not one
of them.  So with single quoted strings \1 means \1.

If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires the use of additional backslashes
to escape the \ in "\1".  So try "\\1".  With ruby if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
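
The quoting rules above can be checked with a small sketch that uses only
core String methods. The sample strings and variable names are made up
for illustration.

```ruby
# Single quotes keep the backslash, so the string really contains \1.
single = 'href=\1'
# Double quotes turn \1 into one control character.
double = "href=\1"
# Doubling the backslash in double quotes gives a literal \1 again.
escaped = "href=\\1"

puts single.length      # 7
puts double.length      # 6
puts single == escaped  # true

# With a literal \1 and \2, gsub substitutes the capture groups.
result = "a=b".gsub(/(\w)=(\w)/, "\\2=\\1")
puts result             # b=a
```

So one extra backslash is enough here; the doubled form is only needed
because the double quoted string consumes the first one.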
Sarah Allen (ultrasaurus)
on 2009-05-11 02:18
7stud -- wrote:
> If you need to use double quoted strings, then you need to literally
> have \1 in your string, which requires the use of additional backslashes
> to escape the \ in "\1".  So try "\\1".  With ruby if one backslash is
> not enough, keep adding more backslashes until whatever you are trying
> to accomplish works!

Eureka!

>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
=> "<a href='http://foo.com/one=%3Etwo'/>"

Thanks so much for your help.

Sarah
Robert Klemme (Guest)
on 2009-05-11 08:41
(Received via mailing list)
On 11.05.2009 02:18, Sarah Allen wrote:
> => "<a href='http://foo.com/one=%3Etwo'/>"
>
> Thanks so much for your help.

That does not work as 7stud did not mention the most important point:
even with proper escaping this won't work as the string interpolation
takes place *before* gsub is invoked and hence URI.escape will insert
something but not the matched portion.  In your tests it has probably
worked because $2 was properly set from the previous match.
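
A minimal sketch of that pitfall, assuming it runs as a plain top-level
script. It uses upcase instead of URI.escape so the effect is easy to
see, and the "stale" string is invented for the demonstration.

```ruby
s = "<a href='http://foo.com/one=>two'/>"
pattern = /href=(['"])([^'"]*)/

# An earlier, unrelated match leaves $2 set to "stale".
"x='stale'" =~ /x=(')(\w+)/

# The #{...} part is interpolated now, before gsub runs, so it uses the
# stale $2 and not the URL that gsub is about to match.
with_string = s.gsub(pattern, "href=\\1#{$2.upcase}")

# The block runs after each match, so $2 is the URL just matched.
with_block = s.gsub(pattern) { "href=#{$1}#{$2.upcase}" }

puts with_string  # <a href='STALE'/>
puts with_block   # <a href='HTTP://FOO.COM/ONE=>TWO'/>
```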

In this case the block form of gsub is needed:

irb(main):007:0> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
irb(main):008:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=%3Etwo'/>"

irb(main):009:0> s = "<a href=\"http://foo.com/one=>two\"/>"
=> "<a href=\"http://foo.com/one=>two\"/>"
irb(main):010:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=%3Etwo\"/>"

And if quotes differ with my regexp no replacement takes place:

irb(main):011:0> s = "<a href=\"http://foo.com/one=>two'/>"
=> "<a href=\"http://foo.com/one=>two'/>"
irb(main):012:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=>two'/>"

Whether this is what you want is up to you, but AFAIK mixing quote
types is not allowed here, so we would probably rather not do the
replacement in that case.
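
URI.escape was removed in Ruby 3.0, so the transcripts above will not
run unchanged on a current Ruby. Here is a rough sketch of the same
block form that should. The helper escape_unsafe and its set of "safe"
characters are assumptions made for this example, not a reproduction of
what URI.escape did, and escape_hrefs is likewise a made-up name.

```ruby
# Hypothetical stand-in for URI.escape: percent-encode every byte of any
# character outside an assumed "safe" set.
def escape_unsafe(str)
  str.gsub(/[^A-Za-z0-9_.~:\/?=&%-]/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

def escape_hrefs(html)
  # \1 in the pattern requires the closing quote to equal the opening one.
  html.gsub(/href=(["'])([^'"]+)\1/) do
    quote, url = $1, $2   # copy the captures before calling anything else
    "href=#{quote}#{escape_unsafe(url)}#{quote}"
  end
end

puts escape_hrefs(%q{<a href='http://foo.com/one=>two'/>})
puts escape_hrefs(%q{<a href="http://foo.com/one=>two"/>})
# Mismatched quotes: no match, so the input comes back unchanged.
puts escape_hrefs(%q{<a href="http://foo.com/one=>two'/>})
```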

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI.  I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Kind regards

  robert
Sebastian Hungerecker (Guest)
on 2009-05-11 10:23
(Received via mailing list)
On Monday, 11 May 2009 02:13:52, 7stud -- wrote:
> Double
> quoted strings interpret all escaped characters, which means that \1
> gets interpreted into something (but who knows what!).

The character with ASCII value 1.
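
This is quick to confirm: a backslash followed by up to three octal
digits in a double quoted string is an octal escape. A short sketch
(the variable name is arbitrary):

```ruby
ctrl = "\1"
puts ctrl.length     # 1
puts ctrl.ord        # 1
puts ctrl == 1.chr   # true

# Octal 101 is decimal 65, which is "A".
puts "\101"          # A
```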
Sarah Allen (ultrasaurus)
on 2009-05-11 14:38
Robert Klemme wrote:
> even with proper escaping this won't work as the string interpolation
> takes place *before* gsub is invoked and hence URI.escape will insert
> something but not the matched portion.  In your tests it has probably
> worked because $2 was properly set from the previous match.

Really? So the #{...} gets evaluated before the param is passed to
gsub, but the block is passed as code, so it is evaluated afterwards.

Using the previous attempt, with a fresh irb session, I can see the
issue:
>> $1
=> nil
>> $2
=> nil
>> $3
=> nil
>> require 'uri'
=> true
>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
NoMethodError: private method `gsub' called for nil:NilClass
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:289:in `escape'
  from (irb):8

>> $1
=> nil
>> $2
=> nil
>> $3
=> nil
>> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
>> require 'uri'
=> true
>> s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=%3Etwo'/>"

Nice!

> Note that my regexp has another weakness: the quote character not used
> to quote the URI should be allowed as part of the URI.  I did not want
> to complicate things too much but if you want to deal with this the
> regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn't
deal with well, so I'll not worry about it for this case, but I would be
curious how it might be handled.

Thanks so much,
Sarah
Robert Klemme (Guest)
on 2009-05-11 14:48
(Received via mailing list)
2009/5/11 Sarah Allen <sarah@ultrasaurus.com>:
> Robert Klemme wrote:
>> even with proper escaping this won't work as the string interpolation
>> takes place *before* gsub is invoked and hence URI.escape will insert
>> something but not the matched portion.  In your tests it has probably
>> worked because $2 was properly set from the previous match.
>
> Really? So the #{...} gets evaluated before the param is passed to gsub,

All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.
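
A small sketch of that evaluation order. The methods note and receive
are invented for this example; they only record when they run.

```ruby
# Appends a label to the log and passes the value through.
def note(log, label, value)
  log << label
  value
end

# Logs the moment the method body actually starts executing.
def receive(log, arg)
  log << "inside method"
  arg
end

log = []
# The interpolated string is built first, then receive is entered.
receive(log, "built: #{note(log, 'interpolation', 42)}")
puts log.inspect  # ["interpolation", "inside method"]
```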

> but the block is passed as code, so then it is evaluated after.

In the case of gsub the block is invoked once for each match.
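
To illustrate, a minimal sketch with a made-up input string that counts
the block calls and shows $1 and $2 being refreshed for each match:

```ruby
text = "a1 b2 c3"
calls = []

result = text.gsub(/([a-z])(\d)/) do
  calls << [$1, $2]   # captures of the match currently being replaced
  "#{$2}#{$1}"        # the block's value becomes the replacement
end

puts result        # 1a 2b 3c
puts calls.length  # 3
```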

>>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
>>> $3
> => nil
>>> s = "<a href='http://foo.com/one=>two'/>"
> => "<a href='http://foo.com/one=>two'/>"
>>> require 'uri'
> => true
>>> s.gsub(/href=(["'])([^'"]+)\1/) {
> ?> "href=#$1#{URI.escape($2)}#$1" }
> => "<a href='http://foo.com/one=%3Etwo'/>"
>
> Nice!

:-)

>> Note that my regexp has another weakness: the quote character not used
>> to quote the URI should be allowed as part of the URI.  I did not want
>> to complicate things too much but if you want to deal with this the
>> regular expression must be made a bit more complex.
>
> Wow, interesting. That would be incorrect HTML that the browser doesn't
> deal with well, so I'll not worry about it for this case, but I would be
> curious how it might be handled.

Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"
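
One way that alternative might be wired into gsub, as a sketch only.
The method name escape_hrefs is made up, and the escaping is a
placeholder that handles nothing but ">"; a real escaper would take its
place.

```ruby
def escape_hrefs(html)
  # Group 1 is set for '...' (which may contain "), group 2 for "..."
  # (which may contain '). Only one of them is set per match.
  html.gsub(/href=(?:'([^']+)'|"([^"]+)")/) do
    url   = $1 || $2
    quote = $1 ? "'" : '"'
    # Placeholder escaping, assumed for the example.
    "href=#{quote}#{url.gsub('>', '%3E')}#{quote}"
  end
end

puts escape_hrefs(%q{<a href='http://foo.com/say="hi">x'/>})
puts escape_hrefs(%q{<a href="http://foo.com/it's>here"/>})
```

The captures are copied into locals before the inner gsub call because
that call performs its own match and would overwrite $1 and $2.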

> Thanks so much,

You're welcome.

Kind regards

robert
Sarah Allen (ultrasaurus)
on 2009-05-11 15:00
Robert Klemme wrote:
> All method parameters are evaluated before method invocation - this is
> true for every method invocation in Ruby.
>
>> but the block is passed as code, so then it is evaluated after.
>
> In the case of gsub the block is invoked once for each match.

These are really important details to understand.  Thanks for pointing
them out.

> Basically you need an alternative and more capturing groups along the
> lines of
>
> '([^']+)'|"([^"]+)"

Ah, of course.  I knew that, but didn't put it together.

I am so appreciative of the folks on this list.

Thank you 7stud, Sebastian & Robert!

Sarah