Forum: Ruby gsub pattern substitution and #{...}

Sarah A. (Guest)
on 2009-05-11 03:52
I'm trying to escape a URI that is matched by a regular expression with
gsub.

In irb, here's my string:
>> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"

Now I want to match href="..." or href='...' and then URI.escape the
characters within the quotes:
>> require 'uri'
=> true

First I tried this:
>> s.gsub(/href=(['"])([^']*)/, 'href=\1#{URI.escape($2)}\3')
=> "<a href='\#{URI.escape($2)}'/>"

Of course, that doesn't work since #{expr} will only eval the expression
within a double quoted string.

But when it is double quoted, like this:

>> s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn't evaluate to the first match anymore

I would guess there's some basic string or regex syntax that I'm missing
here. I've looked at the gsub and string documentation, and either I
missed it or I should be looking elsewhere.

Can someone give me a clue and help me move forward with my Mother's Day
hacking session?

Thanks in advance,
Sarah
7stud -. (Guest)
on 2009-05-11 04:13
Sarah A. wrote:
>
> Of course, that doesn't work since #{expr} will only eval the expression
> within a double quoted string.
>
> But when it is double quoted, like this:
>
>>> s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
> => "<a href=\001http://foo.com/one=%3Etwo\003'/>"
>
> \1 doesn't evaluate to the first match anymore
>
> I would guess there's some basic string or regex syntax that I'm missing
> here.
>

In double quoted strings escaped characters do not have literal
meanings.  For instance, in a double quoted string "\n" is not two
characters--it is one character that represents a newline.  Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something (but who knows what!).

On the other hand, with single quoted strings there are only a couple of
automatic substitutions that take place, and interpreting \1 is not one
of them.  So with single quoted strings \1 means \1.

If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires the use of additional backslashes
to escape the \ in "\1".  So try "\\1".  With ruby if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
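
As a rough illustration of the difference (a minimal sketch, not something
from the original session; the tag_href helper, the sample string and the
HREF= replacement are made up for the example):

```ruby
# How Ruby reads \1 under each quoting style.
single  = '\1'   # backslash followed by "1": two characters
double  = "\1"   # an escape sequence: one single character
escaped = "\\1"  # escaped backslash followed by "1": two characters again

puts single.length        # 2
puts double.length        # 1
puts single == escaped    # true

# Hypothetical helper, for illustration only: gsub treats the
# two-character form as a backreference to the first group.
def tag_href(html, replacement)
  html.gsub(/href=(['"])/, replacement)
end

sample = "<a href='x'/>"
puts tag_href(sample, 'HREF=\1')    # <a HREF='x'/>
puts tag_href(sample, "HREF=\\1")   # <a HREF='x'/>
```

So '\1' and "\\1" are the same two-character string by the time gsub sees
them; which one to write is mostly a question of whether the rest of the
string needs double-quote features.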
Sarah A. (Guest)
on 2009-05-11 04:18
7stud -- wrote:
> If you need to use double quoted strings, then you need to literally
> have \1 in your string, which requires the use of additional backslashes
> to escape the \ in "\1".  So try "\\1".  With ruby if one backslash is
> not enough, keep adding more backslashes until whatever you are trying
> to accomplish works!

Eureka!

>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
=> "<a href='http://foo.com/one=%3Etwo'/>"

Thanks so much for your help.

Sarah
Robert K. (Guest)
on 2009-05-11 10:41
(Received via mailing list)
On 11.05.2009 02:18, Sarah A. wrote:
> => "<a href='http://foo.com/one=%3Etwo'/>"
>
> Thanks so much for your help.

That does not work as 7stud did not mention the most important point:
even with proper escaping this won't work as the string interpolation
takes place *before* gsub is invoked and hence URI.escape will insert
something but not the matched portion.  In your tests it has probably
worked because $2 was properly set from the previous match.
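
A minimal sketch of that pitfall, under some assumptions: shout is a
made-up stand-in for URI.escape (which was removed in Ruby 3.0), and the
"previous=match" line simulates an unrelated earlier match that leaves $2
set.

```ruby
# Hypothetical stand-in for URI.escape so the effect is easy to see.
def shout(str)
  str.upcase
end

# The replacement string is built BEFORE gsub runs, so $2 still refers
# to whatever matched earlier in this scope.
def eager_replace(html)
  "previous=match" =~ /(\w+)=(\w+)/   # unrelated earlier match; $2 == "match"
  html.gsub(/href=(['"])([^'"]*)/, "href=\\1#{shout($2)}")
end

# The block runs during gsub, when $1 and $2 describe the current match.
def block_replace(html)
  html.gsub(/href=(['"])([^'"]*)/) { "href=#{$1}#{shout($2)}" }
end

html = "<a href='http://foo.com/one'/>"
puts eager_replace(html)   # <a href='MATCH'/>
puts block_replace(html)   # <a href='HTTP://FOO.COM/ONE'/>
```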

In this case the block form of gsub is needed:

irb(main):007:0> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
irb(main):008:0> s.gsub(/href=(["'])([^'"]+)\1/) { "href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=%3Etwo'/>"

irb(main):009:0> s = "<a href=\"http://foo.com/one=>two\"/>"
=> "<a href=\"http://foo.com/one=>two\"/>"
irb(main):010:0> s.gsub(/href=(["'])([^'"]+)\1/) { "href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=%3Etwo\"/>"

And if the quotes differ, with my regexp no replacement takes place:

irb(main):011:0> s = "<a href=\"http://foo.com/one=>two'/>"
=> "<a href=\"http://foo.com/one=>two'/>"
irb(main):012:0> s.gsub(/href=(["'])([^'"]+)\1/) { "href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=>two'/>"

Whether this is what you want is up to you, but AFAIK mixing quote types
is not allowed here, so we probably do not want to do the replacement in
that case.

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI.  I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Kind regards

  robert
Sebastian H. (Guest)
on 2009-05-11 12:23
(Received via mailing list)
On Monday, 11 May 2009 02:13:52, 7stud -- wrote:
> Double
> quoted strings interpret all escaped characters, which means that \1
> gets interpreted into something (but who knows what!).

The character with ASCII value 1.
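
This is easy to check. A small sketch (byte_values is a made-up helper
name, used only to keep the example tidy):

```ruby
# Hypothetical helper: list the byte values that make up a string.
def byte_values(str)
  str.bytes
end

p byte_values("\1")    # [1]       one character, value 1
p byte_values("\3")    # [3]       one character, value 3
p byte_values("\\1")   # [92, 49]  a backslash and the digit "1"
p "\1" == 1.chr        # true
```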
Sarah A. (Guest)
on 2009-05-11 16:38
Robert K. wrote:
> even with proper escaping this won't work as the string interpolation
> takes place *before* gsub is invoked and hence URI.escape will insert
> something but not the matched portion.  In your tests it has probably
> worked because $2 was properly set from the previous match.

Really? So the #{...} gets evaluated before the param is passed to gsub,
but the block is passed as code, so then it is evaluated after.

Using the previous attempt, with a fresh irb session, I can see the
issue:
>> $1
=> nil
>> $2
=> nil
>> $3
=> nil
>> require 'uri'
=> true
>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
NoMethodError: private method `gsub' called for nil:NilClass
  from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:289:in `escape'
  from (irb):8

>> $1
=> nil
>> $2
=> nil
>> $3
=> nil
>> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
>> require 'uri'
=> true
>> s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=%3Etwo'/>"

Nice!

> Note that my regexp has another weakness: the quote character not used
> to quote the URI should be allowed as part of the URI.  I did not want
> to complicate things too much but if you want to deal with this the
> regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn't
deal with well, so I'll not worry about it for this case, but I would be
curious how it might be handled.

Thanks so much,
Sarah
Robert K. (Guest)
on 2009-05-11 16:48
(Received via mailing list)
2009/5/11 Sarah A.:
> Robert K. wrote:
>> even with proper escaping this won't work as the string interpolation
>> takes place *before* gsub is invoked and hence URI.escape will insert
>> something but not the matched portion.  In your tests it has probably
>> worked because $2 was properly set from the previous match.
>
> Really? So the #{...} gets evaluated before the param is passed to gsub,

All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.

> but the block is passed as code, so then it is evaluated after.

In the case of gsub the block is invoked once for each match.
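
To make that concrete, a small sketch. The counter, the upcase_hrefs name
and the sample markup are assumptions added for illustration, and upcase
merely stands in for the escaping step:

```ruby
# Returns the rewritten string plus the number of times the block ran.
def upcase_hrefs(html)
  calls = 0
  result = html.gsub(/href=(["'])([^'"]+)\1/) do
    calls += 1                      # one invocation per match
    "href=#{$1}#{$2.upcase}#{$1}"   # $1/$2 belong to the current match
  end
  [result, calls]
end

html = %q{<a href='a.html'>A</a> <a href="b.html">B</a>}
result, calls = upcase_hrefs(html)
puts result   # <a href='A.HTML'>A</a> <a href="B.HTML">B</a>
puts calls    # 2
```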

>>> s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
>>> $3
> => nil
>>> s = "<a href='http://foo.com/one=>two'/>"
> => "<a href='http://foo.com/one=>two'/>"
>>> require 'uri'
> => true
>>> s.gsub(/href=(["'])([^'"]+)\1/) {
> ?> "href=#$1#{URI.escape($2)}#$1" }
> => "<a href='http://foo.com/one=%3Etwo'/>"
>
> Nice!

:-)

>> Note that my regexp has another weakness: the quote character not used
>> to quote the URI should be allowed as part of the URI.  I did not want
>> to complicate things too much but if you want to deal with this the
>> regular expression must be made a bit more complex.
>
> Wow, interesting. That would be incorrect HTML that the browser doesn't
> deal with well, so I'll not worry about it for this case, but I would be
> curious how it might be handled.

Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"
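
One way this might be put together, as a sketch only. The non-capturing
group around the alternation, the escape_uri helper (a deliberately tiny
stand-in for URI.escape that handles just a few characters) and the sample
inputs are all assumptions made for the example:

```ruby
# Either quote style; only the group for the quote actually used is set.
HREF = /href=(?:'([^']+)'|"([^"]+)")/

# Hypothetical, deliberately tiny stand-in for URI.escape:
# percent-encodes only <, >, double quotes and whitespace.
def escape_uri(str)
  str.gsub(/[<>"\s]/) { |ch| format('%%%02X', ch.ord) }
end

def escape_hrefs(html)
  html.gsub(HREF) do
    single, double = $1, $2   # copy the groups before doing anything else
    if single
      "href='#{escape_uri(single)}'"
    else
      "href=\"#{escape_uri(double)}\""
    end
  end
end

puts escape_hrefs(%q{<a href='http://foo.com/one=>"two"'/>})
# <a href='http://foo.com/one=%3E%22two%22'/>
puts escape_hrefs(%q{<a href="http://foo.com/it's=>x"/>})
# <a href="http://foo.com/it's=%3Ex"/>
```

With this shape the other quote character is accepted inside the URI, and
mismatched quotes still leave the input untouched.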

> Thanks so much,

You're welcome.

Kind regards

robert
Sarah A. (Guest)
on 2009-05-11 17:00
Robert K. wrote:
> All method parameters are evaluated before method invocation - this is
> true for every method invocation in Ruby.
>
>> but the block is passed as code, so then it is evaluated after.
>
> In the case of gsub the block is invoked once for each match.

These are really important details to understand.  Thanks for pointing
them out.

> Basically you need an alternative and more capturing groups along the
> lines of
>
> '([^']+)'|"([^"]+)"

Ah, of course.  I knew that, but didn't put it together.

I am so appreciative of the folks on this list.

Thank you 7stud, Sebastian & Robert!

Sarah