Problem replacing $data[abc] with $data['abc'] using gsub

luislavena · March 12, 2012, 3:34pm

In a string I would like to swap out all occurrences of $data[whatever]
with $data[‘whatever’]

Example:
print “”.$data[first]."
".$data[last]."";
should become
print “”.$data[‘first’]."
".$data[‘last’]."";

Could anyone give me a pointer as to the correct way to do this.
I tried using gsub and can successfully insert the apostrophes in the
new string, but am having trouble with what comes between the square
bracketrs, i.e. I can achieve

print “”.$data[’’]."
".$data[’’]."";

which is nice, but essentially useless.

Is gsub the way to go?
Am grateful for any advice.

yankees26 · March 12, 2012, 3:44pm

It’s all good, I answered my own question.

string.gsub(/$data[(.*?)]/){"$data[’#{$1}’]"}

Would still be interested to know if this is the best way.

yankees26 · March 12, 2012, 7:40pm

The part “.*?” of the regular expression is very inefficient, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.

It’s better to make an explicit expression for what you are looking for:

/$data[([^]]*)]/

This reads as “The String ‘$data[’ followed by zero or more characters,
which are not closing square brackets, followed by a closing square
bracket”.

Also you don’t need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

string.gsub /$data[([^]]*)]/, ‘$data[’\1’]’

Maybe the regular expression should also check if there are already
quotes.

yankees26 · March 13, 2012, 11:05am

The part “.*?” of the regular expression is very inefficient, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.
Oh right, I assumed it would find the first (least greedy) match and
stop.

It’s better to make an explicit expression for what you are looking for:

/$data[([^]]*)]/

This reads as “The String ‘$data[’ followed by zero or more characters,
which are not closing square brackets, followed by a closing square
bracket”.
Thanks for the explanation. That’s quite comprehensive.

Also you don’t need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

string.gsub /$data[([^]]*)]/, ‘$data[’\1’]’
This is what I was looking for. Does this have a name I could Google.

Maybe the regular expression should also check if there are already
quotes.
Using what you’ve demonstrated above, would this then be a good way to
exclude strings which contain a leading or trailing quote within the
square brackets?
/$data[([^’][^]]*[^’])]/

yankees26 · March 13, 2012, 11:20am

Joao S. wrote in post #1051150:

It’s all good, I answered my own question.

string.gsub(/$data[(.*?)]/){"$data[’#{$1}’]"}

Would still be interested to know if this is the best way.

You want to replace “bare strings” only. In other words, you want to
replace:

$data[aaa]

but do DON’T want to replace:

$data[aaa()]
$data[“aaa”]
$data[12]
$data[$aaa]

So you can use the following code:

document = document.gsub(
/$data [ ([a-z_][a-z\d_]*) ] /xi,
“$data[’\1’]”
)

yankees26 · March 13, 2012, 9:27pm

Jan E. wrote in post #1051180:

The part “.*?” of the regular expression is very inefficient

Is it? Have you measured it?

, because it
will at first consume every character until the end of the line and then
try to find the minimum of characters needed.

Does it? There are many implementations of ruby, which particular one(s)
are you referring to?

Your argument suggests that

/(.+?)(.+?)(.+?)(.+?)(.+?)/ =~ "a"*1_000_000

would be extremely inefficient, but actually it runs very fast for me.

So let’s demonstrate if you are right or wrong:

require ‘benchmark’

LONGSTR = (“a” * 1_000_000).freeze

Benchmark.bmbm do |x|
x.report(“chars”) { 1_000_000.times { /aaaaa/ =~ LONGSTR } }
x.report(“non-greedy”) { 1_000_000.times { /.?.?.?.?.*?/ =~
LONGSTR } }
end

And the results for me, using ruby 1.8.7 under Mac OSX Lion on a Macbook
Air i7:

Rehearsal ----------------------------------------------
chars 0.520000 0.000000 0.520000 ( 0.516861)
non-greedy 0.510000 0.000000 0.510000 ( 0.511089)
------------------------------------- total: 1.030000sec

             user     system      total        real

chars 0.510000 0.000000 0.510000 ( 0.505664)
non-greedy 0.510000 0.000000 0.510000 ( 0.511662)

I see no difference there.

Also you don’t need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

You can, but the block version is often clearer, especially if you are
doing things like backslash-escaping strings:

clear

a.gsub(/(.)/) { “\#{$1}” }

same result but horrible

a.gsub(/(.)/), “\\\1”)

yankees26 · March 13, 2012, 12:58pm

Joao S. wrote in post #1051235:

Also you don’t need the block version of gsub. You can simple use a
substitute string and refer to the parenthesized subexpression by \1:

string.gsub /$data[([^]]*)]/, ‘$data[’\1’]’
This is what I was looking for. Does this have a name I could Google.

It’s called backreference. You can also use it in your regular
expression to refer to subexpressions previously matched:

/(\w)\w*\1/

This matches any string that consists of word characters and begins and
ends with the same character.

Joao S. wrote in post #1051235:

Maybe the regular expression should also check if there are already
quotes.
Using what you’ve demonstrated above, would this then be a good way to
exclude strings which contain a leading or trailing quote within the
square brackets?
/$data[([^’][^]]*[^’])]/

Yes, but you should also exclude double quotes:

/$data[([^’"][^]]*[^’"])]/

Albert’s solution is probably even better. As a small modification, I
would include possible whitespace:

/$data[[ \t]([a-z_][a-z\d_])[ \t]*]/i

However, even this expression doesn’t cover each and every case. For
example, the word in the square brackets could be the identifier of a
method or a local variable (then it must not be quoted). But fixing this
would probably be too complex. You won’t be able to do it with a simple
regular expression.

yankees26 · March 13, 2012, 9:43pm

2012/3/13 Brian C. [email protected]:

are you referring to?

Your argument suggests that

/(.+?)(.+?)(.+?)(.+?)(.+?)/ =~ “a”*1_000_000

would be extremely inefficient, but actually it runs very fast for me.

So let’s demonstrate if you are right or wrong:

Sorry, but you’re both wrong. The expression, as far as I know,
will not first consume eveyrthing, and then back off (or at least
this is implementation-defined). Such regexps can, however, be slow.
(Although .+ can be slow as well.)

Let’s try something a tiny bit more complicated (but still simple).

irb(main):001:0> re = /.+?a.+?b/
=> /.+?a.+?b/
irb(main):002:0> s = “ac”*1_000_000 + “b”; nil
=> nil
irb(main):003:0> s =~ re
=> 0

This run fasts. Now, what if it can’t match?

irb(main):004:0> s = “ac”*1_000_000; nil
=> nil
irb(main):005:0> s =~ re

I have no idea what the output is - it’s been running for a few
minutes… the same happens if you substitute .+? with .+.

This is a bad case of catastrophic backtracking -
Runaway Regular Expressions: Catastrophic Backtracking - and stems from
the fact that matching a regex to a string is asymptotically
exponential.

In short - you should never use .+ or .+?, unless you have a very good
reason or know precisely what kinds of input you will get.

yankees26 · March 14, 2012, 10:28pm

Bartosz Dziewoński wrote in post #1051341:

This is a bad case of catastrophic backtracking -
Runaway Regular Expressions: Catastrophic Backtracking - and stems from
the fact that matching a regex to a string is asymptotically
exponential.

Well, matching a real regular expression to a string can be done in
linear time by a DFA, with no backtracking. (However, it’s true that the
conversion of some pathological regexps into DFAs can take exponential
resources)

The trouble is that “regexps” in programming languages seem to have
grown features beyond true regular expressions.

yankees26 · March 13, 2012, 10:06pm

Yeah, I was wrong about all characters being consumed. That’s how the
greedy quantifier would have worked (it wouldn’t make sense at all for
the non-greedy one).