Short regexp question

Fritzek · September 18, 2008, 4:57pm

Hi folks

short question how to use regexp the right way
given is a="a[b]c"
I need b="a c"

tried to split using /(\[|\])/ to get ["a", "[", "b", "]", "c"]

b = a.strip.split(/(\[|\])/)

and then joined the bits together. this only works in a simple case
like "a[b]c", but "[b]" could occur multiple times.

I need something like search for any “[b]” and substitute with a
blank.

Thanks in advance

Fritzek

Fritzek · September 18, 2008, 5:02pm

Hi –

On Thu, 18 Sep 2008, Fritzek wrote:

and then joined the bits together. this only works in a simple case
like "a[b]c", but "[b]" could occur multiple times.

I need something like search for any “[b]” and substitute with a
blank.

b = a.delete("[b]")

David

Fritzek · September 18, 2008, 5:14pm

I need something like search for any “[b]” and substitute with a
blank.

b = a.gsub(/\[b\]/, ' ')

Fritzek · September 18, 2008, 5:40pm

Fritzek wrote:

short question how to use regexp the right way
given is a="a[b]c"
I need b="a c"

"a[b]c".gsub(/\[.*?\]/, " ")

HTH,
Sebastian

Fritzek · September 18, 2008, 5:16pm

b = a.gsub(/\[b\]/, ' ')

Also possibly useful for you:

a = "aaa[b]bbb[b]ccc"
bits = a.split(/\[b\]/)
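
As a rough illustration of how split behaves here (a sketch, not part of the original posts): without a capture group the delimiters are dropped, with a capture group they are kept, and the pieces can be joined with a blank afterwards.

```ruby
a = "aaa[b]bbb[b]ccc"

# No capture group: the "[b]" delimiters are dropped.
bits = a.split(/\[b\]/)
puts bits.inspect

# Capture group: split keeps the captured delimiters in the result.
with_delims = "a[b]c".split(/(\[|\])/)
puts with_delims.inspect

# Split on any bracketed chunk, then join the pieces with a blank.
joined = a.split(/\[[^\]]*\]/).join(" ")
puts joined.inspect
```

One caveat: split drops trailing empty strings by default, so for input that ends in a bracketed chunk the split/join route and gsub give different results.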

Fritzek · September 18, 2008, 7:12pm

Hi Brian

thanks for your answer. as I stated to David, I just know about the
surrounding brackets not the bits between them.

Fritzek

Fritzek · September 18, 2008, 7:12pm

Hi Sebastian

thanks for the solution. works perfectly.

Fritzek

On 18 Sep., 17:31, Sebastian H. [email protected]

Fritzek · September 18, 2008, 5:49pm

Hi David

thanks for the quick answer. your code only works if you know "b". I only
know the surrounding brackets "[" and "]". The bit in between could be
anything. sorry, forgot to mention.

Fritzek

Fritzek · September 19, 2008, 3:15pm

Hi Robert

thanks for your objection, but could you briefly explain the
difference (for regexp dummies like me)?

Fritzek

Fritzek · September 19, 2008, 3:32pm

2008/9/19 Fritzek [email protected]:

thanks for your objection, but could you briefly explain the
difference (for regexp dummies like me)?

Ideally you would read "Mastering Regular Expressions", which explains
such topics very nicely.

I believe it is generally better to be more specific about what is to
match (mainly for robustness reasons). Also, with the reluctant
quantifier, for every character in the input the engine either has to
test a match against the next subpattern or has to backtrack to find
out whether there is a shorter match. Neither seems very efficient.
Granted, this is no hard evidence, but if you are curious I suggest you
do some benchmarks and read the book; it's really good!

Kind regards

robert
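
To make the robustness point a bit more concrete, here is a small sketch comparing the two patterns from this thread. The sample strings are made up for illustration; the constant names LAZY and NEGATED are not from the original posts.

```ruby
LAZY    = /\[.*?\]/      # reluctant quantifier
NEGATED = /\[[^\]]*\]/   # negated character class

# Simple input: both patterns behave the same.
simple = "a[b]c[dd]e"
puts simple.gsub(LAZY, " ").inspect
puts simple.gsub(NEGATED, " ").inspect

# A newline inside the brackets: "." does not match "\n" by default,
# so the lazy pattern leaves the string untouched.
multiline = "a[b\nc]d"
puts multiline.gsub(LAZY, " ").inspect
puts multiline.gsub(NEGATED, " ").inspect

# With more pattern after the closing bracket, the lazy dot may run past
# a "]" to make the overall match succeed; the negated class cannot.
tail = "[a] [b]x"
puts tail[/\[.*?\]x/].inspect
puts tail[/\[[^\]]*\]x/].inspect
```

In the last case the lazy version matches from the first "[" all the way to "x", which is probably not what was intended, while the negated class only matches "[b]x".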

Fritzek · September 19, 2008, 3:50pm

Hi Robert

thanks for the explanation and the book tip. will look for it.
Fritzek

Fritzek · September 19, 2008, 5:50pm

I wrote a quickie benchmark. CPU speed and compile options will
certainly influence your results.

http://snippets.dzone.com/posts/show/6098

Also, best intro to regular expressions ever:

Regular Expression Tutorial - Learn How to Use Regular Expressions

Fritzek · September 18, 2008, 10:16pm

On 18.09.2008 17:47, Fritzek wrote:

given is a="a[b]c"
I need b="a c"
"a[b]c".gsub(/\[.*?\]/, " ")

Not sure whether it makes a difference performance wise but I am always
reluctant to use reluctant quantifiers. I’d rather do

irb(main):003:0> "a[b]c".gsub /\[[^\]]*\]/, ' '
=> "a c"

Kind regards

robert

Fritzek · September 20, 2008, 1:49pm

2008/9/19 Tod B. [email protected]:

I wrote a quickie benchmark. CPU speed and compile options will
certainly influence your results.

http://snippets.dzone.com/posts/show/6098

Hm, it seems line 13 and 18 are identical. Where’s the lazy quantifier?

Here’s what I’d consider a better benchmark, as it covers the
scenarios I was talking about, especially with situations where there
is a second potential end point (“b” in this case):

robert@fussel /cygdrive/c/Temp
$ cat l.rb
#!/bin/env ruby

require 'benchmark'

REP = 1_000
LONG = 1_000

STRINGS = [
  ["short match", "ab"],
  ["short mismatch", "a"],
  ["long match", "a" * LONG + "b"],
  ["long mismatch", "a" * LONG],
  ["short match double", "abab"],
  ["long match double", "a" * LONG + "bb"],
  ["long match double long", "a" * LONG + "b" + "a" * LONG + "b"],
]

Benchmark.bmbm(6 + STRINGS.inject(0) {|m,(a,b)| a.length > m ? a.length : m }) do |b|
  STRINGS.each do |label, str|
    rep = /long mis/ =~ label ? 100 : 100_000

    b.report "neg  " + label do
      rep.times { /a[^b]*b/ =~ str }
    end

    b.report "lazy " + label do
      rep.times { /a.*?b/ =~ str }
    end
  end
end

robert@fussel /cygdrive/c/Temp
$ ./l.rb
Rehearsal ----------------------------------------------------------------
neg  short match              0.282000   0.000000   0.282000 (  0.288000)
lazy short match              0.297000   0.000000   0.297000 (  0.284000)
neg  short mismatch           0.328000   0.000000   0.328000 (  0.341000)
lazy short mismatch           0.375000   0.000000   0.375000 (  0.366000)
neg  long match               9.531000   0.000000   9.531000 (  9.982000)
lazy long match              12.625000   0.000000  12.625000 ( 12.764000)
neg  long mismatch            4.672000   0.000000   4.672000 (  4.742000)
lazy long mismatch            6.297000   0.000000   6.297000 (  6.422000)
neg  short match double       0.297000   0.000000   0.297000 (  0.291000)
lazy short match double       0.281000   0.000000   0.281000 (  0.287000)
neg  long match double        9.406000   0.000000   9.406000 (  9.443000)
lazy long match double       12.500000   0.000000  12.500000 ( 12.592000)
neg  long match double long   9.516000   0.000000   9.516000 (  9.642000)
lazy long match double long  12.547000   0.000000  12.547000 ( 12.745000)
------------------------------------------------------ total: 78.954000sec

                                  user     system      total        real
neg  short match              0.312000   0.000000   0.312000 (  0.305000)
lazy short match              0.297000   0.000000   0.297000 (  0.301000)
neg  short mismatch           0.375000   0.000000   0.375000 (  0.388000)
lazy short mismatch           0.359000   0.000000   0.359000 (  0.356000)
neg  long match               9.344000   0.000000   9.344000 (  9.637000)
lazy long match              12.547000   0.000000  12.547000 ( 12.777000)
neg  long mismatch            4.703000   0.000000   4.703000 (  4.783000)
lazy long mismatch            6.219000   0.000000   6.219000 (  6.242000)
neg  short match double       0.297000   0.000000   0.297000 (  0.301000)
lazy short match double       0.297000   0.000000   0.297000 (  0.297000)
neg  long match double        9.453000   0.000000   9.453000 (  9.531000)
lazy long match double       12.718000   0.000000  12.718000 ( 13.566000)
neg  long match double long   9.407000   0.000000   9.407000 (  9.442000)
lazy long match double long  12.500000   0.000000  12.500000 ( 12.777000)

robert@fussel /cygdrive/c/Temp

Notice how lazy is roughly a third slower for the long strings (about
12.5s vs. 9.4s), while the short cases show no real difference.

Also, best intro to regular expressions ever:

Regular Expression Tutorial - Learn How to Use Regular Expressions

Good ref!

Kind regards

robert

Fritzek · September 22, 2008, 5:23pm

Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched = false
if str.match(/b/)
  matched = true if str.match(/a[^b]*b/)
end
matched

Fritzek · September 22, 2008, 5:26pm

2008/9/22 Tod B. [email protected]:

Anyway, I think the moral of this particular long-missing story is, if
you can regex test for smaller anchors first, you can then fail to
match much faster. IOW:

matched = false
if str.match(/b/)
  matched = true if str.match(/a[^b]*b/)
end
matched

I am not sure. This approach is likely slower than a single fast RX -
at least if you expect matches most of the time. It all depends…

Kind regards

robert
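
For anyone who wants to measure this trade-off rather than guess, a minimal sketch could look like the following. The method names, REPS and LONG are assumptions chosen for this example, and the numbers will depend on the Ruby version and machine.

```ruby
require 'benchmark'

PATTERN = /a[^b]*b/
REPS = 100      # assumed repetition count, kept small so the run is quick
LONG = 1_000    # assumed length of the long test strings

# Single regex test.
def single_test(str)
  !(PATTERN =~ str).nil?
end

# Two-step test: check for the anchor "b" first, then run the full pattern.
def prefiltered_test(str)
  return false unless /b/ =~ str
  !(PATTERN =~ str).nil?
end

SAMPLES = {
  "long match"    => "a" * LONG + "b",
  "long mismatch" => "a" * LONG,
}

Benchmark.bm(28) do |bm|
  SAMPLES.each do |label, str|
    bm.report("single      #{label}") { REPS.times { single_test(str) } }
    bm.report("prefiltered #{label}") { REPS.times { prefiltered_test(str) } }
  end
end
```

The prefilter can only help in the mismatch case; when a match is expected, it adds a second scan of the string on top of the real test.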

Fritzek · September 23, 2008, 12:21am

On Sep 22, 2008, at 8:10 AM, Robert K. wrote:

matched

I am not sure. This approach is likely slower than a single fast RX -
at least if you expect matches most of the time. It all depends…

Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

Cheers-
-Ezra

Fritzek · September 22, 2008, 4:39pm

On Sat, Sep 20, 2008 at 6:40 AM, Robert K.
[email protected] wrote:

Hm, it seems line 13 and 18 are identical. Where’s the lazy quantifier?

grr curse my copy paste skills. fixed. thanks for paying attention,
Robert. Your bm test is, of course, much more useful.

Fritzek · September 23, 2008, 9:44am

Also keep in mind that =~ is generally a lot faster than .match since
match has to build the full MatchData object even if you do not use it.

With =~ the MatchData can still be obtained from $~

Interestingly, not referencing the MatchData does give a big speed
improvement.

$ time ruby -e '5_000_000.times { /b/.match("abc") }'

real 0m28.699s
user 0m28.490s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc"; $~ }'

real 0m28.119s
user 0m27.910s
sys 0m0.024s

$ time ruby -e '5_000_000.times { /b/ =~ "abc" }'

real 0m14.311s
user 0m14.285s
sys 0m0.008s

$ ruby -v
ruby 1.8.6 (2008-03-03 patchlevel 114) [i686-linux]
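
The same three variants can also be compared in a single script with the Benchmark module. This is only a sketch: ITERATIONS and SUBJECT are names chosen for the example, the iteration count is much smaller than the 5_000_000 used above, and results on a current Ruby may look quite different from the 1.8.6 timings.

```ruby
require 'benchmark'

ITERATIONS = 200_000   # assumed count, kept small so the script finishes quickly
SUBJECT = "abc"

Benchmark.bm(12) do |bm|
  # .match always returns a MatchData object (or nil).
  bm.report(".match")     { ITERATIONS.times { /b/.match(SUBJECT) } }

  # =~ returns the match position; reading $~ afterwards asks for the MatchData.
  bm.report("=~ then $~") { ITERATIONS.times { /b/ =~ SUBJECT; $~ } }

  # =~ alone, never touching the MatchData.
  bm.report("=~ only")    { ITERATIONS.times { /b/ =~ SUBJECT } }
end
```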
