Regex: capture groups and term binding

sijomo · September 28, 2007, 12:40pm

Hi All,

Let’s get down to it…

I have a long string of the form:

string = <<-EOVAR
XD 1 * 100000436 3441863 1550663 1161254 951982
XD 1 479903531056 47988002622 21360568539 18276299303 15476234490
XD 1 66934 5552 321640438 40297830 0
XD 1 0 3235 2197 10907 1631621
XD 1 15488078 210564267 574075997 2405132745 7805716381
XD 1 0 4949 0 58361 0
(goes for about 17 lines, all separated by \n)
EOVAR

I’m building a regex for this string and it’s pretty straightforward.
Only prerequisite is to capture all numbers for later Ruby fun:

regex = %r{XD\s1\s\*\s(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\n …etc…
}mx

I would like to pare it down a bit, using term binding:

regex = %r{XD\s1\s\*\s(\d+\s+){5}\n …etc…}mx

If I do this then only the last group is captured

pp string.scan(regex)
[["951982\n"]]
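A minimal sketch of what is happening here, using one sample line rather than the full log. A capture group that sits under a quantifier is overwritten on every repetition, so only the last iteration survives. The constant and method names below are illustrative, and the patterns assume the first line contains a literal asterisk. One common workaround is to capture the whole repeated run once and split it afterwards.

```ruby
# One sample line, assumed to match the first line of the data above.
FIRST_LINE = "XD 1 * 100000436 3441863 1550663 1161254 951982\n"

# Group 1 is reused for each of the five repetitions, so it ends up
# holding only the text of the final repetition.
REPEATED_GROUP = /XD\s1\s\*\s(\d+\s+){5}/

# Workaround: a non-capturing group does the repeating, and one outer
# group captures the whole run of numbers.
WHOLE_RUN = /XD\s1\s\*\s((?:\d+\s+){5})/

def last_iteration(line)
  line[REPEATED_GROUP, 1]
end

def all_numbers(line)
  run = line[WHOLE_RUN, 1]
  run ? run.split : []
end

p last_iteration(FIRST_LINE) # prints "951982\n"
p all_numbers(FIRST_LINE)
```

The second method returns every number as a separate string, at the cost of one extra split after the match.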

If this worked, I could shorten it much much more… all of the lines
after the first one have exactly the same format and I need to capture
all of the variables.

mother_of_all_regexen = %r{XD\s1\s*\s((\d+\s+){5})\n(XD\s1
(\d+\s+){5})){17} }mx

or something

So,

Can I use capture groups and term binding?
Why am I only capturing the last term?
Should I just stop trying to be clever and explicitly match against
all parts of the string?

The reason I want to do this as a single regex is that I’ve written a
framework that grabs files, monkeys around with them and then applies
a rule-set from a YAML file to create output. For each “signature” in
the YAML file one can choose a defined action (match, count, compare,
etc.), each of which maps to a method in the main code. This allows the
editor of the YAML to add signatures etc. to their heart’s desire… And
more importantly, it means that I won’t have to maintain the ruleset.
(woohoo!)

Thanks in advance for any suggestions

SM

sijomo · September 28, 2007, 1:21pm

Hey Simon,

string = <<-EOVAR
XD 1 * 100000436 3441863 1550663 1161254 951982
XD 1 479903531056 47988002622 21360568539 18276299303 15476234490
XD 1 66934 5552 321640438 40297830 0
XD 1 0 3235 2197 10907 1631621
XD 1 15488078 210564267 574075997 2405132745 7805716381
XD 1 0 4949 0 58361 0
(goes for about 17 lines, all separated by \n)
EOVAR

Maybe I am seriously misunderstanding something, but why not just:

string.split("\n").map{|line| line.scan(/\d+/)} ?
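A runnable sketch of this suggestion, assuming a three-line excerpt of the data and a squiggly heredoc (Ruby 2.3 or later). The names XD_SAMPLE and numbers_per_line are made up for illustration, and the conversion to integers is an addition that may or may not suit the "later Ruby fun".

```ruby
# Three lines taken from the sample data in the question.
XD_SAMPLE = <<~EOVAR
  XD 1 * 100000436 3441863 1550663 1161254 951982
  XD 1 66934 5552 321640438 40297830 0
  XD 1 0 4949 0 58361 0
EOVAR

# Split on newlines, then pull every run of digits out of each line.
# The "1" after "XD" is picked up as well, because it is also a number.
def numbers_per_line(text)
  text.split("\n").map { |line| line.scan(/\d+/).map(&:to_i) }
end

numbers_per_line(XD_SAMPLE).each { |row| p row }
```

Since scan returns however many matches it finds, this does not depend on each line holding the same count of numbers.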

Cheers,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org

sijomo · September 28, 2007, 1:43pm

Hi Peter,

This is a good idea… I wasn’t clear in my original post but the
problem is that some of the lines have 3 (\d+), some 4 and some 5.
Also, there are 4 different groups of data sprinkled through a load of
log files.
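One way to cope with lines that carry three, four or five numbers is a bounded quantifier on a non-capturing group, with a single outer group around the whole run. This is only a sketch: the assumed line shape ("XD", an id, an optional asterisk, then three to five integers) is inferred from the sample, and ROW and parse_row are hypothetical names.

```ruby
# Assumed shape: "XD <id>", an optional "*", then three to five integers.
# The first number is matched on its own and followed by two to four
# "whitespace + number" repetitions, so digits cannot be split up to
# fake a longer run.
ROW = /\AXD\s+(\d+)\s+(?:\*\s+)?(\d+(?:\s+\d+){2,4})\z/

# Returns a hash for a matching line, or nil for anything else.
def parse_row(line)
  m = ROW.match(line.strip)
  return nil unless m

  { id: m[1], values: m[2].split }
end

p parse_row("XD 1 * 100000436 3441863 1550663 1161254 951982\n")
p parse_row("XD 1 0 4949 0\n")
p parse_row("XD 1 12 34\n") # prints nil, only two values
```

Lines outside the three-to-five range return nil, which makes it easy to spot data that does not belong to this group.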

Another way of slimming down the regex horror might be to use a bunch
of mini regexes and then combine them into “recipes”.

So, a new method for the Regexp class (shamelessly plagiarized from this
group)

class Regexp
  def +(other)
    if other.is_a?(Regexp)
      if self.options == other.options
        Regexp.new(source + other.source, options)
      else
        Regexp.new(source + other.to_s, options)
      end
    else
      Regexp.new(source + Regexp.escape(other.to_s), options)
    end
  end
end

r1 = %r{XD\s*\s}
r2 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n}mx
r3 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n}mx
r4 = %r{(\d)\s(\d+)\s(\d+)\s(\d+)\n}mx

recipe1 = r1 + r2 + r2 + r3 + r2 + r4 + r3 … and so on
recipe2 = r1 + r2 + r4 + r4 + r3 …
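The recipe idea can also be sketched without reopening Regexp, by joining the source of each piece. This is a simplified stand-in, not the method above: it ignores per-piece options, and the piece names, build_recipe and the two-line sample text are all invented for illustration.

```ruby
# Hypothetical building blocks: a row prefix and two row bodies.
ROW_START   = /XD\s1\s/
FIVE_FIELDS = /(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\n/
THREE_FIELDS = /(\d+)\s(\d+)\s(\d+)\n/

# Concatenate the sources of the pieces into one regex.
# Options on the individual pieces are deliberately dropped here.
def build_recipe(*parts)
  Regexp.new(parts.map(&:source).join)
end

RECIPE = build_recipe(ROW_START, FIVE_FIELDS, ROW_START, THREE_FIELDS)

RECIPE_TEXT = "XD 1 66934 5552 321640438 40297830 0\nXD 1 3235 2197 10907\n"

p RECIPE.match(RECIPE_TEXT).captures
```

Because every group is written out in the pieces, each number gets its own capture slot, which is the property the quantified version loses.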

In the end I’ve used one huge whacking great regex for each “recipe”.
My main question was whether we can combine capture groups and term
binding. It seems the RE engine does the captures first and then
unwinds the binding. Or something.

Cheers

SM

sijomo · September 28, 2007, 2:00pm

Simon M. wrote:

Hi Peter,

This is a good idea… I wasn’t clear in my original post but the
problem is that some of the lines have 3 (\d+), some 4 and some 5.
Also, there are 4 different groups of data sprinkled through a load of
log files.

Could you please give an example of what the expected result looks like
for the above dataset? Possibly it’s not my day, but I still didn’t get
what you are trying to accomplish

The result of my solution was:

[["1", "100000436", "3441863", "1550663", "1161254", "951982"],
["1", "479903531056", "47988002622", "21360568539", "18276299303",
"15476234490"],
["1", "66934", "5552", "321640438", "40297830", "0"],
["1", "0", "3235", "2197", "10907", "1631621"],
["1", "15488078", "210564267", "574075997", "2405132745",
"7805716381"],
["1", "0", "4949", "0", "58361", "0"]]

How does the result you are expecting differ from the above one?

Cheers,
Peter
__
http://www.rubyrailways.com
http://scrubyt.org

sijomo · September 28, 2007, 3:02pm

Hi Peter,

Sorry if I’m not being clear - this is more a regex question than a ruby
one.

I’ll try again.

str = "100000436 3441863 1550663 1161254 951982"

re = %r{(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)}

Let’s shorten the re using grouping and a quantifier:
re2 = %r{(\d+)(?:\s(\d+)){4}}

pp re.match(str)
#<MatchData
"100000436 3441863 1550663 1161254 951982"
"100000436"
"3441863"
"1550663"
"1161254"
"951982">

pp re2.match(str)
#<MatchData "100000436 3441863 1550663 1161254 951982" "100000436" "951982">

so, either:

1 - My re2 regex is incorrect.
2 - You cannot do this with the ruby regex engine.

From experience, I’d guess it’s probably 1.
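For what it is worth, neither guess is quite right. A quantified group still owns a single capture slot, and Ruby's engine, like most backtracking engines, keeps only the text of its last iteration. A rough sketch of two ways around that follows. The helper repeated_groups is a made-up name that generates the explicit pattern, so each number gets its own group without typing them all out.

```ruby
NUMBERS = "100000436 3441863 1550663 1161254 951982"

# Two capture slots only: the first number, and whatever the repeated
# group matched on its final pass.
QUANTIFIED = /(\d+)(?:\s(\d+)){4}/

# Build "(\d+)\s(\d+)\s..." with one real group per number.
def repeated_groups(count)
  Regexp.new((['(\d+)'] * count).join('\s'))
end

EXPLICIT = repeated_groups(5)

p QUANTIFIED.match(NUMBERS).captures # prints ["100000436", "951982"]
p EXPLICIT.match(NUMBERS).captures
p NUMBERS.scan(/\d+/)
```

The generated regex and the plain scan both yield all five numbers. The generated one keeps everything in a single pattern, which may fit a rule-set driven from YAML better.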

Thanks!

SM

sijomo · September 29, 2007, 5:20am

On Sep 28, 2007, at 7:43 AM, Simon M. wrote:

Hi Peter,

Hi Simon and Peter

The Regexp#+ method Simon posted is also exactly (mostly) what is
implemented in the aw3s0m3 (sic) library, TextualRegexp.

And let me say, DAMN Regexp.+ makes life easier

---------------------------------------------------------------|
~Ari
“I don’t suffer from insanity. I enjoy every minute of it” --1337est
man alive