Match a pattern multiple times, returning matches, captures and offset?

luislavena · April 5, 2011, 7:22pm

Hi,

I’m used to being able to do the following in PHP. What it basically does
is return all matches, including the captures, ordered by match set,
together with their offsets.

$ php -r 'preg_match_all("/<(\w+)>/", "<foo> <bar>", $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
array(2) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "<foo>"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(1)
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "<bar>"
      [1]=>
      int(6)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(7)
    }
  }
}

I’ve found two ways in Ruby that go in this direction, String#match and
String#scan, but each gives me only part of the information. I guess I
can combine the two, but before attempting that I wanted to check that I
haven’t overlooked something. Here are my Ruby attempts:

ruby-1.9.2-p180 :001 > m = "<foo> <bar>".match(/<(\w+)>/)
=> #<MatchData "<foo>" 1:"foo">
ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
=> ["<foo>", "foo"]
ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
=> [0, 1]

But here I’m missing the further matches, "<bar>" and "bar". Then there
is the #scan approach:

ruby-1.9.2-p180 :004 > m = "<foo> <bar>".scan(/<(\w+)>/)
=> [["foo"], ["bar"]]

But in this case I have even less information: the full matches "<foo>"
and "<bar>" are not present, and I can’t get the offsets either.

I re-read the documentation for Regexp#match and found out that you can
pass an offset into the string as the second parameter, so I guess I can
loop over the string until I find no further matches?
With that in mind I came up with:

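$ cat test_match_all.rb
require 'pp'

class String
  # Collect one MatchData per non-overlapping match of pattern.
  def match_all(pattern)
    matches = []
    offset = 0
    while m = match(pattern, offset) do
      matches << m
      # Resume after this match; step one character past an empty match
      # so the loop cannot get stuck at the same position.
      offset = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
    end
    matches
  end
end

pp "<foo> <bar> <baz>".match_all(/<(\w+)>/)

$ ruby test_match_all.rb
[#<MatchData "<foo>" 1:"foo">,
 #<MatchData "<bar>" 1:"bar">,
 #<MatchData "<baz>" 1:"baz">]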

I have lots of data to parse, so I can foresee this approach becoming a
bottleneck. Is there a more direct solution?

thanks,

Markus

Markus_F · April 5, 2011, 8:07pm

String#scan with a block may do what you want:

"<foo> <bar>".scan(/<(\w+)>/) { |x| puts "Offset #{$`.size}, captures #{x.inspect}" }
Offset 0, captures ["foo"]
Offset 6, captures ["bar"]
=> "<foo> <bar>"

But it doesn’t give you offsets to the individual captures, just to the
start of the whole match. (You also get the full match in $& and the
rest of the string after the match in $'.)
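
As a rough sketch of the same idea, the block can collect the match globals instead of printing them. This assumes the input is "<foo> <bar>" and the pattern is /<(\w+)>/, which is what the offsets quoted in this thread imply; `scan_rows` is only an illustrative name, not an existing method:

```ruby
# Hypothetical helper: one row per match, built from the match globals.
def scan_rows(str, pattern)
  rows = []
  str.scan(pattern) do |captures|
    # $` is the text before the match, $& the whole match, $' the rest
    rows << [$`.size, $&, captures, $']
  end
  rows
end

scan_rows("<foo> <bar>", /<(\w+)>/).each { |row| p row }
# [0, "<foo>", ["foo"], " <bar>"]
# [6, "<bar>", ["bar"], ""]
```

This still only locates the whole match; the positions of the individual captures are not available from these globals alone.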

Markus_F · April 6, 2011, 3:37am

Markus F. wrote in post #991092:

But here I’m missing the further matches, "<bar>" and "bar". Then there
is the #scan approach:

ruby-1.9.2-p180 :004 > m = "<foo> <bar>".scan(/<(\w+)>/)
=> [["foo"], ["bar"]]

But in this case I have even less information: the full matches "<foo>"
and "<bar>" are not present, and I can’t get the offsets either.

I re-read the documentation for Regexp#match

If you read the preamble in the docs for the MatchData class, you will
discover that besides match(), the class method Regexp.last_match also
returns a MatchData object, which you can call inside a scan() block:

str = "<foo> <bar>"

str.scan(/<(\w+)>/) do |curr_match|
  md = Regexp.last_match
  p [md[0], md[1], md.offset(0), md.offset(1)]
end

--output:--
["<foo>", "foo", [0, 5], [1, 4]]
["<bar>", "bar", [6, 11], [7, 10]]

If you need the offset of the grouping from the start of each match, you
can do a little subtraction, e.g. 1-0 and 7-6.

Also, the docs say that a MatchData object just collects all the
ruby/perl $ match variables that are available to you, so I think you
should be able to get the same info from curr_match.
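
For what it’s worth, here is a minimal sketch of how the scan/last_match combination could be wrapped up to produce something shaped like the PHP result. It assumes the input "<foo> <bar>" and the pattern /<(\w+)>/ implied by the offsets above; `match_all` and `match_sets` are hypothetical helper names, not core methods:

```ruby
# Hypothetical helper (not a core method): one MatchData per match.
def match_all(str, pattern)
  matches = []
  str.scan(pattern) { matches << Regexp.last_match }
  matches
end

# Shape the result like PHP's PREG_SET_ORDER | PREG_OFFSET_CAPTURE:
# one array per match, holding [text, offset] for group 0, 1, ...
def match_sets(str, pattern)
  match_all(str, pattern).map do |md|
    (0...md.size).map { |i| [md[i], md.begin(i)] }
  end
end

p match_sets("<foo> <bar>", /<(\w+)>/)
# [[["<foo>", 0], ["foo", 1]], [["<bar>", 6], ["bar", 7]]]
```

Keeping the MatchData objects around means begin, end and offset stay available for every group, so nothing has to be recomputed afterwards.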

Markus_F · April 6, 2011, 11:42am

On Wed, Apr 6, 2011 at 3:37 AM, 7stud – [email protected] wrote:

I re-read the documentation for Regexp#match

If you look at the preamble in the docs for the MatchData class, you can
retrieve a MatchData object using Regexp.last_match, which you can call
inside a scan() block:

When doing nested matching it may be better to use $~ because that is
local to the current stack frame which Regexp.last_match isn’t.
Example with relative offsets as well:

irb(main):022:0> str.scan /<(\w+)>/ do
irb(main):023:1* 2.times {|i| p [$~[i], $~.offset(i), $~.offset(i).map {|o| o - $~.offset(0)[0]}]}
irb(main):024:1> end
["<foo>", [0, 5], [0, 5]]
["foo", [1, 4], [1, 4]]
["<bar>", [6, 11], [0, 5]]
["bar", [7, 10], [1, 4]]
=> "<foo> <bar>"

Kind regards

robert

Markus_F · April 7, 2011, 9:13am

On Thu, Apr 7, 2011 at 1:58 AM, 7stud – [email protected] wrote:

end
That’s nice! I wasn’t aware of this. Thanks for sharing!

I also just read this in the docs:

“Note that the last_match is local to the thread and method scope of the
method that did the pattern match.”

So forget my point about $~ being safer.

Kind regards

robert

Markus_F · April 7, 2011, 1:57am

You can also get relative beginning offsets like this:

str = "<foo> <bar>"

str.scan(/<(\w+)>/) do |curr_match|
  md = Regexp.last_match
  whole_match = md[0]
  captures = md.captures

  captures.each do |capture|
    p [whole_match, capture, whole_match.index(capture)]
  end
end

--output:--
["<foo>", "foo", 1]
["<bar>", "bar", 1]

Markus_F · April 7, 2011, 9:04pm

Brian C. wrote in post #991406:

7stud – wrote in post #991338:

You can also get relative beginning offsets like this:

str = "<foo> <bar>"

str.scan(/<(\w+)>/) do |curr_match|
  md = Regexp.last_match
  whole_match = md[0]
  captures = md.captures

  captures.each do |capture|
    p [whole_match, capture, whole_match.index(capture)]
  end
end

Using ‘index’ doesn’t work if you have multiple captures that match the
same text, or if one is a substring of the other.

Use md.begin and md.end instead.

md = /(...)(...)/.match "foofoo"
=> #<MatchData "foofoo" 1:"foo" 2:"foo">

md.captures
=> ["foo", "foo"]

md.begin(1)
=> 0

md.begin(2)
=> 3

I understand the problem you pointed out with my solution, so Robert K’s
solution is the one left standing. However, note that
begin() and end() are the two elements of offset(), which we’ve already
discussed above. The idea was to additionally provide the relative
offsets within a match, not just the absolute offsets within the string.

Markus_F · April 8, 2011, 9:17am

7stud – wrote in post #991546:

However, note that
begin() and end() are the two elements of offset(), which we’ve already
discussed above. The idea was to additionally provide the relative
offsets within a match, not just the absolute offsets within the string.

That’s easy - subtract begin(0) which is the absolute offset of the
start of the match.

"foo bar" =~ /ba(.)/
=> 4

$~.captures
=> ["r"]

$~.begin(1)
=> 6

$~.begin(1) - $~.begin(0)
=> 2
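
Applied to every match in a scan, the subtraction might look like the sketch below. `relative_capture_offsets` is a made-up name, and the code assumes every group takes part in the match (begin returns nil for a group that did not):

```ruby
# Hypothetical helper: for each match, [capture, offset within the match]
# for every group, computed as begin(i) - begin(0).
def relative_capture_offsets(str, pattern)
  rows = []
  str.scan(pattern) do
    md = Regexp.last_match
    rows << (1...md.size).map { |i| [md[i], md.begin(i) - md.begin(0)] }
  end
  rows
end

p relative_capture_offsets("foo bar baz", /ba(.)/)
# [[["r", 2]], [["z", 2]]]
```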

Markus_F · April 7, 2011, 10:39am

7stud – wrote in post #991338:

You can also get relative beginning offsets like this:

str = "<foo> <bar>"

str.scan(/<(\w+)>/) do |curr_match|
  md = Regexp.last_match
  whole_match = md[0]
  captures = md.captures

  captures.each do |capture|
    p [whole_match, capture, whole_match.index(capture)]
  end
end

Using ‘index’ doesn’t work if you have multiple captures that match the
same text, or if one is a substring of the other.

Use md.begin and md.end instead.

md = /(...)(...)/.match "foofoo"
=> #<MatchData "foofoo" 1:"foo" 2:"foo">

md.captures
=> ["foo", "foo"]

md.begin(1)
=> 0

md.begin(2)
=> 3
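
To make the difference concrete, here is a small side-by-side sketch; the two method names are only illustrative:

```ruby
# Relative offsets found by searching for the captured text (unreliable).
def offsets_by_index(md)
  md.captures.map { |capture| md[0].index(capture) }
end

# Relative offsets taken from the positions MatchData recorded.
def offsets_by_begin(md)
  (1...md.size).map { |i| md.begin(i) - md.begin(0) }
end

md = /(...)(...)/.match("foofoo")
p offsets_by_index(md)  # [0, 0]  both lookups find the first "foo"
p offsets_by_begin(md)  # [0, 3]
```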

Markus_F · April 8, 2011, 9:53pm

Brian C. wrote in post #991686:

7stud – wrote in post #991546:

However, note that
begin() and end() are the two elements of offset(), which we’ve already
discussed above. The idea was to additionally provide the relative
offsets within a match, not just the absolute offsets within the string.

That’s easy - subtract begin(0) which is the absolute offset of the
start of the match.

The “subtraction method” was discussed earlier.