Regular expressions, capture repeated groups

I’m trying to emulate something I did in .NET many moons ago: capture
a named group not just once but on every repetition, and then be able
to see all of those repetitions. I think C# calls them
GroupCollections. This is the kind of code I’m trying to emulate
with Ruby (1.9.1):

using System;
using System.Text.RegularExpressions;

public class Test
{

public static void Main ()
{

    // Define a regular expression for repeated words.
    Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
      RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Define a test string.
    string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

    // Find matches.
    MatchCollection matches = rx.Matches(text);

    // Report the number of matches found.
    Console.WriteLine("{0} matches found in:\n   {1}",
                      matches.Count,
                      text);

    // Report on each match.
    foreach (Match match in matches)
    {
        GroupCollection groups = match.Groups;
        Console.WriteLine("'{0}' repeated at positions {1} and {2}",
                          groups["word"].Value,
                          groups[0].Index,
                          groups[1].Index);
    }

}

}
// The example produces the following output to the console:
// 3 matches found in:
//    The the quick brown fox  fox jumped over the lazy dog dog.
// 'The' repeated at positions 0 and 4
// 'fox' repeated at positions 20 and 25
// 'dog' repeated at positions 50 and 54

For example, if I had the string "11 12" I could have a regex like
/
(?<first> \d+ ) \s \g<first>
/x
that captured "11" and then the repetition "12" and put them in an
array (or some kind of collection) referenced by the name.

I think my attempts to get this to work explain it better. What I
want is the result
#<MatchData "11 12" first:["11", "12"]> or something like it. At the
moment all my attempts end with the named capture keeping only the last
match it made, i.e. "12", with no mention of "11".

I know I could do this a different way, perhaps with split or something,
but I’d like to know if it’s possible with just regex. I understand the
Oniguruma engine is used now but I can’t find any good docs for it.

These are my attempts, $ is my prompt.

$ md1 = /
(?<first> \d+ )
\s \g<first>
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
(?<first> \d+ )
(?: \s \g<first> )?
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

$ md1 = /
(?<first> \d+ )
(?: \s
(?<second> \g<first> )
)?
/x.match( "11 12" )
#<MatchData "11 12" first:"12" second:"12">

$ md1[:first]
"12"

$ md1[:second]
"12"

$ md1 = /
(?: (?<first> \d+ )\s* )+
/x.match( "11 12" )
#<MatchData "11 12" first:"12">

$ md1[:first]
"12"

Iain

On 8 Jul 2010, at 16:15, w_a_x_man wrote:

"The the quick brown fox  fox jumped over the lazy dog dog.".
scan(/((\w+) +\2)/i){|x| puts "#{ x[0] } #{ $~.offset(0)[0]}"}
The the 0
fox  fox 20
dog dog 50

Thanks for that. That would certainly work to a degree, much better than
my current alternative, but it nullifies the usefulness of named
captures. For example, I can’t call

$ md1[:first]

and get back all the matches for the (?<first> ) grouping, which would
be phenomenally useful, because scan returns arrays of strings and not
MatchData.

Iain
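
To make the point about scan concrete, here is a minimal sketch. The variable names and the named group `word` are assumptions chosen for illustration and do not come from the posts above. It shows that scan without a block hands back plain arrays of captured strings, while match returns a MatchData that can be queried by name but covers only the first occurrence.

```ruby
# Assumed test string; the two spaces in "fox  fox" follow the C# example.
text = "The the quick brown fox  fox jumped over the lazy dog dog."
pattern = /\b(?<word>\w+) +\k<word>\b/i

# Without a block, scan returns only the captured strings, one array per match.
captures = text.scan(pattern)
p captures        # [["The"], ["fox"], ["dog"]]

# match returns a MatchData with named access, but only for the first match.
first = pattern.match(text)
p first[:word]    # "The"
p first.begin(0)  # 0
```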

On 8 Jul 2010, at 18:01, botp wrote:

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not MatchData.

waxman hinted at $~

best regards -botp

Ok, I get it now. Thanks for the extra nudge (bang on the head:)

Iain

On Fri, Jul 9, 2010 at 12:38 AM, Iain B. wrote:

Thanks for that. That would certainly work to a degree, much better than my current alternative, but it nullifies the usefulness of named captures. For example, I can’t call

$ md1[:first]

wait till you call the 21st ;)

and get back all the matches for the (?<first> ) grouping, which would be phenomenally useful, because scan returns arrays of strings and not MatchData.

waxman hinted at $~

try eg,

s
#=> "The the quick brown fox  fox jumped over the lazy dog dog."
m=[]
#=> []
s.scan(/((\w+) +\2)/i){|x| m << $~}
#=> "The the quick brown fox  fox jumped over the lazy dog dog."
m.size
#=> 3
m[0]
#=> #<MatchData "The the" 1:"The the" 2:"The">
m[0].offset 0
#=> [0, 7]
m[0].offset

... and so forth ...

best regards -botp
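
Building on the $~ hint, the result Iain asked for (every repetition of a named group, looked up by name) could be sketched roughly as follows. The helper name `collect_named_captures` is hypothetical and not part of Ruby or of this thread. It relies only on String#scan, Regexp#names and Regexp.last_match (the same object as $~), and matches the group once per scan step rather than repeating it inside a single match.

```ruby
# Hypothetical helper: gather every value of each named group across all matches.
def collect_named_captures(pattern, text)
  result = Hash.new { |hash, name| hash[name] = [] }
  text.scan(pattern) do
    match = Regexp.last_match  # the MatchData for the current scan step
    pattern.names.each { |name| result[name] << match[name] }
  end
  result
end

numbers = collect_named_captures(/(?<first>\d+)/, "11 12")
p numbers["first"]  # ["11", "12"]
```

The trade-off is that the pattern describes a single repetition and scan supplies the looping, so the captures end up in a plain Hash keyed by group name rather than in a single MatchData.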