Interpreting "(.*?)" and "(?:\d+ ?)" in REs

Hi All,

I want to extract numbers from records with a leading string of period-
separated numbers. I got great responses to this on the thread
http://groups.google.com/group/comp.lang.ruby/browse_frm/thread/a811f41d733125f3#,
including the program below (stripped of all error handling).

My question is the meaning of a couple of constructs in the regular
expressions (and where I can find on-line documentation for them, if
possible):

  1. “(.*?)”, or specifically, the “?” in that expression.

  2. “(?:\d+ [.]?)”, or the two question marks in this case.

Thanks in advance,
Richard

Program

input = <<DATA
2.002.1Topic 2.2.1
2.1Topic 2.1
2.2.02Topic 2.2.2
DATA

input.each do |line|   # String#each is Ruby 1.8; use input.each_line on 1.9+
  puts "\n" + "="*10 + "DBG", line, "="*10 + "DBG\n"
  if line =~ /^ (.*?) [a-zA-Z] /x   # Question 1
    prefix = $1
    if prefix =~ /^ (?:\d+ [.]?)+ $ /x   # Question 2
      arr = prefix.split('.')
      print "  Numbers: ", arr.join(', '), "\n"
    end
  end
end # input

From: RichardOnRails

1. “(.*?)”, or specifically, the “?” in that expression.

2. “(?:\d+ [.]?)”, or the two question marks in this case.

consider regex as another language on its own. basically, it describes
string patterns, like a metastring, a string about a string… :)

besides the mastering book, the online free perl doc is very informative
(and you can download the pdf too; in fact, i’m even tempted to copy it
and convert the samples to ruby. is that illegal? :)

start here:
http://perldoc.perl.org/perlrequick.html

then here:
http://perldoc.perl.org/perlretut.html
http://perldoc.perl.org/perlre.html

input.each do |line|

btw, in ruby, you can do

DATA.each do …

and you can even do DATA.rewind :)

kind regards -botp

On Nov 21, 7:32 pm, RichardOnRails wrote:

  1. “(.*?)”, or specifically, the “?” in that expression.

The ? in this case makes the match non-greedy. For example:

irb(main):007:0> s = "aaaaaaae"
=> "aaaaaaae"
irb(main):008:0> s[ /a+[aeiou]/ ]
=> "aaaaaaae"
irb(main):009:0> s[ /a+?[aeiou]/ ]
=> "aa"

By default, the ?, *, +, and {n,m} quantifiers are all greedy,
attempting to match the longest substring possible while still
allowing the regular expression to succeed. As seen above, /a+/ keeps
consuming a’s until it cannot find any more, and only then goes on to
try to match the rest of the pattern, giving characters back one at a
time if the rest fails.

Adding a ? after one of those quantifiers makes it non-greedy. For
example:

a?? - match zero or one ‘a’ characters (prefer to match zero)
a*? - match zero or more ‘a’ characters (prefer as few as possible)
a+? - match one or more ‘a’ characters (prefer as few as possible)
a{3,}? - match at least 3 ‘a’ characters (prefer as few as possible)

As seen in the irb example above, /a+?/ matched a single ‘a’, and then
checked to see if it could find a vowel afterwards. The next character
was another ‘a’, which is itself a vowel, so the match stopped at “aa”.
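
To see the greedy/non-greedy difference for each quantifier side by side, here is a small sketch. The string is the one from the irb session; the \A anchors and the results hash are just my own scaffolding for the demonstration, not anything from the original program.

    ```ruby
    s = "aaaaaaae"   # seven a's followed by an e

    # Greedy: take as many a's as possible, give back only if the rest fails.
    greedy = s[/a+[aeiou]/]
    # Non-greedy: take as few a's as possible, extend only if the rest fails.
    lazy   = s[/a+?[aeiou]/]

    # Each quantifier on its own, anchored at the start of the string so the
    # only thing that varies is how much the quantifier decides to consume.
    results = {
      "a?"     => s[/\Aa?/],
      "a??"    => s[/\Aa??/],
      "a*"     => s[/\Aa*/],
      "a*?"    => s[/\Aa*?/],
      "a{3,}"  => s[/\Aa{3,}/],
      "a{3,}?" => s[/\Aa{3,}?/],
    }

    puts "a+[aeiou]  -> #{greedy.inspect}"
    puts "a+?[aeiou] -> #{lazy.inspect}"
    results.each { |pattern, match| puts "#{pattern.ljust(10)} -> #{match.inspect}" }
    ```

With nothing after them to force a longer match, the non-greedy forms settle for the minimum they are allowed (an empty string for a?? and a*?, exactly three a's for a{3,}?).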

You’ll often see this non-greedy matching used in simple non-nested
pairing, like with HTML tags.
%r{<p>(.*?)</p>}
will match “<p>”, followed by the fewest number of characters until it
sees “</p>”.

Without the non-greedy quantifier, the .* could skim right over other
closing “</p>” tags as long as it could find one at the end.
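
A quick way to watch the skimming happen, assuming paragraph tags as the paired delimiters and a made-up two-paragraph string (both are my own sample, chosen only for illustration):

    ```ruby
    html = "<p>first</p><p>second</p>"   # hypothetical sample input

    # Non-greedy: each match stops at the nearest closing tag.
    lazy_matches = html.scan(%r{<p>(.*?)</p>}).flatten

    # Greedy: .* runs to the end of the string, then backs up only as far as
    # the LAST closing tag, swallowing the tags in between.
    greedy_matches = html.scan(%r{<p>(.*)</p>}).flatten

    puts "non-greedy: #{lazy_matches.inspect}"
    puts "greedy:     #{greedy_matches.inspect}"
    ```

The non-greedy version yields the two paragraphs separately; the greedy one yields a single capture that still contains the inner closing and opening tags. Neither handles nested tags, which is why this trick is only good for simple non-nested pairing.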

  2. “(?:\d+ [.]?)”, or the two question marks in this case.

The first one is part of the (?:…) construct. While the parentheses
in /(xxx)/ will save the match group for later matching or
substitution, putting a ?: pair at the front tells the regexp to group
the contents without saving them as a numbered group. For example:
/(?:foo|fu)?bar/
will match “foobar”, “fubar”, or “bar”, without saving “foo”, “fu”, or
“” as a group.
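
Here is a small sketch comparing the capturing and non-capturing forms of that pattern. The variable names and the "42-answer" string are hypothetical, picked only to make the group numbering visible.

    ```ruby
    capturing     = /(foo|fu)?bar/
    non_capturing = /(?:foo|fu)?bar/

    m1 = capturing.match("fubar")
    m2 = non_capturing.match("fubar")

    # Both match the same text...
    puts m1[0].inspect         # whole match with the capturing group
    puts m2[0].inspect         # whole match with the non-capturing group

    # ...but only the capturing version records the alternative it used.
    puts m1.captures.inspect   # one saved group
    puts m2.captures.inspect   # no saved groups

    # Non-capturing groups do not use up a group number, so the first real
    # capture after one is still group 1.
    m3 = /(?:\d+)-(\w+)/.match("42-answer")
    puts m3[1].inspect
    ```

This is why (?:…) shows up in the prefix check: the group is needed only so the + can repeat "digits plus optional period", and nothing inside it has to be pulled out afterwards.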

The second question mark follows a character set […], which itself
matches a single character from the options inside the set. The
question mark in this case (and in my “fubar” example above) means
“match zero or one of the preceding characters/group expressions”.
Since the character set has a single period inside it, this:
[.]?
means “And there may or may not be a period here.”

This is equivalent to the regexp:
\.?
where the backslash escapes the traditional meaning of a period (match
any character [except possibly a newline]), and instead causes it to
mean a literal period.
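
As a check that the character-set form and the backslash form really behave the same, and that a bare period does not, here is a sketch built around the prefix pattern from the program. The sample strings and variable names are my own.

    ```ruby
    bracketed = /^ (?:\d+ [.]?)+ $ /x   # period as a one-character set
    escaped   = /^ (?:\d+ \.?)+ $ /x    # period escaped with a backslash
    unescaped = /^ (?:\d+ .?)+ $ /x     # bare period: matches any character

    # Hypothetical prefixes: three taken from the sample data, three edge cases.
    samples = ["2.002.1", "2.1", "2.", "2..1", ".5", "2x1"]

    table = samples.map do |s|
      [s, bracketed.match?(s), escaped.match?(s), unescaped.match?(s)]
    end

    table.each do |s, b, e, u|
      puts format("%-8s [.]? %-5s  \\.? %-5s  .? %-5s", s, b, e, u)
    end
    ```

The first two columns agree on every sample. The third differs on "2x1", where the bare period happily accepts the "x" as a separator. Note also that a trailing period ("2.") is accepted by the pattern as written, while a doubled or leading period is not.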