Regular expressions question


#1

hello,

i need to capture all matches for a group. for example if

'ab c' =~ /^(.)*$/

i would like to get array [ 'a', 'b', ' ', 'c' ]

could not figure out how to do it in ruby. String#scan did not seem to
be the right thing. please help.

thanks
konstantin


#2

On Dec 14, 2005, at 3:02 PM, ako… wrote:

hello,

i need to capture all matches for a group. for example if

'ab c' =~ /^(.)*$/

i would like to get array [ 'a', 'b', ' ', 'c' ]

could not figure out how to do it in ruby. String#scan did not seem to
be the right thing. please help.

When using scan(), you need to remove the anchoring:

"ab c".scan(/./)
=> ["a", "b", " ", "c"]

Hope that helps.

James Edward G. II
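
For reference, a minimal sketch of why the original attempt yields only one value: a repeated capture group keeps just its last iteration, so it cannot be read back as a list, while scan collects every match. The variable names are illustrative only.

```ruby
str = 'ab c'

# A repeated group is overwritten on each iteration, so only the last
# character it matched is still available after the match.
md = str.match(/^(.)*$/)
last_capture = md[1]          # "c"
all_captures = md.captures    # ["c"]

# scan restarts the pattern after each match and collects every result.
chars = str.scan(/./)         # ["a", "b", " ", "c"]

p last_capture, all_captures, chars
```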


#3

On Wed, 14 Dec 2005 21:00:56 -0000, ako… removed_email_address@domain.invalid wrote:

i need to capture all matches for a group. for example if

'ab c' =~ /^(.)*$/

i would like to get array [ 'a', 'b', ' ', 'c' ]

You could try:

irb(main):001:0> "ab c".split('') # split on nothing
=> ["a", "b", " ", "c"]

irb(main):002:0> "ab c".split(//) # same again
=> ["a", "b", " ", "c"]

irb(main):003:0> "ab c".scan(/./) # scan on any single char
=> ["a", "b", " ", "c"]
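
The three calls agree on this input, but not on every input. A small sketch, assuming a string that contains a newline: `.` does not match a newline unless the /m flag is set, so scan(/./) drops it while split(//) keeps it.

```ruby
str = "a\nb"

by_split  = str.split(//)     # ["a", "\n", "b"]
by_scan   = str.scan(/./)     # ["a", "b"]  (the newline is skipped)
by_scan_m = str.scan(/./m)    # ["a", "\n", "b"]

p by_split, by_scan, by_scan_m
```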


#4

thank you. this was just an example. in general, is it possible to get
a collection of captures for a group without having to write custom
code?


#5

thank you. the question is general.

if i wanted to parse a list of letters separated by spaces and commas:

'a , b,c' =~ /^(?:(\w)\s*,\s*)*(\w)$/

i need to get ['a','b'] in group 1 and ['c'] in group 2. yes, i know i
can split, then massage the result some more and get the final result.
is there a way to get to groups’ captures after a regex match? like in
microsoft’s .net?


#6

On Wed, 14 Dec 2005 21:34:52 -0000, ako… removed_email_address@domain.invalid wrote:

thank you. this was just an example. in general, is it possible to get
a collection of captures for a group without having to write custom
code?

Have to admit I’m not exactly a regex wiz, but I imagine it can be done
somehow. I assume you mean having a repeated capturing group append to
an array any number of times?

But I still think scan is a good tool for the job; it can take any
regexp. I don’t think a single regexp is really intended for doing
variable numbers of captures anyway (?).

irb(main):054:0> "ab c".scan(/\w|\s/)
=> ["a", "b", " ", "c"]

or

irb(main):052:0> "this is a test".scan(/\w+/)
=> ["this", "is", "a", "test"]

or even

irb(main):053:0> "this is a test".scan(/\w+|\s/)
=> ["this", " ", "is", " ", "a", " ", "test"]

Cheers,
Ross


#7

On Dec 14, 2005, at 4:03 PM, ako… wrote:

thank you. the question is general.

if i wanted to parse a list of letters separated by spaces and commas:

'a , b,c' =~ /^(?:(\w)\s*,\s*)*(\w)$/

i need to get ['a','b'] in group 1 and ['c'] in group 2. yes, i know i
can split, then massage the result some more and get the final result.
is there a way to get to groups’ captures after a regex match? like in
microsoft’s .net?

Perl-style variables:

"abc" =~ /(.)(.)(.)/
=> 0

p [$1, $2, $3]
["a", "b", "c"]
=> nil

Or object oriented:

md = "abc".match(/(.)(.)(.)/)
=> #<MatchData:0x325dc8>

p [md[1], md[2], md[3]]
["a", "b", "c"]
=> nil

Hope that helps.

James Edward G. II
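
If the goal is an array rather than individual variables, MatchData#captures returns all groups at once. A short sketch; note that it gives one entry per group in the pattern, not one entry per repetition of a group.

```ruby
md = 'abc'.match(/(.)(.)(.)/)

groups     = md.captures   # ["a", "b", "c"]
everything = md.to_a       # ["abc", "a", "b", "c"]  (index 0 is the whole match)

# Regexp.last_match reads the same data as $~ and $1 without the sigils.
first_group = Regexp.last_match(1)   # "a"

p groups, everything, first_group
```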


#8

On Wed, 14 Dec 2005 21:59:27 -0000, ako… removed_email_address@domain.invalid wrote:

I don’t really get what you mean. I don’t understand the rules that got
a and b into one group and c into another. When you say it’s a general
question, do you mean you just want access to the captures from some
regexp match?

irb(main):009:0> "a , b,c" =~ /(\w\s*?,\s*?\w)\s*?,\s*?(\w)/
=> 0
irb(main):010:0> $1
=> "a , b"
irb(main):011:0> $2
=> "c"
irb(main):012:0> $~[1]
=> "a , b"
irb(main):013:0> $~[2]
=> "c"
irb(main):014:0> md = /(\w\s*?,\s*?\w)\s*?,\s*?(\w)/.match("a, b,c")
=> #<MatchData:0xb7a47860>
irb(main):015:0> md[1]
=> "a, b"
irb(main):016:0> md.captures[1]
=> "c"
irb(main):017:0> $~.inspect
=> "#<MatchData:0xb7a47860>"

(and others…)

Hope that helps,
Ross


#9

You should be able to tell who this message is meant for:

PLEASE stop sending out code that uses any of the perl ${x} variables

They are ugly and have no place in Ruby … they are only provided to
make the transition of Perl people easier …

Please teach people to use MatchData objects …

my_regex = /(\w\s*?,\s*?\w)\s*?,\s*?(\w)/

matches = my_regex.match( "a , b,c" )

element 0 of the matches object will contain the complete matched
string.

each element after that will map to one of the groups you defined …

so:

matches[0] will be the whole string
"a , b,c"
matches[1] will be your first group
"a , b"
matches[2] will be your second group
"c"

… seriously, we’re not helping people make cleaner code when we show
approval for the ugly/evil ${x} warts we’ve kept from Perl.

… show people the beauty and cleanliness of using an OOP solution …

I hope you agree.

j.
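
One practical point about MatchData: match returns nil when the pattern does not match, so the result can be tested directly. The sketch below also uses named groups, which arrived in Ruby 1.9 and are therefore newer than this thread; the names list_re, head and tail are made up for illustration.

```ruby
list_re = /(?<head>\w\s*,\s*\w)\s*,\s*(?<tail>\w)/

# match returns a MatchData object on success and nil on failure.
if (m = list_re.match('a , b,c'))
  head = m[:head]   # "a , b"
  tail = m[:tail]   # "c"
end

no_match = list_re.match('abc')   # nil, there is no comma to match

p head, tail, no_match
```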



#10

ako… wrote:

thank you. the question is general.

if i wanted to parse a list of letters separated by spaces and commas:

'a , b,c' =~ /^(?:(\w)\s*,\s*)*(\w)$/

i need to get ['a','b'] in group 1 and ['c'] in group 2. yes, i know i
can split, then massage the result some more and get the final result.
is there a way to get to groups’ captures after a regex match? like in
microsoft’s .net?

t = 'a , b,c'.split( /\s*,\s*/ )
group1 = t[0..-2]
group2 = t[-1,1]
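
Note that split on its own accepts any input, including strings that are not a well-formed list. A sketch that checks the shape of the whole string first and only then splits; the method name split_letter_list is hypothetical.

```ruby
# Hypothetical helper: returns [leading_letters, last_letter], or nil when
# the string is not a comma-separated list of single word characters.
def split_letter_list(str)
  return nil unless str =~ /\A\w(?:\s*,\s*\w)*\z/

  t = str.split(/\s*,\s*/)
  [t[0..-2], t[-1, 1]]
end

p split_letter_list('a , b,c')   # => [["a", "b"], ["c"]]
p split_letter_list('a ; b,c')   # => nil
```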


#11

From: “Jeff W.” removed_email_address@domain.invalid

PLEASE stop sending out code that uses any of the perl ${x} variables …

They are ugly and have no place in Ruby … they are only provided to
make the transition of Perl people easier …

Thankfully, this is Ruby, and not Python with its rigid
Only One Way mentality.

Myself, though I’ve been aware of MatchData for going on
five years now, I find I don’t use it that often. The
$1…$n variables are perfectly legible to me. They have
a fine history too: not just Perl but awk, and Unix shell
programming . . .

Regards,

Bill


#12

On Thu, 15 Dec 2005 00:16:52 -0000, Jeff W. removed_email_address@domain.invalid
wrote:

You should be able to tell who this message is meant for:

Why not just address me directly?

PLEASE stop sending out code that uses any of the perl ${x} variables …

Well, okay. No need to shout though, is there?

Just trying to put a bit back, you know?


#13

Ross B. wrote:

Well, okay. No need to shout though, is there?

Just trying to put a bit back, you know?

Ross, don’t pay too much attention to unreasonable fanatics.

The first edition of the Pickaxe says:

“Having said all this, we have to 'fess up. Andy and Dave normally
use the $-variables rather than worrying about MatchData objects.
For everyday use, they just end up being more convenient.
Sometimes we just can’t help being pragmatic.”


#14

On 12/14/05, Jeff W. removed_email_address@domain.invalid wrote:

PLEASE stop sending out code that uses any of the perl ${x} variables …

Regular expressions are the only area where I still use Perl magic
variables, because they’re concise, readable, and work well in that
context. They feel like a regexp standard to me.

The other magic variables I’ve dispensed with.

Nick


#15

Hi,

From: “ako…” removed_email_address@domain.invalid

i give up. there seems to be no way to get all the captures for a
group. the corresponding $ variable just has the last one.

Could you help us to understand why #scan didn’t meet your needs?

Called without a block, #scan returns an array of matches:

"abc--------abc--------abc".scan(/(a)(b)(c)/)
=> [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"]]

Called with a block, #scan calls your block each time a match is
found:

"abc--------abc--------abc".scan(/(a)(b)(c)/) { puts "#$1, #$2, #$3" }
a, b, c
a, b, c
a, b, c

Hope this helps,

Bill


#16

Bill,

scan does not help because it can match a portion of the source string,
and what is in between the matches is skipped. so scan is just a
special case of the functionality that i was looking for. i need to
make sure the whole string has a defined structure and get parts of it
as groups.

konstantin
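
One way to get both properties without a single regex is StringScanner from the standard library: it only matches at the current position, so nothing between tokens can be skipped, and each token is collected as it is read. A minimal sketch under those assumptions; the method name scan_letter_list is hypothetical.

```ruby
require 'strscan'

# Hypothetical helper: returns [leading_letters, last_letter], or nil when
# the input is not a comma-separated list of single word characters.
def scan_letter_list(str)
  ss = StringScanner.new(str)
  letters = []

  loop do
    # StringScanner#scan matches only at the current position.
    letter = ss.scan(/\w/)
    return nil unless letter

    letters << letter
    break unless ss.scan(/\s*,\s*/)   # no separator means the list is over
  end

  return nil unless ss.eos?           # reject trailing junk
  [letters[0..-2], letters[-1, 1]]
end

p scan_letter_list('a , b,c')   # => [["a", "b"], ["c"]]
p scan_letter_list('a b,c')     # => nil
```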


#17

i give up. there seems to be no way to get all the captures for a
group. the corresponding $ variable just has the last one. thanks to
everyone who responded. sorry, did not mean to start a war over
people’s coding styles.

konstantin


#18

William J. wrote:

Ross, don’t pay too much attention to unreasonable fanatics.

The first edition of the Pickaxe says:

“Having said all this, we have to 'fess up. Andy and Dave normally
use the $-variables rather than worrying about MatchData objects.
For everyday use, they just end up being more convenient.
Sometimes we just can’t help being pragmatic.”

How convenient that you quote that without quoting the drawbacks listed
first…

Sheesh, if you want perl, use perl.


Neil S. - removed_email_address@domain.invalid

‘A republic, if you can keep it.’ – Benjamin Franklin


#19

From: “ako…” removed_email_address@domain.invalid

scan does not help because it can match a portion of the source string,
and what is in between the matches is skipped. so scan is just a
special case of the functionality that i was looking for. i need to
make sure the whole string has a defined structure and get parts of it
as groups.

Ah, OK thanks. From your earlier post:

if i wanted to parse a list of letters separated by spaces and commas:

'a , b,c' =~ /^(?:(\w)\s*,\s*)*(\w)$/

i need to get ['a','b'] in group 1 and ['c'] in group 2.

What about:

'a , b,c' =~ /^((?:\w\s*,\s*)*)(\w)$/
last_match = $2
first_matches = $1.scan(/\w/)

Since we first verified the whole string conforms to the required
pattern, we can then safely perform the scan on the captured group
to obtain the individual matches.

Or we could write the scan using look-ahead assertions, as another
way to prevent the skipping of in-between parts:

str = 'a , b,c'

# first verify whole pattern matches, and get final match group

if str =~ /^(?:\w\s*,\s*)*(\w)$/
  last_match = $1
  first_matches =
    str.scan(/(?:(\w)\s*,\s*)(?=(?:\w\s*,\s*)*\w$)/).flatten
end

last_match => "c"

first_matches => ["a", "b"]

HTH,

Bill


#20

On Dec 14, 2005, at 6:16 PM, Jeff W. wrote:

You should be able to tell who this message is meant for:

Yes, I recognize that you are probably speaking at least in part to
me, since I did that in this very thread. You can call me by name if
you like. I’m a big boy and I can take it. ;)

PLEASE stop sending out code that uses any of the perl ${x}
variables …

Hang on there, Mr. Code Police. Let’s not lay down the law too
heavily before we get into this…

They are ugly and have no place in Ruby … they are only provided to
make the transition of Perl people easier …

I seriously doubt those variables were invented in Perl. They are a
common feature of many Regular Expression implementations and I’m not
sure they are even that ugly. $1 holds what was grabbed by the first
set of parentheses. Fairly logical.

Please teach people to use MatchData objects …

I also showed a MatchData example.

I’ve used them a time or two, but honestly, they just don’t feel
right to me. I’ve stopped using the default variable, I’m using a
two-space tab, etc. I’m Ruby assimilated, but I just like the Regexp-
linked variables.

I see a lot of code running the Ruby Q. and I feel quite confident
saying that the Regexp variables are far more common than MatchData.
I don’t think that says anything bad about the latter, but it does
tell me that you are in the minority. ;)

We won’t yell at you for using MatchData, if you’ll provide the same
consideration…

James Edward G. II