REGEX simplest way to capture a repeating pattern

addis_a · October 4, 2014, 11:57pm

I am not trying to solve a particular problem, but to better understand how to
match repeating patterns in a Ruby regex. Given the string 'str' below, I
want to capture every item (fieldN) that sits in a field delimited by
'-' (the choice of delimiter is not important).

str = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC -field4- end"

pat = /.*\B-(.+)-\B.*\B-(.+)-\B.*\B-(.+)-\B/
str.match(pat)
print "$1 ==> '#{$1}' $2 ==> '#{$2}' $3 ==> '#{$3}'\n"
# output: $1 ==> 'field2' $2 ==> 'field3' $3 ==> 'field4'

This example works only as long as I limit the captures to 9 and know in
advance how many '-xxxx-' fields there are, so that I can hard-code that
number in the pattern. Above, since the pattern captures only 3 items and
the string contains 4, the first item is not captured.

What is the simplest way to accomplish this for any number of items (9
or fewer, to simplify) without knowing in advance how many items appear in
the string?

Thanks Don

rustysam · October 5, 2014, 12:08am

Sorry, when I cut and pasted I used the wrong version of the pattern. This is
the correct version; there should be no leading .*

/\B-(.+)-\B.*\B-(.+)-\B.*\B-(.+)-\B/

rustysam · October 5, 2014, 12:22am

I am re-posting with only three items to be captured. My questions are
still the same as above.

(I noticed after the fact that with 4 items the first capture is not
clean: $1 ==> 'field1- CCCCCCCC -field2', and only the last 2
captures are clean: $2 ==> 'field3', $3 ==> 'field4'.)

Use this as the code that the questions and comments above are based on.

str = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC end"
pat = /\B-(.+)-\B.*\B-(.+)-\B.*\B-(.+)-\B/
str.match(pat)
print "$1 ==> '#{$1}' $2 ==> '#{$2}' $3 ==> '#{$3}'\n"

$1 ==> 'field1' $2 ==> 'field2' $3 ==> 'field3'

Sorry
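
The difference between the three-field and four-field cases can be checked directly. The sketch below assumes both strings sit on a single line (the wrap in the posts looks like a mail artifact) and uses the pattern with .* between the fields and no leading .*; the names str3, str4, caps3 and caps4 are illustrative only.

```ruby
# Assumption: single-line strings; str3/str4 are illustrative names, not from the thread.
str3 = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC end"
str4 = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC -field4- end"

# Three hard-coded capture groups, greedy quantifiers, no leading .*
pat = /\B-(.+)-\B.*\B-(.+)-\B.*\B-(.+)-\B/

caps3 = str3.match(pat).captures
caps4 = str4.match(pat).captures

p caps3  # three fields: each group gets exactly one field
p caps4  # four fields: the first greedy (.+) swallows two fields
```

With four fields the first group grows as far as it can while still leaving two fields for the remaining groups, which is why only $2 and $3 come out clean.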

rustysam · October 5, 2014, 8:28pm

Look up String#scan to match multiple times.
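
For readers following along, here is a minimal sketch of what this suggestion looks like on the four-field string from the first post. The pattern /-(\w+)-/ is an assumption: it presumes field names contain only word characters.

```ruby
str = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC -field4- end"

# With one capture group, scan returns one sub-array per match;
# flatten turns [["field1"], ["field2"], ...] into a flat list.
fields = str.scan(/-(\w+)-/).flatten

p fields
```

Because scan re-applies the pattern after each match, the number of fields does not need to be known in advance.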

rustysam · October 7, 2014, 5:08pm

I have looked at String#scan and tested it. It does exactly what I want,
except that it is not done completely in regex.

The second post points out the greedy match, which I had found to be a
problem but had a less elegant solution for. The rest of that post also
uses String#scan.

I have spent at least 20 hours since the original post trying many
things (many of them unrelated to the problem but discovered along the way).
Once I have finished I will repost the question, but here is the
outline of it.

A single pattern that matches all occurrences of -xxx- in the string
".* -xxx-.* -xxx-.* etc" without knowing in advance the number of
times the matching field will occur. The line below, as given in the second
posted answer, repeats the capture field multiple times:
pat = /-(.+?)-.*?-(.+?)-.*?-(.+?)-/

Also, String#scan will work with a single pattern,
/\B-{1}(\w+)-{1}\B/, but I have not determined whether it works under all
conditions. Perhaps \B is superfluous and /\s{1}-{1}(\w+)-{1}\s{1}/ is better?
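
One way to compare the two candidates is to try them on an edge case. The string below is hypothetical (not from the thread): fields sit at the very start and end and are separated by single spaces. The {1} quantifiers are dropped because -{1} is equivalent to a plain -.

```ruby
# Hypothetical edge case: fields at both ends, separated by single spaces.
edge = "-a- -b- -c-"

# \B is zero-width, so it consumes nothing and also holds at the string ends
# when the adjacent character is a non-word character such as '-'.
with_b = edge.scan(/\B-(\w+)-\B/).flatten

# \s must consume a real whitespace character on each side, so it cannot
# match at the string ends, and adjacent fields compete for the same space.
with_s = edge.scan(/\s-(\w+)-\s/).flatten

p with_b
p with_s
```

On this input the \B version finds all three fields while the \s version finds only the middle one, so \B is not superfluous here. A lookaround such as (?<=\s) would be another zero-width option, though it still would not match at the string ends.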

I have always been able to obtain the correct results in regex alone as
long as I know the number of times the pattern will appear in the
string.

2 -(xxx)- capture the contents of all ‘- -’

The closest I have come is using back references, but again you need to know
in advance how many times the field to be captured will appear.

This is only an exercise in understanding regex, so what I am
attempting to do may not even be possible in regex alone. I just
cannot prove it to myself. But perhaps that is why String#scan(regex)
exists.
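
One observation that bears on this: in Ruby, as in most regex engines, a capture group placed inside a repetition keeps only its last match. The sketch below illustrates that with a single repeated group; the three-field string and the pattern are illustrative, not taken from the thread.

```ruby
str = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- end"

# One capture group, repeated with +. The group matches three times,
# but only the final repetition is kept in the MatchData.
m = str.match(/(?:.*?-(\w+)-)+/)

p m.captures
```

So a single match call can report at most as many values as there are groups written in the pattern, which is why the count has to be hard-coded unless the pattern is re-applied, as String#scan does.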

Thanks again
Don

rustysam · October 6, 2014, 7:01pm

Don N. wrote in post #1159035:

This example works as long as I both limit the capture to 9 and know in
advance how many ‘-xxxx-’ fields there are in advance

'*' is a greedy operator; it matches as much as possible,
so the first -(.*)- will 'eat' as many fields as it can.

The correction can be:
pat = /-(.+?)-.*?-(.+?)-.*?-(.+?)-/
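
A side-by-side sketch of the greedy and lazy forms on the four-field string from the first post, assuming the string is a single line and that the separators between groups are .* and .*? respectively; the variable names are illustrative.

```ruby
str = "start -field1- CCCCCCCC -field2- CCCCCCC -field3- CCCCCCCCCCCCC -field4- end"

greedy = /-(.+)-.*-(.+)-.*-(.+)-/         # each quantifier takes as much as it can
lazy   = /-(.+?)-.*?-(.+?)-.*?-(.+?)-/    # each quantifier takes as little as it can

greedy_caps = str.match(greedy).captures
lazy_caps   = str.match(lazy).captures

p greedy_caps
p lazy_caps
```

The lazy form returns the first three fields cleanly, but it still needs one hard-coded group per field.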

What is the simplest way to accomplish this for any number of items ( 9
or less to simplify) without knowing in advance how many items appear in
the string.

p str.scan(/-(.*?)-/).flatten

or, more optimized :

p str.scan(/-([^-]*?)-/).flatten

This one works because the field is terminated by a single character.
For a multi-character terminator, for example '--', the
simplest solution is the lazy match:

p str.scan(/--(.*?)--/).flatten

str = " --ee-- --aa-b-b-- --999.9e-1-- - - -"
str.scan(/--(.*?)--/).flatten
=> ["ee", "aa-b-b", "999.9e-1"]