Help needed with regular expression

glenn · May 8, 2007, 3:02pm

Hi

I'm working through the “Best of Ruby Quiz” book, which some of you
might be familiar with, but hey, don't worry if not, you can probably
still help me - I've found a regular expression that does what I
want, but I'm not quite sure why it works.

Given:

story = "The ((velocity)) ((colour)) ((wildbeast)) ((action)) over the ((adjective)) ((domesticbeast))"

I want to parse this into an array such that each element of the array
is the string split on the “((blabla))” bits.
This does that:

irb(main):052:0> story.split /\(\(.*?\)\)/
=> ["The ", " ", " ", " ", " over the ", " "]

However I also want the sections marked “((blabla))” included as
well… I fiddled a bit and got this, which works:
irb(main):053:0> story.split /(\(\(.*?\)\))+/
=> ["The ", "((velocity))", " ", "((colour))", " ", "((wildbeast))", " ",
"((action))", " over the ", "((adjective))", " ", "((domesticbeast))"]

However, I'm not exactly sure what makes this work - can anyone
illuminate this for me?

glenn

glenn · May 8, 2007, 6:05pm

Glenn, this tool may help you understand what's happening.
http://weitz.de/regex-coach/

Regular Expressions Reference has some good
explanations

It's great because you can type in a target string and play around with
the expression and see if it works. You'll need to remove the / at the
beginning and end of the expression though.

There are a couple tabs in there too where you can see the decision
tree, how it will split, etc.

Since I don't know how well you understand regexes, forgive me if this is
stating some of the obvious.
In your regular expression the \( is an escaped (, so it matches a literal (
symbol. A backslash shows up in a couple of situations: either to escape
special symbols, or in things like \w, which is a predefined
pattern [\w = letters, digits and underscore]

story = "The ((velocity)) ((colour)) ((wildbeast)) ((action)) over the ((adjective)) ((domesticbeast))"
irb(main):052:0> story.split /\(\(.*?\)\)/
irb(main):053:0> story.split /(\(\(.*?\)\))+/

I'm not a master of regexes but I'll take a stab at this one.
You are asking the story variable to split itself into an array
every time the pattern is matched.

So your first pattern is looking for ((X)) where X is another pattern.
X, or .*?, is the dot [any non-line break character] repeated zero or more
times. The ? after the * makes the repetition lazy (non-greedy), which I
believe means to find the first / smallest possible match (so the . doesn't
keep going until the last )) it encounters).
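If it helps, the smallest-match behaviour can be checked in irb with something like the sketch below. The `sample` string is just a shortened, made-up version of the story, and the variable names are only for illustration.

```ruby
# Greedy vs. lazy matching on a shortened version of the story string.
sample = "The ((velocity)) ((colour)) fox"

# Greedy: .* runs as far as it can, so the match ends at the LAST "))".
greedy = sample[/\(\(.*\)\)/]

# Lazy: .*? stops at the FIRST "))" that lets the match succeed.
lazy = sample[/\(\(.*?\)\)/]

puts greedy   # ((velocity)) ((colour))
puts lazy     # ((velocity))
```

Without the ?, the two placeholders are swallowed as one match, which is why the lazy form is the one that splits the story correctly.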

I’m a little less certain here so others may correct me or fill in the
bits I miss.

In the second pattern, the main difference is that the first pattern is
surrounded by parens, with a +.

By putting parens around your pattern you are grouping it. Generally it
is used so another operation can be performed. An example pattern might
be
/sh(op|irt)/ where either shop or shirt will match, but shower will not.
In your case I believe it groups the ((X)) pattern so it can be matched
multiple times.
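A small sketch of that grouping example, in case you want to try it. The word list and variable names are made up for illustration.

```ruby
# The group (op|irt) lets the alternation apply only to the part after "sh".
pattern = /sh(op|irt)/
words = %w[shop shirt shower]

# Keep only the words the pattern matches.
matched = words.select { |word| word =~ pattern }
puts matched.inspect        # ["shop", "shirt"]

# The unescaped parens also capture: group 1 holds whichever branch matched.
captured = "shirt"[pattern, 1]
puts captured               # irt
```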

glenn · May 8, 2007, 6:29pm

On May 8, 9:01 am, glenn wrote:

=> ["The ", " ", " ", " ", " over the ", " "]

However I also want the sections marked “((blabla))” included as
well… I fiddled a bit and got this, which works:
irb(main):053:0> story.split /(\(\(.*?\)\))+/
=> ["The ", "((velocity))", " ", "((colour))", " ", "((wildbeast))", " ",
"((action))", " over the ", "((adjective))", " ", "((domesticbeast))"]

However, I'm not exactly sure what makes this work - can anyone
illuminate this for me?

String#split will normally take a pattern representing a delimiter,
and split the string into parts that are separated by the delimiter,
returning the parts.

However, if you enclose the pattern in capturing parens, then split
returns both the parts and the delimiters.

So:

"foo-bar-baz".split(/-/)
=> ["foo", "bar", "baz"]
"foo-bar-baz".split(/(-)/)
=> ["foo", "-", "bar", "-", "baz"]

Your pattern is enclosed in parens, so it will get returned along with
the parts between the pattern.

The pattern is:

(\(\(.*?\)\))+

Working from the inside:

\(\(   two literal left parens, followed by
.*?    match shortest sequence of any char except \n, followed by
\)\)   two literal right parens

This is wrapped in (), which are capturing parens (since they aren’t
escaped with a backslash)

The pattern is followed by a +, which means “occurring one or more
times”. You may not want this, because it would treat “((foo))((bar))”
as a single delimiter.
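To see the effect of the + for yourself, here is a rough sketch. The `adjacent` string is a made-up input with two placeholders back to back.

```ruby
# Two placeholders with nothing between them.
adjacent = "a((foo))((bar))b"

# With +, both placeholders are consumed as ONE delimiter, and the capture
# group only remembers its last repetition, so "((foo))" is lost.
with_plus = adjacent.split(/(\(\(.*?\)\))+/)
puts with_plus.inspect      # ["a", "((bar))", "b"]

# Without +, each placeholder is its own delimiter; the empty string is the
# (empty) text between the two adjacent delimiters.
without_plus = adjacent.split(/(\(\(.*?\)\))/)
puts without_plus.inspect   # ["a", "((foo))", "", "((bar))", "b"]
```

For the story in the original post the two patterns give the same result, because every placeholder there is separated by at least a space.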

Now when you split on this, you get all the “((sometext))” elements,
together with the stuff in between them.

If you just want to capture the “((sometext))” words, you should look
at String#scan
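A minimal sketch of the scan approach, using the story string from the original post. The variable names `placeholders` and `names` are only for illustration.

```ruby
story = "The ((velocity)) ((colour)) ((wildbeast)) ((action)) over the " \
        "((adjective)) ((domesticbeast))"

# With no capture group, scan returns every full match.
placeholders = story.scan(/\(\(.*?\)\)/)
puts placeholders.inspect
# ["((velocity))", "((colour))", "((wildbeast))", "((action))", "((adjective))", "((domesticbeast))"]

# With one capture group, scan returns an array of one-element arrays,
# so flatten leaves just the text inside the double parens.
names = story.scan(/\(\((.*?)\)\)/).flatten
puts names.inspect
# ["velocity", "colour", "wildbeast", "action", "adjective", "domesticbeast"]
```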