Explain this ruby regex

unknown · October 3, 2008, 5:40pm

Can someone explain this regex …

"one two".scan(/\w*/).length

returns 4. I can see it matching the two words and the space, but what
else is it matching? Is there a null terminator? I thought Ruby strings
were not null-terminated.

unknown · October 3, 2008, 5:46pm

On Sat, Oct 04, 2008, [email protected] wrote:

"one two".scan(/\w*/).length

returns 4. I can see it matching the two words and the space, but what
else is it matching? Is there a null terminator? I thought Ruby strings
were not null-terminated.

Try replacing #length with #inspect and seeing what the output of scan
is. You’ll find that it’s returning two empty strings as well. I
suspect what you really want is \w+…
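
To see the difference side by side, here is a minimal sketch (the variable names are only illustrative):

```ruby
text = "one two"

# \w* may match zero characters, so scan also reports the empty
# matches it finds where no word character is available.
star = text.scan(/\w*/)

# \w+ needs at least one word character, so only the words remain.
plus = text.scan(/\w+/)

p star  # prints ["one", "", "two", ""]
p plus  # prints ["one", "two"]
```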

Ben

unknown · October 3, 2008, 7:56pm

Yeah, you’re right, \w+ will pull out the words, which is what I want
anyway. I’m still trying to understand what \w* is doing, though.

irb(main):015:0> "one two".scan(/\w*/).inspect
=> "[\"one\", \"\", \"two\", \"\"]"

My question is: what is the last "", and where does it come from?

unknown · October 3, 2008, 8:38pm

\w* does not match the space between “one” and “two”. It matches
“one”, <empty string after “one”>, “two”, <empty string after “two”>.

There are some other examples:

irb(main):004:0> "one".scan(/^\w*/)
=> ["one"]
irb(main):005:0> "one".scan(/\w*$/)
=> ["one", ""]

–
Patrick

unknown · October 3, 2008, 9:29pm

The key idea here is that “*” means “match zero or more of” whereas “+”
means “match one or more of”. So, when you match \w* against “one two”,
there are zero or more instances of a word character (3, in fact, ‘o’, ‘n’,
and ‘e’), so that produces one result. Following that result, there are
zero matches of a word character, but since you asked for “zero or more of”,
you get that empty string result. Later, rinse, repeat for the “two” part.
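
One way to check this walk-through is to record where each match starts and ends. The following is only a sketch: it relies on scan setting the last-match data inside its block, and the `matches` array is just an illustrative name.

```ruby
text = "one two"
matches = []

# With a block, scan yields each matched string and updates the
# last-match data, so the offsets of every match can be recorded.
text.scan(/\w*/) do |m|
  md = Regexp.last_match
  matches << [m, md.begin(0), md.end(0)]
end

matches.each { |m, from, to| puts "#{m.inspect} at #{from}...#{to}" }
# prints:
#   "one" at 0...3
#   "" at 3...3
#   "two" at 4...7
#   "" at 7...7
```

The first empty string is the zero-length match at the position of the space, and the last one is the zero-length match at the very end of the string.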

FWIW, instead of looking at the result with #inspect, I found it more
informative to look at the result returned from #scan by itself, e.g.

irb> "one two".scan(/\w*/)
=> ["one", "", "two", ""]

–wpd

unknown · October 5, 2008, 6:21pm

On 03.10.2008 18:44, Patrick D. wrote:

The key idea here is that “*” means “match zero or more of” whereas “+”
means “match one or more of”. So, when you match \w* against “one two”,
there are zero or more instances of a word character (3, in fact, ‘o’, ‘n’,
and ‘e’), so that produces one result. Following that result, there are
zero matches of a word character, but since you asked for “zero or more of”,
you get that empty string result. Later, rinse, repeat for the “two” part.

It boils down to this statement: a subexpression with “*” potentially
matches an empty string anywhere in a string.
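
A small sketch of that statement, using a string in which the starred subexpression never finds a character to consume (the sample string is arbitrary):

```ruby
# None of these characters is a digit, so \d* can only match the
# empty string: once before each character and once at the end.
result = "abc".scan(/\d*/)

p result         # prints ["", "", "", ""]
p result.length  # prints 4
```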

Kind regards

robert

unknown · October 3, 2008, 9:47pm

FWIW, instead of looking at the result with #inspect, I found it more
informative to look at the result returned from #scan by itself, e.g.

irb> "one two".scan(/\w*/)
=> ["one", "", "two", ""]

irb displays the value of an expression using “inspect”, so you are
using inspect even though you didn’t ask for it.
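
Outside irb the two forms can be compared directly. This is a sketch assuming irb's default behaviour of printing the inspect form; the variable names are illustrative:

```ruby
result = "one two".scan(/\w*/)

# irb prints the inspect form of whatever an expression returns.
shown_by_irb = result.inspect          # what irb shows for the array
shown_twice  = result.inspect.inspect  # what irb shows for array.inspect

puts shown_by_irb  # prints ["one", "", "two", ""]
puts shown_twice   # prints "[\"one\", \"\", \"two\", \"\"]"
```

Calling #inspect yourself in irb therefore shows the string form inspected a second time, which is why the extra quotes and backslashes appear.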