Explain this Ruby regex


#1

Can someone explain this regex …

"one two".scan(/\w*/).length

returns 4. I can see it matching the two words and the space, but what
else is it matching? Is there a null terminator? I thought Ruby strings
were not null-terminated.


#2

On Sat, Oct 04, 2008, removed_email_address@domain.invalid wrote:

> "one two".scan(/\w*/).length
>
> returns 4. I can see it matching the two words and the space, but what
> else is it matching? Is there a null terminator? I thought Ruby strings
> were not null-terminated.

Try replacing #length with #inspect and seeing what the output of scan
is. You’ll find that it’s returning two empty strings as well. I
suspect what you really want is \w+…

Ben


#3

On Oct 3, 11:44 am, Ben B. removed_email_address@domain.invalid wrote:

> Ben

Yeah, you’re right, \w+ will pull out the words, which is what I want
anyway. I’m still trying to understand what \w* is doing, though.

irb(main):015:0> "one two".scan(/\w*/).inspect
=> "[\"one\", \"\", \"two\", \"\"]"

My question is: what is the last "", and where does it come from?


#4

\w* does not match the space between "one" and "two". It matches
"one", <empty string after "one">, "two", <empty string after "two">.

Here are some other examples:

irb(main):004:0> "one".scan(/^\w*/)
=> ["one"]
irb(main):005:0> "one".scan(/\w*$/)
=> ["one", ""]


Patrick


#5

The key idea here is that “*” means “match zero or more of” whereas “+”
means “match one or more of”. So, when you match \w* against "one two",
there are zero or more instances of a word character (3, in fact: ‘o’,
‘n’, and ‘e’), so that produces one result. Following that result, there
are zero matches of a word character, but since you asked for “zero or
more of”, you get that empty string result. Then rinse and repeat for
the "two" part.
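
To see where each of those results comes from, here is a minimal sketch
that records the starting offset of every match. It assumes the block
form of #scan, where $~ holds the current match; the variable name
`matches` is only for illustration.

```ruby
# Collect [start_offset, matched_text] for every match of \w* in the string.
matches = []
"one two".scan(/\w*/) { matches << [$~.begin(0), $~[0]] }

p matches
# => [[0, "one"], [3, ""], [4, "two"], [7, ""]]
```

The empty match at offset 3 sits just before the space, and the one at
offset 7 sits at the very end of the string, so no null terminator is
involved.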

FWIW, instead of looking at the result with #inspect, I found it more
informative to look at the result returned from #scan by itself, e.g.

irb> "one two".scan(/\w*/)
=> ["one", "", "two", ""]

–wpd


#6

On 03.10.2008 18:44, Patrick D. wrote:

> The key idea here is that “*” means “match zero or more of” whereas “+”
> means “match one or more of”. So, when you match \w* against "one two",
> there are zero or more instances of a word character (3, in fact: ‘o’,
> ‘n’, and ‘e’), so that produces one result. Following that result, there
> are zero matches of a word character, but since you asked for “zero or
> more of”, you get that empty string result. Then rinse and repeat for
> the "two" part.

It boils down to this statement: a subexpression with “*” potentially
matches an empty string anywhere in a string.
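
As a rough illustration of that statement, the sketch below uses a
pattern whose character never occurs in the string, so every match is
empty. The pattern /x*/ and the variable names are just assumptions
chosen for the demonstration.

```ruby
str = "one two"

# "x" never occurs, so x* matches the empty string at every position:
# once before each of the 7 characters and once at the end of the string.
empties = str.scan(/x*/)

# With "+", at least one character is required, so no empty matches appear.
words = str.scan(/\w+/)

p empties.length  # => 8
p words           # => ["one", "two"]
```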

Kind regards

robert


#7

> FWIW, instead of looking at the result with #inspect, I found it more
> informative to look at the result returned from #scan by itself, e.g.
>
> irb> "one two".scan(/\w*/)
> => ["one", "", "two", ""]

irb displays the expression value using #inspect, so you are using
#inspect even though you didn’t ask for it :-)
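
A small sketch of the difference, outside irb; the variable names are
only illustrative. Calling #inspect yourself and then letting irb
display the result applies #inspect twice, which is where the extra
quotes and backslashes come from.

```ruby
result = "one two".scan(/\w*/)

shown = result.inspect  # what irb prints for the array itself
twice = shown.inspect   # what irb prints if you call #inspect explicitly

puts shown  # prints: ["one", "", "two", ""]
puts twice  # prints: "[\"one\", \"\", \"two\", \"\"]"
```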