Do You Understand Regular Expressions?

Hi all.

I’m pretty new to Ruby and that sort of thing, and I’m having a few
problems understanding regular expressions. I’m hoping one of you can
point me in the right direction.

I want to replace an entire string with another string. I know you
don’t need regular expressions for that, but it’s part of a more
generic approach. Anyway, the problem I’m having is that my regular
expressions are finding two matches instead of one, and I don’t
understand why. I’ve narrowed down my confusion to the following code,
which shows some output from irb:

irb(main):001:0> “hello”.scan(/.*/)
=> [“hello”, “”]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

The same thing can be seen when substituting - this is closer to how
I’m using regular expressions in my code:

irb(main):001:0> “hello”.gsub(/.*/, “P”)
=> “PP”

Two substitutions are made and I expected one. So am I right or wrong
to expect one substitution?

Please help - this is driving me nuts!

And in case it helps…

$ ruby --version
ruby 1.8.5 (2006-08-25) [i486-linux]

Thanks in advance.

[email protected] wrote:

irb(main):001:0> “hello”.scan(/.*/)
=> [“hello”, “”]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

Try anchoring the match: /^.*/

irb(main):001:0> “hello”.scan(/.*/)
=> [“hello”, “”]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

You can search for at least one occurrence like this:

“hello”.scan(/.+/)

“hello”.gsub(/.+/, “P”) => ‘P’
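
A quick sketch comparing the options mentioned so far. Two of them are not from the thread and are added here only for comparison: String#sub (which stops after the first match) and the \A anchor. The variable names are illustrative.

```ruby
s = "hello"

# gsub keeps searching after the first match, so /.*/ also matches the
# empty string left at the end of "hello".
star = s.gsub(/.*/, "P")

# Requiring at least one character rules out the trailing empty match.
plus = s.gsub(/.+/, "P")

# sub replaces only the first match, so the empty match is never tried.
first_only = s.sub(/.*/, "P")

# \A only matches at the very start of the string, so a second attempt
# starting at the end of "hello" cannot succeed.
anchored = s.gsub(/\A.*/, "P")

p star        # "PP"
p plus        # "P"
p first_only  # "P"
p anchored    # "P"
```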

As an introduction, I find

quite instructive for the use of regexps in Ruby.

Best regards,

Axel

Axel E. wrote:

irb(main):001:0> “hello”.scan(/.*/)
=> [“hello”, “”]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

That doesn’t really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

"hello".scan(/..*/)
=> [“hello”]

Hello Ryan

In message “Do You Understand Regular Expressions?”
on 21.06.2007, Ryan M. [email protected] writes:

RM> I agree. Can someone explain why gsub, sub or scan matches with *
RM> are different than =~ matches with *

RM> puts “hello”.gsub(/[aeiou]/, ‘<\1>’) # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, '<\1>' )

Please note the () around the expression.
After that you can refer with \1 to the found
letters.

RM> puts “hello”.gsub(/.*/, ‘<\1>’) # <><>

irb(main):029:0> "hello".gsub(/(.*)/, '<\1>')
=> "<hello><>"
irb(main):030:0> "hello".gsub(/(.+)/, '<\1>')
=> "<hello>"

RM> print “before: #{$`}\n” # before: hello

irb(main):031:0> $`
=> “”

RM> print “match: #{$&}\n” # match:

irb(main):032:0> $&
=> “hello”

RM> print “after: #{$'}\n” # after:

irb(main):033:0> $'
=> “”

hope this helps.

regards.
Karl-Heinz

On 6/20/07, Daniel DeLorme [email protected] wrote:

That doesn’t really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

It’s because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches “zero or more” characters when it starts the
search for .* it matches the absence (the ‘zero’) and then matches the
string (the ‘or more’).

To prevent this you need to indicate to your regular expression that
you only want the subset of ‘everything’ that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn’t return the absence

/^.*/ will start the search at the start of the pattern, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

/..*/ will match everything after something, this is a modified form
of the above that isn’t tied to the start of the string

– Stephen

Daniel DeLorme wrote:

Axel E. wrote:

irb(main):001:0> “hello”.scan(/.*/)
=> [“hello”, “”]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

That doesn’t really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel

I agree. Can someone explain why gsub, sub or scan matches with * are
different than =~ matches with *

puts "hello".gsub(/[aeiou]/, '<\1>') # h<>ll<>
puts "hello".gsub(/.*/, '<\1>') # <><>
print "before: #{$`}\n" # before: hello
print "match: #{$&}\n" # match:
print "after: #{$'}\n" # after:

puts "hello" =~ (/.*/) # 0
print "before: #{$`}\n" # before:
print "match: #{$&}\n" # match: hello
print "after: #{$'}\n" # after:

thanks!

On Jun 21, 2007, at 9:47 AM, Stephen B. wrote:

Daniel

It’s because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

– Stephen

That still doesn’t really explain why “hello”.scan(/.*/) => [“hello”,
“”]

Why wouldn’t it be [“hello”, “”, “”, “”, “”, “”, “”, “”, “”, “”, “”,
“”, … ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

-Rob

Rob B. http://agileconsultingllc.com
[email protected]

Hi –

On Thu, 21 Jun 2007, Stephen B. wrote:

It’s because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches “zero or more” characters when it starts the
search for .* it matches the absence (the ‘zero’) and then matches the
string (the ‘or more’).

It’s the other way around, though; it matches “hello” first, and
then “”. So the zero-matching (which I admit I’m among those who find
unexpected) is happening at the end.

To prevent this you need to indicate to your regular expression that
you only want the subset of ‘everything’ that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn’t return the absence

/^.*/ will start the search at the start of the pattern, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

Here, again, “hello” is first, so /^.*/ matches it but doesn’t match
the second time ("") because the “” isn’t anchored to ^.

David

On Jun 21, 4:43 am, Wild Karl-Heinz [email protected] wrote:

irb(main):024:0> "hello".gsub( /([aeiou])/, '<\1>' )

Please note the () around the expression.
After that you can refer with \1 to the found
letters.

Why not simply change the 1 to a 0 ?

irb(main):001:0> puts "hello".gsub(/[aeiou]/, '<\0>')
h<e>ll<o>
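
For reference, a small sketch of the two backreference forms being discussed. Note that the replacement strings are single-quoted: in a double-quoted Ruby string, \1 is read as an octal character escape rather than a backreference, so the backslash has to be doubled there.

```ruby
s = "hello"

# \1 refers to the first capture group, so the pattern needs parentheses.
with_group = s.gsub(/([aeiou])/, '<\1>')

# \0 refers to the whole match, so no group is needed.
whole_match = s.gsub(/[aeiou]/, '<\0>')

# Without a group, \1 is empty and the vowels are simply dropped.
no_group = s.gsub(/[aeiou]/, '<\1>')

# Double quotes need the backslash doubled to reach gsub intact.
double_quoted = s.gsub(/([aeiou])/, "<\\1>")

p with_group     # "h<e>ll<o>"
p whole_match    # "h<e>ll<o>"
p no_group       # "h<>ll<>"
p double_quoted  # "h<e>ll<o>"
```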

On 6/21/07, [email protected] [email protected] wrote:
[snip]

So: since * matches “zero or more” characters when it starts the
search for .* it matches the absence (the ‘zero’) and then matches the
string (the ‘or more’).

It’s the other way around, though; it matches “hello” first, and
then “”. So the zero-matching (which I admit I’m among those who find
unexpected) is happening at the end.

Ah, but notice:

“hello”.scan(/.*$/)
=> [“hello”, “”]

“hello”.scan(/^.*/)
=> [“hello”]

Strange indeed, but it seems that’s how it’s working. Although I
suspect I’m not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it’s parsed as “zero or more of anything
before the end of the string”.

So, if that’s correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

– Stephen

Hi –

On Fri, 22 Jun 2007, Stephen B. wrote:

character.

So, if that’s correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

Yes, that was what I was mostly going by :-)

David

On Jun 21, 2:16 am, Daniel DeLorme [email protected] wrote:

That doesn’t really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

At face value it is a surprising result, but upon a bit of
consideration it is not illogical or faulty. The scan pattern first
finds, with greedy matching, the “hello” string. As you’ve said, after
that there is nothing left to match. But the pattern “/.*/” is also a
valid match on nothing, as * means zero or more occurrences. For example:

irb(main):018:0> "hello".scan(/.*/)
=> ["hello", ""]
irb(main):019:0> "".scan(/.*/)
=> [""]

Compare this to using ‘+’, which specifies there must be one or more
occurrences:

irb(main):020:0> “hello”.scan(/.+/)
=> [“hello”]
irb(main):021:0> “”.scan(/.+/)
=> []

That’s also why anchoring to the start of the string removes the
behaviour while anchoring to the end does not.

On 6/21/07, Stephen B. [email protected] wrote:
[…]

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

^ anchors the match to beginning of a line or the beginning of the
string. The second match fails because it’s starting from the first
point after “hello”, where it left off. It says nothing about the
content that follows.

“”.scan /^.*/ => [“”]
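
One way to see this, as a sketch: Regexp#match accepts a start position, which roughly imitates what scan does on its second attempt. Position 5 is just past the final "o" of "hello". The variable names are illustrative.

```ruby
s = "hello"

# Second attempt of /^.*/: position 5 is not the beginning of a line,
# so ^ cannot match there and there is nowhere further to look.
start_anchored = /^.*/.match(s, 5)

# Second attempt of /.*$/: zero characters followed by end-of-line
# matches fine at position 5.
end_anchored = /.*$/.match(s, 5)

# On an empty string, position 0 is a line start, so ^ is satisfied.
empty_input = "".scan(/^.*/)

p start_anchored   # nil
p end_anchored[0]  # ""
p empty_input      # [""]
```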

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it’s parsed as “zero or more of anything
before the end of the string”.

This is correct. First it finds the longest match it can in “hello”.
Then it finds nothing, but still anchored at the end of the line. Note
that $ does not anchor the end of the string, but the end of each line
within the string or the very end. \z matches the actual end of
string, while \A does the same for the beginning.
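
A small sketch of the difference between the line anchors and the string anchors, using a two-line string chosen purely for illustration:

```ruby
text = "one\ntwo"

# ^ and $ work per line; \A and \z only at the ends of the whole string.
line_starts  = text.scan(/^\w+/)
string_start = text.scan(/\A\w+/)
line_ends    = text.scan(/\w+$/)
string_end   = text.scan(/\w+\z/)

p line_starts   # ["one", "two"]
p string_start  # ["one"]
p line_ends     # ["one", "two"]
p string_end    # ["two"]
```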

Hope this helps.

On 2007-06-21 23:12:32 +0900 (Thu, Jun), Rob B. wrote:

“”, … ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever? It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

I would say the condition is checked at the right time; it’s just that
the condition is different: it allows checking for an empty-string match
at the end of the just-matched string, but it does not allow matching an
empty string right after an empty string.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches ‘zero or more characters, preferring the shortest
match’. One could ask - where have the actual characters gone?
Note that it’s not an infinite loop of empty strings.
After matching ‘nothing’, the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.
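
A sketch that makes the skipped characters visible, using gsub with a block instead of scan. The assumption here is that gsub steps through the string the same way scan does, which is consistent with the results shown in this thread. Each match is wrapped in brackets; characters that were stepped over are copied through untouched.

```ruby
s = "hello"

# Lazy .*? is satisfied with zero characters at every position, so each
# match is empty and the real characters are skipped, not matched.
lazy = s.gsub(/.*?/) { |m| "[#{m}]" }

# Greedy .* takes "hello" first, then matches empty once at the end.
greedy = s.gsub(/.*/) { |m| "[#{m}]" }

p lazy    # "[]h[]e[]l[]l[]o[]"
p greedy  # "[hello][]"
```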

This behaviour may be considered weird, or buggy, and the results are
probably not what was expected.

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here ‘nothing’ matches many times, and definitely this is the expected
behaviour.

On 21.06.2007 16:12, Rob B. wrote:

… ] since I (or rather the OP) could continue to match zero characters
(bytes) at the end of the string forever? It does seem that it might be
that a termination condition is checked a bit later than it should be in
this case.

As far as I remember it works like this: first .* matches the whole
sequence. Then the “cursor” is placed behind the match, i.e. after the
last char of the match and matching starts again. At this place the
empty sequence matches because we’re at the end of the match. After
that match the cursor is advanced one step (to avoid endless
repetitions) and - alas! - we’re at the end of the string and matching
stops.
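
The cursor logic described above can be sketched in plain Ruby. This is only an approximation written for illustration, not Ruby's actual implementation; scan_sketch is a hypothetical helper, and it returns whole matches only (it ignores capture groups).

```ruby
# Hypothetical re-implementation of the stepping that String#scan appears
# to perform, built on Regexp#match with a start position.
def scan_sketch(str, re)
  results = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)
    break unless m
    results << m[0]
    if m.end(0) == m.begin(0)
      # Empty match: step one character forward to avoid looping forever.
      pos = m.end(0) + 1
    else
      # Non-empty match: continue right behind it.
      pos = m.end(0)
    end
  end
  results
end

p scan_sketch("hello", /.*/)   # ["hello", ""]
p scan_sketch("hello", /.*?/)  # ["", "", "", "", "", ""]
p scan_sketch("hello", /.?/)   # ["h", "e", "l", "l", "o", ""]
```

For the patterns in this thread the sketch agrees with the built-in scan, which supports the explanation that the empty match at the end is tried exactly once before the cursor runs off the string.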

For learning regular expressions this is a great program: it lets you
step through the matching process graphically:
http://weitz.de/regex-coach/

See also this thread:
http://groups.google.de/group/comp.lang.ruby/browse_frm/thread/9bf7989dd42374f7/f759612390ff905f?lnk=st&q=&rnum=10#f759612390ff905f

Btw, for replacing the whole string this is much better:

irb(main):001:0> s = “foo”
=> “foo”
irb(main):002:0> s.object_id
=> 1073540760
irb(main):003:0> s.replace “bar”
=> “bar”
irb(main):004:0> s.object_id
=> 1073540760
irb(main):005:0> s
=> “bar”
irb(main):006:0>

Kind regards

robert

On 22.06.2007 14:15, [email protected] wrote:

then a match is guaranteed, because if there’s nothing to match, then
=> ["", “”, “”, “”, “”, “”]

I’ve always assumed, and used, .* to match everything before,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd…

“.*” has its use but it’s generally overrated, i.e. more often used than
needed / wanted. If you show a more concrete example of what you are
doing we might be able to come up with better suggestions. If you are
really interested to dive into the matter then I suggest “Mastering
Regular Expressions” which is an excellent book for the money.

Kind regards

robert

It’s because the pattern /.*/ matches everything, including the
absence of everything.

So: since * matches “zero or more” characters when it starts the
search for .* it matches the absence (the ‘zero’) and then matches the
string (the ‘or more’).

It’s the other way around, though; it matches “hello” first, and
then “”. So the zero-matching (which I admit I’m among those who find
unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there’s nothing to match, then
you’ll just match nothing?

Like this:

irb(main):001:0> “hello”.scan(/h*/)
=> [“h”, “”, “”, “”, “”, “”]

And this:

irb(main):002:0> “hello”.scan(/P*/)
=> ["", “”, “”, “”, “”, “”]
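
To see where each of those matches sits in the string, the block form of scan can record the offsets. This is a sketch only; match_offsets is a hypothetical helper name.

```ruby
# Collect [start, end] offsets of every match that scan reports.
def match_offsets(str, re)
  offsets = []
  str.scan(re) { offsets << Regexp.last_match.offset(0) }
  offsets
end

# "h" is matched at 0...1; every later position yields an empty match.
p match_offsets("hello", /h*/)
# => [[0, 1], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]

# "P" never occurs, so there is one empty match per position, including
# the position just past the last character.
p match_offsets("hello", /P*/)
# => [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
```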

I’ve always assumed, and used, .* to match everything before,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd…

Thanks everyone for your help.

On Jun 22, 2007, at 6:55 AM, Mariusz Pękala wrote:

“”]
string after empty string.
increased, skipping one character, to prevent infinite loop of
matching
nothing again.

This behaviour may be considered weird, or buggy, and the results
are probably not what was expected.

A great example which I do consider to be buggy. The similar
example from perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[][h][][e][][l][][l][][o][]

It matches the empty string at the beginning, between each character,
and at the end, but it does consume the actual characters of the
string. Even if not what one would anticipate, it’s not too hard to
justify the result. (Something that can’t be said for ruby’s
[“”,“”,“”,“”,“”,“”].)

The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o][]

$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello][]

Both succeed in a zero-character match at the end. These are
equivalent in ruby (1.8.5):

$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]

$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]

I thought I’d see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:

irb> require 'oniguruma'
=> true
irb> reluctant = Oniguruma::ORegexp.new('.*?')
=> /.*?/
irb> greedy = Oniguruma::ORegexp.new('.*')
=> /.*/
irb> greedyq = Oniguruma::ORegexp.new('.?')
=> /.?/
irb> reluctant.scan("hello")
=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:
0x10b9a68>, #<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>,
#<MatchData:0x10b99f0>]
irb> reluctant.scan("hello").map{|md|md[0]}
=> ["", "", "", "", "", ""]
irb> greedy.scan("hello").map{|md|md[0]}
=> ["hello", ""]
irb> greedyq.scan("hello").map{|md|md[0]}
=> ["h", "e", "l", "l", "o", ""]

OK, the same result as the ruby Regexp, including the fact that .*?
produces [“”]*6: those are the “before each character and at the end”
locations of the zero-length matches from perl, but the individual
single-byte matches are missing.

I presume that there’s some justification for these behaviors, but I
can’t figure out what it might be.

-Rob

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here ‘nothing’ matches many times, and definitely this is the
expected behaviour.

I agree that those results are exactly what I’d expect.


No virus found in this outgoing message.
Checked by ‘grep -i virus $MESSAGE’
Trust me.

Rob B. http://agileconsultingllc.com
[email protected]