Forum: Ruby Do You Understand Regular Expressions?

unknown (Guest)
on 2007-06-21 02:20
(Received via mailing list)
Hi all.

I'm pretty new to Ruby and that sort of thing, and I'm having a few
problems understanding regular expressions. I'm hoping one of you can
point me in the right direction.

I want to replace an entire string with another string. I know you
don't need regular expressions for that, but it's part of a more
generic approach. Anyway, the problem I'm having is that my regular
expressions are finding two matches instead of one, and I don't
understand why. I've narrowed down my confusion to the following code,
which shows some output from irb:

irb(main):001:0> "hello".scan(/.*/)
=> ["hello", ""]

I was expecting one match, not two, because .* matches everything,
right? Can someone explain why an empty string is also matched?

The same thing can be seen when substituting - this is closer to how
I'm using regular expressions in my code:

irb(main):001:0> "hello".gsub(/.*/, "P")
=> "PP"

Two substitutions are made and I expected one. So am I right or wrong
to expect one substitution?

Please help - this is driving me nuts!

And in case it helps...

$ ruby --version
ruby 1.8.5 (2006-08-25) [i486-linux]


Thanks in advance.
Tim H. (Guest)
on 2007-06-21 02:31
(Received via mailing list)
removed_email_address@domain.invalid wrote:
> irb(main):001:0> "hello".scan(/.*/)
> => ["hello", ""]
>
> I was expecting one match, not two, because .* matches everything,
> right? Can someone explain why an empty string is also matched?
>
Try anchoring the match: /^.*/
Axel E. (Guest)
on 2007-06-21 02:50
(Received via mailing list)
> irb(main):001:0> "hello".scan(/.*/)
> => ["hello", ""]
>
> I was expecting one match, not two, because .* matches everything,
> right? Can someone explain why an empty string is also matched?

String.scan searches for all occurrences of (any number of any
character) here. So zero occurrences is one match.

You can search for at least one occurrence like this:

"hello".scan(/.+/)

"hello".gsub(/.+/, "P") => 'P'

As an introduction, I find

http://www.regular-expressions.info/ruby.html

quite instructive for the use of regexps in Ruby.

Best regards,

Axel
Daniel DeLorme (Guest)
on 2007-06-21 03:17
(Received via mailing list)
Axel E. wrote:
>> irb(main):001:0> "hello".scan(/.*/)
>> => ["hello", ""]
>>
>> I was expecting one match, not two, because .* matches everything,
>> right? Can someone explain why an empty string is also matched?
>
> String.scan searches for all occurrences of (any number of any
> character) here. So zero occurrences is one match.

That doesn't really explain why the regexp finds an extra empty string.
I know that zero occurrences is one match but after a greedy match that
matches everything, there should be (logically?) no other match. I am no
stranger to regexps and the result is counter-intuitive to me; I would
consider it a bug. Or at least a very very peculiar behavior.

Daniel
List R. (Guest)
on 2007-06-21 03:55
(Received via mailing list)
"hello".scan(/..*/)
=> ["hello"]
Ryan M. (Guest)
on 2007-06-21 04:49
Daniel DeLorme wrote:
> Axel E. wrote:
>>> irb(main):001:0> "hello".scan(/.*/)
>>> => ["hello", ""]
>>>
>>> I was expecting one match, not two, because .* matches everything,
>>> right? Can someone explain why an empty string is also matched?
>>
>> String.scan searches for all occurrences of (any number of any
>> character) here. So zero occurrences is one match.
>
> That doesn't really explain why the regexp finds an extra empty string.
> I know that zero occurrences is one match but after a greedy match that
> matches everything, there should be (logically?) no other match. I am no
> stranger to regexps and the result is counter-intuitive to me; I would
> consider it a bug. Or at least a very very peculiar behavior.
>
> Daniel

I agree. Can someone explain why gsub, sub or scan matches with * are
different than =~ matches with *

puts "hello".gsub(/[aeiou]/, '<\1>')  # h<>ll<>
puts "hello".gsub(/.*/, '<\1>')       # <><>
print "before: #{$`}\n"               # before: hello
print "match:  #{$&}\n"               # match:
print "after:  #{$'}\n"               # after:

puts "hello" =~ (/.*/)                # 0
print "before: #{$`}\n"               # before:
print "match:  #{$&}\n"               # match:  hello
print "after:  #{$'}\n"               # after:


thanks!
Wild Karl-Heinz (Guest)
on 2007-06-21 12:44
(Received via mailing list)
Hello Ryan

In message "Do You Understand Regular Expressions?"
   on 21.06.2007, Ryan M. <removed_email_address@domain.invalid> writes:

RM> I agree. Can someone explain why gsub, sub or scan matches with * are
RM> different than =~ matches with *

RM> puts "hello".gsub(/[aeiou]/, '<\1>')  # h<>ll<>

irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )

Please note the () around the expression.
After that you can refer with \\1 to the found
letters.


RM> puts "hello".gsub(/.*/, '<\1>')       # <><>

irb(main):029:0> "hello".gsub(/(.*)/, '<\1>')
=> "<hello><>"
irb(main):030:0> "hello".gsub(/(.+)/, '<\1>')
=> "<hello>"

RM> print "before: #{$`}\n"               # before: hello

irb(main):031:0> $`
=> ""

RM> print "match:  #{$&}\n"               # match:

irb(main):032:0> $&
=> "hello"

RM> print "after:  #{$'}\n"               # after:

irb(main):033:0> $'
=> ""


hope this helps.

regards.
Karl-Heinz
Stephen B. (Guest)
on 2007-06-21 17:48
(Received via mailing list)
On 6/20/07, Daniel DeLorme <removed_email_address@domain.invalid> wrote:
> That doesn't really explain why the regexp finds an extra empty string.
> I know that zero occurrences is one match but after a greedy match that
> matches everything, there should be (logically?) no other match. I am no
> stranger to regexps and the result is counter-intuitive to me; I would
> consider it a bug. Or at least a very very peculiar behavior.
>
> Daniel
>

It's because the pattern /.*/ matches everything, including the
absence of everything. Yes, with the proper regexs you can indeed have
tea and no tea at the same time. Certainly peculiar, but occasionally
useful.

So: since * matches "zero or more" characters when it starts the
search for .* it matches the absence (the 'zero') and then matches the
string (the 'or more').

To prevent this you need to indicate to your regular expression that
you only want the subset of 'everything' that is actually something.
Here are a couple ways to do this:

/.+/ will match 1 or more of something, so doesn't return the absence

/^.*/ will start the search at the start of the pattern, in a way
bypassing the match of zero (the pattern /^.*$/ makes this more
clear).

/..*/ will match everything after something, this is a modified form
of the above that isn't tied to the start of the string
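
To compare these alternatives side by side, here is a small sketch. The `patterns` hash and its labels are illustrative names only, not something from the thread:

```ruby
# Compare the patterns discussed above on a non-empty and an empty string.
patterns = {
  ".*"  => /.*/,   # zero or more: also matches the empty string at the end
  ".+"  => /.+/,   # one or more: never matches an empty string
  "^.*" => /^.*/,  # anchored to the start of a line
  "..*" => /..*/   # one character followed by zero or more
}

results = patterns.map do |label, re|
  [label, "hello".scan(re), "".scan(re)]
end

results.each do |label, on_hello, on_empty|
  puts "#{label.ljust(4)} hello: #{on_hello.inspect}  empty: #{on_empty.inspect}"
end
```

Note that /^.*/ still matches an empty input string, whereas /.+/ and /..*/ do not. If only the first match matters, `"hello".sub(/.*/, "P")` performs a single substitution and sidesteps the question.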

-- Stephen
Rob B. (Guest)
on 2007-06-21 18:14
(Received via mailing list)
On Jun 21, 2007, at 9:47 AM, Stephen B. wrote:

>>
>> Daniel
>
> It's because the pattern /.*/ matches everything, including the
> absence of everything. Yes, with the proper regexs you can indeed have
> tea and no tea at the same time. Certainly peculiar, but occasionally
> useful.
> ...
> -- Stephen

That still doesn't really explain why "hello".scan(/.*/) => ["hello", ""]

Why wouldn't it be ["hello", "", "", "", "", "", "", "", "", "", "",
"", ... ] since I (or rather the OP) could continue to match zero
characters (bytes) at the end of the string forever?  It does seem
that it might be that a termination condition is checked a bit later
than it should be in this case.

-Rob

Rob B.    http://agileconsultingllc.com
removed_email_address@domain.invalid
unknown (Guest)
on 2007-06-21 18:27
(Received via mailing list)
Hi --

On Thu, 21 Jun 2007, Stephen B. wrote:

> It's because the pattern /.*/ matches everything, including the
> absence of everything. Yes, with the proper regexs you can indeed have
> tea and no tea at the same time. Certainly peculiar, but occasionally
> useful.
>
> So: since * matches "zero or more" characters when it starts the
> search for .* it matches the absence (the 'zero') and then matches the
> string (the 'or more').

It's the other way around, though; it matches "hello" *first*, and
then "".  So the zero-matching (which I admit I'm among those who find
unexpected) is happening at the end.

> To prevent this you need to indicate to your regular expression that
> you only want the subset of 'everything' that is actually something.
> Here are a couple ways to do this:
>
> /.+/ will match 1 or more of something, so doesn't return the absence
>
> /^.*/ will start the search at the start of the pattern, in a way
> bypassing the match of zero (the pattern /^.*$/ makes this more
> clear).

Here, again, "hello" is first, so /^.*/ matches it but doesn't match
the second time ("") because the "" isn't anchored to ^.
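
The second attempt can be reproduced by hand. This is only a sketch: it assumes a Ruby version where `String#match` accepts a start position (1.9 or later, so not the 1.8.5 from the original question), and the variable names are illustrative:

```ruby
str = "hello"

first = str.match(/.*/)      # greedy match starting at position 0
second_pos = first.end(0)    # 5, just past the final "o"

# Unanchored: an empty match is still possible at position 5.
unanchored = str.match(/.*/, second_pos)

# Anchored: ^ needs the start of a line, and position 5 is not one.
anchored = str.match(/^.*/, second_pos)

p first[0]       # "hello"
p unanchored[0]  # ""
p anchored       # nil
```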


David
Brian A. (Guest)
on 2007-06-21 18:46
(Received via mailing list)
On Jun 21, 4:43 am, Wild Karl-Heinz <removed_email_address@domain.invalid> wrote:
> irb(main):024:0> "hello".gsub( /([aeiou])/, "<\\1>" )
>
> Please note the () around the expression.
> After that you can refer with \\1 to the found
> letters.

Why not simply change the 1 to a 0 ?

irb(main):001:0> puts "hello".gsub(/[aeiou]/, '<\0>')
h<e>ll<o>
Stephen B. (Guest)
on 2007-06-21 20:28
(Received via mailing list)
On 6/21/07, removed_email_address@domain.invalid <removed_email_address@domain.invalid> wrote:
[snip]
> > So: since * matches "zero or more" characters when it starts the
> > search for .* it matches the absence (the 'zero') and then matches the
> > string (the 'or more').
>
> It's the other way around, though; it matches "hello" *first*, and
> then "".  So the zero-matching (which I admit I'm among those who find
> unexpected) is happening at the end.
>

Ah, but notice:

"hello".scan(/.*$/)
=> ["hello", ""]

"hello".scan(/^.*/)
=> ["hello"]

Strange indeed, but it seems that's how it's working. Although I
suspect I'm not fully grasping the subtleties introduced by the *
character.

Hmm, the more I think on it I think I have an answer:

The /^.*/ pattern specifies that the string must start with anything
(e.g. it must have at least one character) and then zero or more
characters following.

The /.*$/ pattern has no restriction since the anchor is on the side
with the * character. So it's parsed as "zero or more of anything
before the end of the string".

So, if that's correct, you are right that the absence is matched last.
Verified by the fact that the absence follows the string in the
pattern match.

-- Stephen
unknown (Guest)
on 2007-06-21 20:35
(Received via mailing list)
Hi --

On Fri, 22 Jun 2007, Stephen B. wrote:

>
> character.
>
> So, if that's correct, you are right that the absence is matched last.
> Verified by the fact that the absence follows the string in the
> pattern match.

Yes, that was what I was mostly going by :-)


David
ajalkane (Guest)
on 2007-06-21 22:22
(Received via mailing list)
On Jun 21, 2:16 am, Daniel DeLorme <removed_email_address@domain.invalid> wrote:
> That doesn't really explain why the regexp finds an extra empty string.
> I know that zero occurrences is one match but after a greedy match that
> matches everything, there should be (logically?) no other match. I am no
> stranger to regexps and the result is counter-intuitive to me; I would
> consider it a bug. Or at least a very very peculiar behavior.

At face value it is a surprising result, but on a bit of consideration
it is not illogical or faulty. With greedy matching, the scan first
finds the "hello" string. As you've said, after that there is nothing
left to match. But the pattern /.*/ is also a valid match on nothing,
since * means zero or more occurrences. For example:

irb(main):018:0> "hello".scan(/.*/)
=> ["hello", ""]
irb(main):019:0> "".scan(/.*/)
=> [""]

Compare this to using '+', which specifies there must be one or more
occurrences:

irb(main):020:0> "hello".scan(/.+/)
=> ["hello"]
irb(main):021:0> "".scan(/.+/)
=> []

That's also why anchoring to the start of the string removes the
behaviour, while anchoring to the end does not.
Sami Samhuri (Guest)
on 2007-06-21 22:23
(Received via mailing list)
On 6/21/07, Stephen B. <removed_email_address@domain.invalid> wrote:
[...]
> The /^.*/ pattern specifies that the string must start with anything
> (e.g. it must have at least one character) and then zero or more
> characters following.

^ anchors the match to beginning of a line or the beginning of the
string. The second match fails because it's starting from the first
point after "hello", where it left off. It says nothing about the
content that follows.

"".scan /^.*/ => [""]

> The /.*$/ pattern has no restriction since the anchor is on the side
> with the * character. So it's parsed as "zero or more of anything
> before the end of the string".

This is correct. First it finds the longest match it can in "hello".
Then it finds nothing, but still anchored at the end of the line. Note
that $ does not anchor the end of the string, but the end of each line
within the string or the very end. \z matches the actual end of
string, while \A does the same for the beginning.
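
A short sketch of the difference between the line anchors and the string anchors. The sample strings and variable names are made up for illustration:

```ruby
text = "one\ntwo"

line_anchored   = text.scan(/^.*$/)    # ^ and $ apply to each line
string_anchored = text.scan(/\A.*\z/)  # \A and \z apply to the whole string

p line_anchored    # ["one", "two"]
p string_anchored  # []  (. does not match "\n", so no single run spans the string)

# $ also matches just before a newline; \z only matches at the very end.
dollar_pos = ("hello\n" =~ /hello$/)
z_pos      = ("hello\n" =~ /hello\z/)
p dollar_pos  # 0
p z_pos       # nil
```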

Hope this helps.
Mariusz Pękala (Guest)
on 2007-06-22 14:56
(Received via mailing list)
On 2007-06-21 23:12:32 +0900 (Thu, Jun), Rob B. wrote:
> "", ... ] since I (or rather the OP) could continue to match zero
> characters (bytes) at the end of the string forever?  It does seem
> that it might be that a termination condition is checked a bit later
> than it should be in this case.

I would say the condition is checked at the right time; it's just that
the condition is different: it allows checking for an empty-string match
at the end of the just-matched string, but it does not allow checking
for an empty string after an empty string.

The interesting behaviour is:

irb(main):035:0> "hello".scan /.*?/
=> ["", "", "", "", "", ""]

The /.*?/ matches 'zero or more characters, preferring the shortest
match'. One could ask - where have the actual characters gone?
Note that it's not an infinite loop of empty strings.
After matching 'nothing', the start-position for next match is
increased, skipping one character, to prevent infinite loop of matching
nothing again.
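
One way to see the start position moving is to record where each empty match begins and ends. A minimal sketch, using the block form of scan (which sets $~ for each match); `spans` is just an illustrative name:

```ruby
spans = []
"hello".scan(/.*?/) { spans << [$~.begin(0), $~.end(0)] }

# Each match is empty (begin == end), and each starts one character
# further along than the previous one.
p spans  # [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
```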

*This* behaviour may be considered weird, or buggy, and the results are
probably not what was expected.

But look at:

irb(main):038:0> "hello".scan /h(.*)e/
=> [[""]]
irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
=> [["", "", "", "", ""]]

Here 'nothing' matches many times, and definitely this *is* the expected
behaviour.
unknown (Guest)
on 2007-06-22 16:21
(Received via mailing list)
> > It's because the pattern /.*/ matches everything, including the
> > absence of everything.
>
> > So: since * matches "zero or more" characters when it starts the
> > search for .* it matches the absence (the 'zero') and then matches the
> > string (the 'or more').
>
> It's the other way around, though; it matches "hello" *first*, and
> then "".  So the zero-matching (which I admit I'm among those who find
> unexpected) is happening at the end.

Oh right, I think I get it now. If you try to match anything with *
then a match is guaranteed, because if there's nothing to match, then
you'll just match nothing?

Like this:

irb(main):001:0> "hello".scan(/h*/)
=> ["h", "", "", "", "", ""]

And this:

irb(main):002:0> "hello".scan(/P*/)
=> ["", "", "", "", "", ""]


I've always assumed, and used, .* to match everything before,
but I suppose .+ does make more sense. Although I have to say
I still find it a bit odd...

Thanks everyone for your help.
Robert K. (Guest)
on 2007-06-22 17:06
(Received via mailing list)
On 21.06.2007 16:12, Rob B. wrote:
>>
> ... ] since I (or rather the OP) could continue to match zero characters
> (bytes) at the end of the string forever?  It does seem that it might be
> that a termination condition is checked a bit later than it should be in
> this case.

As far as I remember it works like this: first .* matches the whole
sequence.  Then the "cursor" is placed behind the match, i.e. after the
last char of the match and matching starts again.  At this place the
empty sequence matches because we're at the end of the match.  After
that match the cursor is advanced one step (to avoid endless
repetitions) and - alas! - we're at the end of the string and matching
stops.
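
The cursor description above can be sketched as a hand-written loop. This is a simplified model, not Ruby's actual implementation: `manual_scan` is a hypothetical helper, it ignores capture groups, and it assumes a Ruby where `Regexp#match` accepts a start position (1.9 or later):

```ruby
# Simplified model of scan: match, move the cursor behind the match,
# and step one extra character after an empty match to avoid looping forever.
def manual_scan(str, re)
  results = []
  pos = 0
  while pos <= str.length
    m = re.match(str, pos)
    break unless m
    results << m[0]
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  results
end

p manual_scan("hello", /.*/)   # ["hello", ""]
p manual_scan("hello", /.*?/)  # ["", "", "", "", "", ""]
p manual_scan("hello", /h*/)   # ["h", "", "", "", "", ""]
```

For these inputs the model reproduces the scan results quoted in this thread, including the single trailing empty match.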

For learning regular expressions this is a great program: it lets you
step through the matching process graphically:
http://weitz.de/regex-coach/

See also this thread:
http://groups.google.de/group/comp.lang.ruby/brows...

Btw, for replacing the whole string this is much better:

irb(main):001:0> s = "foo"
=> "foo"
irb(main):002:0> s.object_id
=> 1073540760
irb(main):003:0> s.replace "bar"
=> "bar"
irb(main):004:0> s.object_id
=> 1073540760
irb(main):005:0> s
=> "bar"
irb(main):006:0>

Kind regards

  robert
Robert K. (Guest)
on 2007-06-22 17:11
(Received via mailing list)
On 22.06.2007 14:15, removed_email_address@domain.invalid wrote:
> then a match is guaranteed, because if there's nothing to match, then
> => ["", "", "", "", "", ""]
>
>
> I've always assumed, and used, .* to make everything before,
> but I suppose .+ does make more sense. Although I have to say
> I still find it a bit odd...

".*" has its use but it's generally overrated, i.e. more often used than
needed / wanted.  If you show a more concrete example of what you are
doing we might be able to come up with better suggestions.  If you are
really interested to dive into the matter then I suggest "Mastering
Regular Expressions" which is an excellent book for the money.

Kind regards

  robert
Rob B. (Guest)
on 2007-06-22 17:56
(Received via mailing list)
On Jun 22, 2007, at 6:55 AM, Mariusz Pękala wrote:
>> ""]
> string after empty string.
> increased, skipping one character, to prevent infinite loop of matching
> nothing again.
>
> *This* behaviour may be considered weird, or buggy, and probably results
> are not what was expected.

A great example which I *do* consider to be buggy.  The similar
example from perl is something like:
$ perl -e '$h = "hello"; $h =~ s/.*?/[$&]/g; print "$h\n";'
[][h][][e][][l][][l][][o][]

It matches the empty string at the beginning, between each character,
and at the end, but it does consume the actual characters of the
string.  Even if not what one would anticipate, it's not too hard to
justify the result. (Something that can't be said for ruby's
["","","","","",""].)

The other versions from perl are enlightening:
$ perl -e '$h = "hello"; $h =~ s/.?/[$&]/g; print "$h\n";'
[h][e][l][l][o][]

$ perl -e '$h = "hello"; $h =~ s/.*/[$&]/g; print "$h\n";'
[hello][]

Both succeed in a zero-character match at the end.  These are
equivalent in ruby (1.8.5):

$ ruby -e 'puts "hello".scan(/.?/).inspect'
["h", "e", "l", "l", "o", ""]

$ ruby -e 'puts "hello".scan(/.*/).inspect'
["hello", ""]

I thought I'd see what Oniguruma (5.8.0; with 1.1.0 gem) had to say:

irb> require 'oniguruma'
=> true
irb> reluctant = Oniguruma::ORegexp.new('.*?')
=> /.*?/
irb> greedy = Oniguruma::ORegexp.new('.*')
=> /.*/
irb> greedyq = Oniguruma::ORegexp.new('.?')
=> /.?/
irb> reluctant.scan("hello")
=> [#<MatchData:0x10b9aa4>, #<MatchData:0x10b9a7c>, #<MatchData:
0x10b9a68>, #<MatchData:0x10b9a40>, #<MatchData:0x10b9a18>,
#<MatchData:0x10b99f0>]
irb> reluctant.scan("hello").map{|md|md[0]}
=> ["", "", "", "", "", ""]
irb> greedy.scan("hello").map{|md|md[0]}
=> ["hello", ""]
irb> greedyq.scan("hello").map{|md|md[0]}
=> ["h", "e", "l", "l", "o", ""]

OK, the same result as the ruby Regexp, including that .*? produces
[""]*6, which corresponds to the "before each character and at the end"
locations of the zero-length matches from perl, but the individual
single-character matches are missing.

I presume that there's some justification for these behaviors, but I
can't figure out what it might be.
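
One way to see where the characters go on the Ruby side is to run the same substitutions as the perl one-liners, using gsub with a block. A sketch only; the variable names are illustrative:

```ruby
# Ruby counterparts of the perl s/.../[$&]/g substitutions above.
bracketed_lazy   = "hello".gsub(/.*?/) { "[#{$&}]" }
bracketed_opt    = "hello".gsub(/.?/)  { "[#{$&}]" }
bracketed_greedy = "hello".gsub(/.*/)  { "[#{$&}]" }

puts bracketed_lazy    # []h[]e[]l[]l[]o[]
puts bracketed_opt     # [h][e][l][l][o][]
puts bracketed_greedy  # [hello][]
```

With .*? the characters are not lost: they are copied through unmatched between the empty matches, rather than being matched on a retry at the same position as in the perl output.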

-Rob

> But look at:
>
> irb(main):038:0> "hello".scan /h(.*)e/
> => [[""]]
> irb(main):039:0> "hello".scan /h(.*)(.*)(.*)(.*)(.*)e/
> => [["", "", "", "", ""]]
>
> Here 'nothing' matches many times, and definitely this *is* the expected
> behaviour.

I agree that those results are exactly what I'd expect.

> --
> No virus found in this outgoing message.
> Checked by 'grep -i virus $MESSAGE'
> Trust me.

Rob B.    http://agileconsultingllc.com
removed_email_address@domain.invalid