Regexp gotcha

pistos · March 28, 2006, 5:23pm

Hi, all. I was fixing a bug last night, and discovered some
“gotcha”-like behaviour in the process. Consider:

irb(main):173:0> s = "my string"
=> "my string"
irb(main):174:0> r1 = /my/
=> /my/
irb(main):175:0> r2 = /your/
=> /your/
irb(main):176:0> r3 = nil
=> nil
irb(main):177:0> s =~ r1
=> 0
irb(main):178:0> s =~ r2
=> nil
irb(main):179:0> s =~ r3
=> false

s =~ r1 … That’s cool, it gives me the index of the match.
s =~ r2 … That’s cool, it tells me there was no match.
s =~ r3 … Whoa.

The reason this “got me” is that I had this code:

match_result = ( some_string =~ some_regexp )
if match_result != nil
  # Assume there was a match
end

But the problem is… I had an s =~ r3 case because some_regexp was nil,
and so it was entering my if block when I semantically did not want that
to occur.

So now my code is

if match_result != nil and match_result != false
…
end
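
A minimal sketch of the same idea wrapped in a helper, assuming a hypothetical method name (match_index) that is not part of the original code. It treats any falsy result as "no match", so it does not matter whether =~ hands back nil or false for a non-Regexp right-hand side (which of the two you get for a nil pattern differs between Ruby versions):

```ruby
# Hypothetical helper: returns the match index, or nil for any kind of non-match.
def match_index(str, pattern)
  result = (str =~ pattern)
  # `|| nil` maps false to nil and leaves an Integer index alone,
  # because 0 is truthy in Ruby.
  result || nil
end

p match_index("my string", /my/)    # => 0
p match_index("my string", /your/)  # => nil
p match_index("my string", nil)     # => nil
```

With this, a caller can keep comparing against nil and still use the returned index.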

Note also that I can’t even use Regexp.last_match != nil in my if block:

irb(main):224:0> s =~ r2
=> nil
irb(main):225:0> Regexp.last_match
=> nil
irb(main):226:0> s =~ r3
=> false
irb(main):227:0> Regexp.last_match
=> nil
irb(main):228:0> s =~ r1
=> 0
irb(main):229:0> Regexp.last_match
=> #<MatchData:0x406c40b4>
irb(main):230:0> s =~ r3
=> false
irb(main):231:0> Regexp.last_match
=> #<MatchData:0x406c40b4>

To be clear: Note how a “nil type” of non-match overwrites last_match,
but a “false type” of non-match doesn’t.
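
One way to sidestep the stale last_match is to skip the global match state and ask for a MatchData object directly. This is only a sketch; the helper name safe_match is made up for illustration:

```ruby
# Hypothetical helper: returns a MatchData or nil, never leftover state
# from an earlier match.
def safe_match(str, pattern)
  # Anything that is not a Regexp counts as "no match" here.
  return nil unless pattern.is_a?(Regexp)
  pattern.match(str)
end

m = safe_match("my string", /my/)
puts m.begin(0) if m                # prints 0
p safe_match("my string", /your/)   # prints nil
p safe_match("my string", nil)      # prints nil
```

Because the result is a local value, a later non-match cannot leave an old MatchData lying around.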

So the question is… why are BOTH nil and false possible return values
of =~ ? Is there some benefit to this? Why not just one or the other?

I see that this behaviour is documented but I still feel that this
is unintuitive behaviour when people assume =~ only applies to Regexp
RHS’s.

Thanks in advance for any and all clarifications and explanations.

Pistos

pistos · March 28, 2006, 5:30pm

So now my code is

if match_result != nil and match_result != false
…
end

AFAIK, it’s an equivalent for simple

if match_result
…
end

because in Ruby only nil and false are “false”, whereas any other value
(including 0, ‘’ and []) is “true”.
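
A quick way to check this in irb or a script (the method name truthy? is just an illustrative helper, not a built-in):

```ruby
# Hypothetical helper: reports how Ruby treats a value in a condition.
def truthy?(value)
  value ? true : false
end

[nil, false, 0, "", []].each do |v|
  puts "#{v.inspect} is treated as #{truthy?(v)}"
end
```

Only the first two values come out as false; 0, the empty string and the empty array all pass an if test.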

Victor.

pistos · March 28, 2006, 5:33pm

Hi –

On Wed, 29 Mar 2006, Pistos C. wrote:

=> nil

So now my code is

if match_result != nil and match_result != false

You could shorten that to:

if match_result

I’m not sure about the “why” part of the nil/false thing.

David

–
David A. Black
Ruby Power and Light, LLC (http://www.rubypowerandlight.com)

“Ruby for Rails” chapters now available
from Manning Early Access Program! Ruby for Rails

pistos · March 28, 2006, 5:48pm

Victor S. wrote:

if match_result
…
end

because in Ruby only nil and false are “false”, where any other value
(including 0, ‘’ and []) are “true”.

Yep, thank you to you and David. I forgot that I could rewrite it like
that.

Kevin wrote:

You could just do this…
if string =~ /(\w)/
#do something with $1
end

Well, in this particular case, I am using the Fixnum returned, which is
why I am making the assignment. I normally otherwise do as you say,
using “if string =~ /regexp/”.

Pistos

pistos · March 28, 2006, 5:40pm

You could just do this…

if string =~ /(\w)/
#do something with $1
end

anytime you try to match a string to a non-regexp object you get a
false, I think.

_Kevin

pistos · March 28, 2006, 8:55pm

Pistos C. wrote:

Kevin wrote:

You could just do this…
if string =~ /(\w)/
#do something with $1
end

Well, in this particular case, I am using the Fixnum returned, which is
why I am making the assignment. I normally otherwise do as you say,
using “if string =~ /regexp/”.

Personally I prefer to use /rx/ =~ str over str =~ /rx/ - to me this
makes it clearer that the RX is the one that does the matching. Just
personal taste maybe but I think I also remember that that variant is a
tad faster.

Kind regards

robert

pistos · March 28, 2006, 10:54pm

Are you asking why we can write

if match_result

as equivalent to

if match_result != nil and match_result != false

?

James H.

pistos · March 29, 2006, 5:44pm

James H. wrote:

Are you asking why we can write

if match_result

as equivalent to

if match_result != nil and match_result != false

?

No, not at all. I’m just whining a bit that =~ can return both nil
and false. It’s not that big a deal, but this is something that could
catch unaware people who would follow down the same tracks as I did and
assume match_result != nil would cover all the bases, when it doesn’t.

Robert K. wrote:

Personally I prefer to use /rx/ =~ str over str =~ /rx/ - to me this
makes it clearer that the RX is the one that does the matching. Just
personal taste maybe but I think I also remember that that variant is a
tad faster.

I didn’t even realize this could be done (though I see now that it is
documented, =~ being a synonym for Regexp#match). If I lived in a
vacuum for the last 15 years, and Ruby was the first and only
programming language I ever learned, then I would have done it that way,
too, from the very start. Alas, I came from [other languages and
then] Perl, so it was just a carry over to continue using str =~
/regexp/.
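
A small caveat on the parenthetical above: Regexp#=~ and Regexp#match are closely related rather than interchangeable. A short sketch of the difference, using the same example string as earlier in the thread:

```ruby
s = "my string"

index = (/my/ =~ s)     # Integer offset of the match, or nil
data  = /my/.match(s)   # MatchData object, or nil

p index            # => 0
p data[0]          # => "my"
p data.begin(0)    # => 0
p /your/.match(s)  # => nil
```

So =~ is the one to use when only the position matters, and match when the captured text is needed.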

FWIW, we still have the same problem:

irb(main):255:0> r1 =~ s
=> 0
irb(main):256:0> r2 =~ s
=> nil
irb(main):257:0> r3 =~ s
=> false

I’ve taken note that you say that r =~ s is faster. I (or someone else)
will have to do some benchmarking to see whether that’s really true, and
how much speed gain can be had. Diakonos suffers when you use large and
many regexps for syntax highlighting, so I’d be interested in anything
that can speed that up.
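
For anyone who wants to try it, a rough benchmark sketch using the standard benchmark library could look like this. The string, the pattern and the iteration count are arbitrary assumptions, and the numbers will vary by machine and Ruby version, so treat it as a starting point rather than a verdict:

```ruby
require "benchmark"

s = "my string"
r = /string/
n = 200_000   # arbitrary iteration count, chosen only for illustration

Benchmark.bm(8) do |x|
  x.report("s =~ r") { n.times { s =~ r } }
  x.report("r =~ s") { n.times { r =~ s } }
end
```

Both forms return the same index, so any difference would come purely from method dispatch.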

Pistos

pistos · March 29, 2006, 6:29pm

On 3/29/06, Pistos C. wrote:

I’ve taken note that you say that r =~ s is faster. I (or someone else)
Pistos
I do not want to enter into the discussion of whether it should be like
that or not. The beauty of Ruby is that you can change a lot of things if
you do not like them. In your case, maybe you would like the following:

class String
  alias_method :__old_match, :=~
  def =~(obj)
    raise RuntimeError,
      "#{obj.nil? ? "nil" : obj.to_s} is not a Regexp but #{obj.class}" unless Regexp === obj
    __old_match obj
  end # def =~(obj)
end # class String

puts "x" =~ %r{.}
puts "x" =~ nil

Hope this is helpful.

Cheers
Robert

–
Deux choses sont infinies : l’univers et la bêtise humaine ; en ce qui
concerne l’univers, je n’en ai pas acquis la certitude absolue.

Albert Einstein

pistos · March 29, 2006, 6:40pm

Robert D. wrote:

class String
  alias_method :__old_match, :=~
  def =~(obj)
    raise RuntimeError,
      "#{obj.nil? ? "nil" : obj.to_s} is not a Regexp but #{obj.class}" unless Regexp === obj
    __old_match obj
  end # def =~(obj)
end # class String

Thanks for the suggestion, Robert.

While I have no qualms about extending core classes, this sort of
adjustment feels like too… “brash”? of a modification for me.

In this particular case, rewriting my if line is an acceptable solution
to the problem.

Pistos

pistos · March 29, 2006, 7:08pm

Hi –

On Thu, 30 Mar 2006, Pistos C. wrote:

Thanks for the suggestion, Robert.

While I have no qualms about extending core classes, this sort of
adjustment feels like too… “brash”? of a modification for me.

You should have some qualms. But in any case, this change is
certainly one I wouldn’t make. I have no way of knowing whether
someone has done this somewhere:

if str =~ re

and not tested for re being nil because it doesn’t affect the outcome
of the test.

David

pistos · March 29, 2006, 7:36pm

On Thu, 30 Mar 2006 02:24:41 +0900, Jacob F. wrote:

[snip]

However, trying string =~ another_string gives me an exception in irb:

$ irb

"looking for bob…" =~ "bob"
TypeError: type mismatch: String given
from (irb):1:in `=~'
from (irb):1

Is this a documentation bug?

Yes, I submitted a patch last week via the rubyforge bug form. Basically,
String =~ String was deprecated in 1.8.0 / 1.8.1 (generated obsolete
warning), and generates a TypeError since 1.8.2. I have no idea on the
status of the patch.
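
For code that really does want to look for a plain substring, a sketch of the usual alternatives (the variable name haystack is just for illustration):

```ruby
haystack = "looking for bob..."

begin
  haystack =~ "bob"   # String =~ String is rejected
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

# Two ways to get the position of a literal substring instead:
p haystack.index("bob")                            # => 12
p(haystack =~ Regexp.new(Regexp.escape("bob")))    # => 12
```

String#index is the direct replacement; Regexp.escape is only needed when the rest of the code expects a Regexp.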

andrew

pistos · March 29, 2006, 7:42pm

On 3/29/06, Andrew J. wrote:

On Thu, 30 Mar 2006 02:24:41 +0900, Jacob F. wrote:

Is this a documentation bug?

Yes, I submitted a patch last week via the rubyforge bug form. Basically,
String =~ String was deprecated in 1.8.0 / 1.8.1 (generated obsolete
warning), and generates a TypeError since 1.8.2. I have no idea on the
status of the patch.

Ok, thanks. Glad to know there’s already been a patch submitted.

Jacob F.

pistos · March 29, 2006, 7:26pm

On 3/29/06, David A. Black wrote:

You should have some qualms. But in any case, this change is
certainly one I wouldn’t make. I have no way of knowing whether
someone has done this somewhere:

if str =~ re

and not tested for re being nil because it doesn’t affect the outcome
of the test.

In fact, my initial guess[1] was that String#=~ is defined something
like:

class String
  def =~( other )
    other =~ self
  end
end

And of course the default Object#=~ is:

class Object
  def =~( other )
    false
  end
end

(observed by the fact that Object.new =~ Object.new returns false
without raising NoMethodError).

This is useful, because someone might come along and define #=~ for
their new class Foo which is close enough semantically to a Regexp to
merit the operator overload, but is not a Regexp. ri tells me that
REXML::Light::Node#=~ exists, for example. Doing this would cause that
to break.
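
To make that concrete, here is a sketch of a Regexp-like class. PrefixMatcher is a made-up name for illustration and has nothing to do with REXML:

```ruby
# Hypothetical Regexp-like class, used only for illustration.
class PrefixMatcher
  def initialize(prefix)
    @prefix = prefix
  end

  # Mimic Regexp#=~: index of the match (always 0 for a prefix) or nil.
  def =~(str)
    str.start_with?(@prefix) ? 0 : nil
  end
end

matcher = PrefixMatcher.new("hello")
p("hello world" =~ matcher)   # => 0
p("goodbye" =~ matcher)       # => nil
```

String#=~ hands the work over to the matcher's own =~, which is exactly the path that a "Regexp only" check on String#=~ would cut off.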

– Jacob F.

[1] The actual behavior of String#=~ is documented (from ri):

 If _obj_ is a +Regexp+, use it as a pattern to match
 against _str_. If _obj_ is a +String+, look for it in _str_
 (similar to +String#index+). Returns the position the match starts,
 or +nil+ if there is no match. Otherwise, invokes _obj.=~_, passing
 _str_ as an argument. The default +=~+ in +Object+ returns +false+.

However, trying string =~ another_string gives me an exception in irb:

$ irb

"looking for bob…" =~ "bob"
TypeError: type mismatch: String given
from (irb):1:in `=~'
from (irb):1

Is this a documentation bug?

pistos · March 30, 2006, 12:46am

On 3/29/06, Robert D. wrote:

(A) Somebody might rely on
“a” =~ nil returning false
in that case abandon!!! (and ask yourself why you have posted that
question!)

I don’t think anyone’s code will rely on “a” =~ nil returning false.
However, it’s possible and even likely that someone’s code will rely
on “a” =~ not_a_regex not raising an exception. For instance,
REXML::Light::Node supports #=~, as mentioned in my previous post. I
wouldn’t take kindly to an “extension” that breaks REXML.

(B) Matching with nil is probably a mistake that will break the code later,
in other words your astonishment about that behaviour will be “common
sense”, in that case the above extension of a core class is just a great
idea.

I agree matching with nil is probably never the intended usage.
Perhaps String#=~ should be patched specifically to not allow nil. The
current behavior is a side effect of nil being an Object and thus
inheriting Object#=~.

However, I disagree with the sweeping “extension” whereby anything
that’s not a Regexp causes String#=~ to raise an exception. I take
this as an example of how tricky it can be to safely “extend” the
core modules. This doesn’t mean you can’t do it, just that you should
be *very* careful.
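
A narrower alternative that leaves String untouched might look like the sketch below. The helper name match_or_raise is hypothetical; it rejects only nil and lets Regexp objects and Regexp-like objects through:

```ruby
# Hypothetical helper: rejects only nil, leaving Regexp and Regexp-like
# objects (anything that defines its own =~) alone.
def match_or_raise(str, pattern)
  raise ArgumentError, "pattern must not be nil" if pattern.nil?
  str =~ pattern
end

p match_or_raise("my string", /my/)     # => 0
p match_or_raise("my string", /your/)   # => nil

begin
  match_or_raise("my string", nil)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Since this is an ordinary method rather than a change to a core class, code elsewhere that relies on the stock String#=~ keeps working.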

Jacob F.

pistos · March 29, 2006, 10:05pm

if str =~ re

and not tested for re being nil because it doesn’t affect the outcome
of the test.

David

Now I am very much against extending core classes, especially in commonly
used modules. There was an interesting discussion about Rails doing so to
an extreme extent. However, I do not hesitate for a second to do it if I
feel it gives me consistent behaviour throughout my applications,
especially when I change some rather “strange” behaviour.

It is always a good idea to be ready to break “rules” if there are good
reasons to do so. This might as well be such a case. Please note, and that
is important, that my modification will break code clearly and in a
well-defined manner.

It all comes down to the following decision you have to make:

(A) Somebody might rely on
"a" =~ nil returning false
in that case abandon!!! (and ask yourself why you have posted that
question!)

(B) Matching with nil is probably a mistake that will break the code later,
in other words your astonishment about that behaviour will be “common
sense”, in that case the above extension of a core class is just a great
idea.

Robert

pistos · March 30, 2006, 1:04am

Hi –

On Thu, 30 Mar 2006, Robert D. wrote:

It all comes down to the following decision you have to make:
(A) Somebody might rely on
“a” =~ nil returning false
in that case abandon!!! (and ask yourself why you have posted that
question!)

(B) Matching with nil is probably a mistake that will break the code later,
in other words your astonishment about that behaviour will be “common
sense”, in that case the above extension of a core class is just a great
idea.

(B) is where the dangers lie. You may think that someone who uses a
particular feature of Ruby is not programming well, but still, it’s
only yourself that you punish by making your code incompatible with
the language.

David

pistos · March 30, 2006, 10:24am

On 3/30/06, Jacob F. wrote:

REXML::Light::Node supports #=~, as mentioned in my previous post. I
wouldn’t take kindly to an “extension” that breaks REXML.

So you strongly oppose Rails?

However, I disagree with the sweeping “extension” whereby anything
that’s not a Regexp causes String#=~ to raise an exception. I take
this as an example of how tricky it can be to safely “extend” the
core modules. This doesn’t mean you can’t do it, just that you should
be *very* careful.

That was just a toy extension to pass the idea of what Ruby can do to the
poster. Then I found myself mildly attacked about the extension stuff, so I
got mildly defensive. It is not fair to judge the whole idea of extending or
not extending core objects by a simple example that was written to show the
power of Ruby.

I am, however, open to discussion on whether that power of Ruby should be
“removed”, “restricted” or kept. I think that -w or $SAFE levels could do
something about it. Probably worth a different thread.
But I am not Matz.

Cheers
Robert

pistos · March 30, 2006, 7:14pm

On 3/30/06, Robert D. wrote:

On 3/30/06, Jacob F. wrote:

I don’t think anyone’s code will rely on “a” =~ nil returning false.
However, it’s possible and even likely that someone’s code will rely
on “a” =~ not_a_regex not raising an exception. For instance,
REXML::Light::Node supports #=~, as mentioned in my previous post. I
wouldn’t take kindly to an “extension” that breaks REXML.

So you strongly oppose Rails?

I wasn’t aware Rails (or probably more particularly, ActiveSupport)
broke REXML. If so, then yes I’d strongly oppose that particular
extension within ActiveSupport/Rails that breaks REXML.

However, I disagree with the sweeping “extension” whereby anything
that’s not a Regexp causes String#=~ to raise an exception. I take
this as an example of how tricky it can be to safely “extend” the
core modules. This doesn’t mean you can’t do it, just that you should
be *very* careful.

That was just a toy extension to pass the idea of what Ruby can do to the
poster. Then I found myself mildly attacked about the extension stuff, so I got
mildly defensive. It is not fair to judge the whole idea of extending or not extending
core objects by a simple example that was written to show the power of Ruby.

I understand that. I just thought it a good opportunity to point out
how easy it is to make mistakes like this. I make them all the time as
well. I was not judging the general idea of extending core objects –
I, like you, would be very put out if the power were removed. I like
to play/use it and I’ve seen some very nifty and useful things done
with it. I was just pointing out how easy it is to break other
people’s code if you extend core classes without being careful.

And don’t worry, we weren’t attacking you or your coding ability by
pointing out the flaws in the code you proposed as an example. Just
trying to make sure the bases were covered. Please keep experimenting
and posting your ideas; and when you find flaws in my code, enlighten
me please!

Jacob F.