Forum: Ruby Regexp gotcha

Pistos C. (Guest)
on 2006-03-28 19:23
Hi, all.  I was fixing a bug last night, and discovered some
"gotcha"-like behaviour in the process.  Consider:

irb(main):173:0> s = "my string"
=> "my string"
irb(main):174:0> r1 = /my/
=> /my/
irb(main):175:0> r2 = /your/
=> /your/
irb(main):176:0> r3 = nil
=> nil
irb(main):177:0> s =~ r1
=> 0
irb(main):178:0> s =~ r2
=> nil
irb(main):179:0> s =~ r3
=> false

s =~ r1  .... That's cool, it gives me the index of the match.
s =~ r2  .... That's cool, it tells me there was no match.
s =~ r3  .... Whoa.

The reason this "got me" is that I had this code:

match_result = ( some_string =~ some_regexp )
if match_result != nil
    # Assume there was a match
end

But the problem is... I had an s =~ r3 case because some_regexp was nil,
and so it was entering my if block when I semantically did not want that
to occur.  :(

So now my code is

if match_result != nil and match_result != false
...
end

Note also that I can't even use Regexp.last_match != nil in my if block:

irb(main):224:0> s =~ r2
=> nil
irb(main):225:0> Regexp.last_match
=> nil
irb(main):226:0> s =~ r3
=> false
irb(main):227:0> Regexp.last_match
=> nil
irb(main):228:0> s =~ r1
=> 0
irb(main):229:0> Regexp.last_match
=> #<MatchData:0x406c40b4>
irb(main):230:0> s =~ r3
=> false
irb(main):231:0> Regexp.last_match
=> #<MatchData:0x406c40b4>

To be clear: Note how a "nil type" of non-match overwrites last_match,
but a "false type" of non-match doesn't.
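The Regexp half of that observation can be reproduced in a small sketch. The helper name last_match_after is made up for illustration, and only real Regexp patterns are used, since the result of matching against nil may differ between Ruby versions.

```ruby
# Hypothetical helper: runs two match attempts in order and reports what
# Regexp.last_match holds after the second one.
def last_match_after(str, first, second)
  str =~ first
  str =~ second
  Regexp.last_match
end

# A successful match leaves a MatchData behind...
p last_match_after("my string", /your/, /my/)

# ...and a later failed match against a real Regexp resets it to nil.
p last_match_after("my string", /my/, /your/)
```

So a stale MatchData only survives when the later attempt never reaches the regexp engine at all.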

So the question is... why are BOTH nil and false possible return values
of =~ ?  Is there some benefit to this?  Why not just one or the other?

I see that this behaviour is [documented][1] but I still feel that this
is unintuitive behaviour when people assume =~ only applies to Regexp
RHS's.

[1]: http://www.ruby-doc.org/core/classes/String.html#M001453

Thanks in advance for any and all clarifications and explanations.

Pistos
Victor S. (Guest)
on 2006-03-28 19:30
(Received via mailing list)
> So now my code is
>
> if match_result != nil and match_result != false
> ...
> end

AFAIK, that's equivalent to a simple

if match_result
...
end

because in Ruby only nil and false are "false", whereas any other value
(including 0, '' and []) is "true".
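A short sketch comparing the three checks discussed so far. The values are written out literally rather than produced by =~, so the sketch does not depend on what a given Ruby version returns for a nil right-hand side; the method name checks is hypothetical.

```ruby
# Hypothetical helper: applies each check from the thread to a stored
# match result and reports which ones would enter the if block.
def checks(match_result)
  {
    not_nil: match_result != nil,
    not_nil_or_false: match_result != nil && match_result != false,
    truthy: match_result ? true : false
  }
end

[0, nil, false, "", []].each do |value|
  puts "#{value.inspect}: #{checks(value)}"
end
```

Only the first check lets false through, which is the case that triggered the original bug; the other two agree on every value.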

Victor.
unknown (Guest)
on 2006-03-28 19:33
(Received via mailing list)
Hi --

On Wed, 29 Mar 2006, Pistos C. wrote:

> So now my code is
>
> if match_result != nil and match_result != false

You could shorten that to:

   if match_result

I'm not sure about the "why" part of the nil/false thing.


David

--
David A. Black (removed_email_address@domain.invalid)
Ruby Power and Light, LLC (http://www.rubypowerandlight.com)

"Ruby for Rails" chapters now available
from Manning Early Access Program! http://www.manning.com/books/black
Kevin O. (Guest)
on 2006-03-28 19:40
(Received via mailing list)
You could just do this...

if string =~ /(\w)/
  #do something with $1
end

Any time you try to match a string against a non-Regexp object you get
false, I think.

_Kevin
Pistos C. (Guest)
on 2006-03-28 19:48
Victor S. wrote:
> if match_result
> ...
> end
>
> because in Ruby only nil and false are "false", whereas any other value
> (including 0, '' and []) is "true".

Yep, thank you to you and David.  I forgot that I could rewrite it like
that.

Kevin wrote:
> You could just do this...
> if string =~ /(\w)/
>   #do something with $1
> end

Well, in this particular case, I am using the Fixnum returned, which is
why I am making the assignment.  I normally otherwise do as you say,
using "if string =~ /regexp/".
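A sketch of keeping the returned index while still guarding with a plain truthiness test. The method name text_before_match is invented for the example.

```ruby
# Hypothetical helper: returns the text before the first match, or nil
# when there is no match to take an index from.
def text_before_match(some_string, some_regexp)
  index = some_string =~ some_regexp
  index ? some_string[0, index] : nil
end

p text_before_match("my string", /str/)   # "my "
p text_before_match("my string", /your/)  # nil
```

Because 0 is truthy in Ruby, a match at the very start of the string is still treated as a match.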

Pistos
Robert K. (Guest)
on 2006-03-28 22:55
(Received via mailing list)
Pistos C. wrote:
>
> Kevin wrote:
>> You could just do this...
>> if string =~ /(\w)/
>>   #do something with $1
>> end
>
> Well, in this particular case, I am using the Fixnum returned, which is
> why I am making the assignment.  I normally otherwise do as you say,
> using "if string =~ /regexp/".

Personally I prefer to use /rx/ =~ str over str =~ /rx/ - to me this
makes it clearer that the RX is the one that does the matching.  Just
personal taste maybe but I think I also remember that that variant is a
tad faster.

Kind regards

	robert
James H. (Guest)
on 2006-03-29 00:54
(Received via mailing list)
Are you asking why we can write

  if match_result

as equivalent to

  if match_result != nil and match_result != false

?

James H.
Pistos C. (Guest)
on 2006-03-29 19:44
James H. wrote:
> Are you asking why we can write
>
>   if match_result
>
> as equivalent to
>
>   if match_result != nil and match_result != false
>
> ?

No, not at all.  :)  I'm just whining a bit that =~ can return both nil
and false.  It's not that big a deal, but this is something that could
catch unaware people who would follow down the same tracks as I did and
assume match_result != nil would cover all the bases, when it doesn't.

Robert K. wrote:
> Personally I prefer to use /rx/ =~ str over str =~ /rx/ - to me this
> makes it clearer that the RX is the one that does the matching.  Just
> personal taste maybe but I think I also remember that that variant is a
> tad faster.

I didn't even realize this could be done :) (though I see now that it is
documented, =~ being a synonym for Regexp#match).  If I lived in a
vacuum for the last 15 years, and Ruby was the first and only
programming language I ever learned, then I would have done it that way,
too, from the very start.  :)  Alas, I came from [other languages and
then] Perl, so it was just a carry over to continue using str =~
/regexp/.

FWIW, we still have the same problem:

irb(main):255:0> r1 =~ s
=> 0
irb(main):256:0> r2 =~ s
=> nil
irb(main):257:0> r3 =~ s
=> false

I've taken note that you say that r =~ s is faster.  I (or someone else)
will have to do some benchmarking to see whether that's really true, and
how much speed gain can be had.  Diakonos suffers when you use large and
many regexps for syntax highlighting, so I'd be interested in anything
that can speed that up.
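One way to check the speed claim is with the standard library's Benchmark module. This is only a rough sketch: the method name, the sample string and the iteration count are arbitrary choices, and the numbers will vary by Ruby version and machine, so no result is claimed here.

```ruby
require "benchmark"

# Hypothetical helper: times the same match in both operand orders and
# returns the elapsed seconds for each.
def time_match_orders(str, rx, iterations)
  {
    "str =~ rx" => Benchmark.realtime { iterations.times { str =~ rx } },
    "rx =~ str" => Benchmark.realtime { iterations.times { rx =~ str } }
  }
end

time_match_orders("my string", /my/, 200_000).each do |label, seconds|
  puts format("%-10s %.4fs", label, seconds)
end
```

For a real decision it would make sense to run this against the actual syntax-highlighting regexps and input sizes rather than a toy string.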

Pistos
Robert D. (Guest)
on 2006-03-29 20:29
(Received via mailing list)
On 3/29/06, Pistos C. <removed_email_address@domain.invalid> wrote:
> I've taken note that you say that r =~ s is faster.  I (or someone else)
> Pistos
I do not want to get into the discussion of whether it should be like
that or not.
The beauty of Ruby is that you can change a lot of things if you do not
like them.
In your case, maybe you would like the following:
----------------------------------------------------------------
class String
    alias_method :__old_match, :=~
    # Refuse anything that is not a Regexp instead of returning false.
    def =~(obj)
        raise RuntimeError,
               "#{obj.nil? ? "nil" : obj.to_s} is not a Regexp but #{obj.class}" unless Regexp === obj
        __old_match obj
    end # def =~(obj)
end # class String

puts "x" =~ %r{.}   # prints 0
puts "x" =~ nil     # raises RuntimeError

------------------------------

Hope this is helpful.

Cheers
Robert

--
Deux choses sont infinies : l'univers et la bêtise humaine ; en ce qui
concerne l'univers, je n'en ai pas acquis la certitude absolue.

- Albert Einstein
Pistos C. (Guest)
on 2006-03-29 20:40
Robert D. wrote:
> class String
>     alias_method :__old_match, :=~
>     def =~(obj)
>         raise RuntimeError,
>                "#{obj.nil? ? "nil" : obj.to_s} is not a Regexp but #{obj.class}" unless Regexp === obj
>         __old_match obj
>     end # def =~(obj)
> end # class String

Thanks for the suggestion, Robert.

While I have no qualms about extending core classes, this sort of
adjustment feels like too... "brash"? of a modification for me.  :)

In this particular case, rewriting my if line is an acceptable solution
to the problem.

Pistos
unknown (Guest)
on 2006-03-29 21:08
(Received via mailing list)
Hi --

On Thu, 30 Mar 2006, Pistos C. wrote:

>
> Thanks for the suggestion, Robert.
>
> While I have no qualms about extending core classes, this sort of
> adjustment feels like too... "brash"? of a modification for me.  :)

You should have *some* qualms :-)  But in any case, this change is
certainly one I wouldn't make.  I have no way of knowing whether
someone has done this somewhere:

   if str =~ re

and not tested for re being nil because it doesn't affect the outcome
of the test.


David

Jacob F. (Guest)
on 2006-03-29 21:26
(Received via mailing list)
On 3/29/06, removed_email_address@domain.invalid 
<removed_email_address@domain.invalid> wrote:
> You should have *some* qualms :-)  But in any case, this change is
> certainly one I wouldn't make.  I have no way of knowing whether
> someone has done this somewhere:
>
>    if str =~ re
>
> and not tested for re being nil because it doesn't affect the outcome
> of the test.

In fact, my initial guess[1] was that String#=~ is defined something
like:

  class String
    def =~( other )
      other =~ self
    end
  end

And of course the default Object#=~ is:

  class Object
    def =~( other )
      false
    end
  end

(observed by the fact that Object.new =~ Object.new returns false
without raising NoMethodError).

This is useful, because someone might come along and define #=~ for
their new class Foo which is close enough semantically to a Regexp to
merit the operator overload, but is not a Regexp. ri tells me that
REXML::Light::Node#=~ exists, for example. A String#=~ that raises for
every non-Regexp argument would cause that to break.
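A minimal sketch of that dispatch, using a made-up Prefix class (it is not part of Ruby or REXML). Because the right-hand side is neither a Regexp nor a String, String#=~ hands the call over to the object's own =~.

```ruby
# Hypothetical pattern-like class used only for illustration.
class Prefix
  def initialize(text)
    @text = text
  end

  # Returns 0 when str starts with the prefix, nil otherwise,
  # mirroring the index-or-nil convention of Regexp matching.
  def =~(str)
    str.start_with?(@text) ? 0 : nil
  end
end

p "my string" =~ Prefix.new("my")    # 0
p "my string" =~ Prefix.new("your")  # nil
```

A String#=~ override that rejects everything except Regexp would raise on both of these lines instead of reaching Prefix#=~.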

-- Jacob F.

[1] The actual behavior of String#=~ is documented (from ri):

     If _obj_ is a +Regexp+, use it as a pattern to match
     against _str_. If _obj_ is a +String+, look for it in _str_
     (similar to +String#index+). Returns the position the match starts,
     or +nil+ if there is no match. Otherwise, invokes _obj.=~_, passing
     _str_ as an argument. The default +=~+ in +Object+ returns +false+.

However, trying string =~ another_string gives me an exception in irb:

  $ irb
  >> "looking for bob..." =~ "bob"
  TypeError: type mismatch: String given
        from (irb):1:in `=~'
        from (irb):1

Is this a documentation bug?
Andrew J. (Guest)
on 2006-03-29 21:36
(Received via mailing list)
On Thu, 30 Mar 2006 02:24:41 +0900, Jacob F. 
<removed_email_address@domain.invalid>
wrote:

[snip]
> However, trying string =~ another_string gives me an exception in irb:
>
>   $ irb
>   >> "looking for bob..." =~ "bob"
>   TypeError: type mismatch: String given
>         from (irb):1:in `=~'
>         from (irb):1
>
> Is this a documentation bug?

Yes, I submitted a patch last week via the RubyForge bug form.
Basically, String =~ String was deprecated in 1.8.0 / 1.8.1 (generated
an obsolete warning), and generates a TypeError since 1.8.2. I have no
idea on the status of the patch.
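For the plain substring search that the outdated documentation described, String#index or an escaped Regexp are the usual substitutes. A small sketch, in which literal_pattern is a hypothetical helper name:

```ruby
# Hypothetical helper: wraps a plain String in an escaped Regexp so it
# can be used on the right-hand side of =~.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

haystack = "looking for bob..."
needle   = "bob"

begin
  haystack =~ needle
rescue TypeError => e
  puts "TypeError: #{e.message}"
end

p haystack.index(needle)               # 12
p haystack =~ literal_pattern(needle)  # 12
```

Regexp.escape matters when the needle contains characters such as "." that would otherwise be read as regexp syntax.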

andrew
Jacob F. (Guest)
on 2006-03-29 21:42
(Received via mailing list)
On 3/29/06, Andrew J. <removed_email_address@domain.invalid> wrote:
> On Thu, 30 Mar 2006 02:24:41 +0900, Jacob F. <removed_email_address@domain.invalid> wrote:
> > Is this a documentation bug?
>
> Yes, I submitted a patch last week via the rubyforge bug form. Basically,
> String =~ String was deprecated in 1.8.0 / 1.8.1 (generated obsolete
> warning), and generates a TypeError since 1.8.2. I have no idea on the
> status of the patch.

Ok, thanks. Glad to know there's already been a patch submitted.

Jacob F.
Robert D. (Guest)
on 2006-03-30 00:05
(Received via mailing list)
>    if str =~ re
>
> and not tested for re being nil because it doesn't affect the outcome
> of the test.
>
>
> David


Now, I am very much against extending core classes, especially in
commonly used modules.
There was an interesting discussion about Rails doing so to an extreme
extent.
However, I do not hesitate for a second to do it if I feel it gives me
consistent behaviour throughout my applications, especially when I
change some rather "strange" behaviour.
It is always a good idea to be ready to break "rules" if there are good
reasons to do so.
This might as well be such a case.
Please note, and that is important, that my modification will break code
clearly and in a well-defined manner.
It all comes down to the following decision you have to make:
(A) Somebody might rely on
   "a" =~ nil returning false
in that case abandon!!! (and ask yourself why you have posted that
question!)

(B) Matching with nil is probably a mistake that will break the code
later; in other words, your astonishment about that behaviour will be
"common sense". In that case the above extension of a core class is just
a great idea.

Robert

Jacob F. (Guest)
on 2006-03-30 02:46
(Received via mailing list)
On 3/29/06, Robert D. <removed_email_address@domain.invalid> wrote:
> (A) Somebody might rely on
>    "a" =~ nil returning false
> in that case abandon!!! (and ask yourself why you have posted that
> question!)

I don't think anyone's code will rely on "a" =~ nil returning false.
However, it's possible and even likely that someone's code will rely
on "a" =~ not_a_regex not raising an exception. For instance,
REXML::Light::Node supports #=~, as mentioned in my previous post. I
wouldn't take kindly to an "extension" that breaks REXML.

> (B) Matching with nil is probably a mistake that will break the code later,
> in other words your astonishment about that behaviour will be "common
> sense", in that case the above extension of a core class is just a great
> idea.

I agree matching with nil is probably never the intended usage.
Perhaps String#=~ should be patched specifically to not allow nil. The
current behavior is a side effect of nil being an Object and thus
inheriting Object#=~.

However, I disagree with the sweeping "extension" whereby anything
that's not a Regexp causes String#=~ to raise an exception. I take
this as an example of how tricky it can be to *safely* "extend" the
core modules. This doesn't mean you can't do it, just that you should
be *very* careful.
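One narrower alternative that leaves String#=~ untouched is a small wrapper used only where a nil pattern would indicate a bug. This is just a sketch; match_index is a hypothetical name, and raising ArgumentError is one possible choice among several.

```ruby
# Hypothetical helper: behaves like str =~ pattern, but refuses a nil
# pattern instead of quietly reporting "no match". Any other object,
# including pattern-like classes that define their own =~, is passed on.
def match_index(str, pattern)
  raise ArgumentError, "pattern is nil" if pattern.nil?
  str =~ pattern
end

p match_index("my string", /my/)    # 0
p match_index("my string", /your/)  # nil

begin
  match_index("my string", nil)
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Since nothing in the core classes is redefined, other libraries that rely on =~ with non-Regexp operands keep working.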

Jacob F.
unknown (Guest)
on 2006-03-30 03:04
(Received via mailing list)
Hi --

On Thu, 30 Mar 2006, Robert D. wrote:

> It all comes down to the following decision you have to make:
> (A) Somebody might rely on
>   "a" =~ nil returning false
> in that case abandon!!! (and ask yourself why you have posted that
> question!)
>
> (B) Matching with nil is probably a mistake that will break the code later,
> in other words your astonishment about that behaviour will be "common
> sense", in that case the above extension of a core class is just a great
> idea.

(B) is where the dangers lie. You may think that someone who uses a
particular feature of Ruby is not programming well, but still, it's
only yourself that you punish by making your code incompatible with
the language :-)


David

Robert D. (Guest)
on 2006-03-30 12:24
(Received via mailing list)
On 3/30/06, Jacob F. <removed_email_address@domain.invalid> wrote:
> REXML::Light::Node supports #=~, as mentioned in my previous post. I
> wouldn't take kindly to an "extension" that breaks REXML.


So you strongly oppose Rails?

>
> However, I disagree with the sweeping "extension" whereby anything
> that's not a Regexp causes String#=~ to raise an exception. I take
> this as an example of how tricky it can be to *safely* "extend" the
> core modules. This doesn't mean you can't do it, just that you should
> be *very* careful.


That was just a toy extension to show the poster what Ruby can do.
Then I found myself mildly attacked about the extension stuff, so I got
mildly defensive.
It is *not* fair to judge the whole idea of extending or not extending
core objects by a simple example that was written to show the power of
Ruby.

I am, however, open to discussing whether that power of Ruby should be
"removed", "restricted" or kept.
I think that -w or $SAFE levels could do something about it.
Probably worth a different thread.
But I am not Matz :(

Cheers
Robert

Jacob F. (Guest)
on 2006-03-30 21:14
(Received via mailing list)
On 3/30/06, Robert D. <removed_email_address@domain.invalid> wrote:
> On 3/30/06, Jacob F. <removed_email_address@domain.invalid> wrote:
> > I don't think anyone's code will rely on "a" =~ nil returning false.
> > However, it's possible and even likely that someone's code will rely
> > on "a" =~ not_a_regex not raising an exception. For instance,
> > REXML::Light::Node supports #=~, as mentioned in my previous post. I
> > wouldn't take kindly to an "extension" that breaks REXML.
>
> So you strongly oppose Rails?

I wasn't aware Rails (or probably more particularly, ActiveSupport)
broke REXML. If so, then yes I'd strongly oppose that particular
extension within ActiveSupport/Rails that breaks REXML.

> > However, I disagree with the sweeping "extension" whereby anything
> > that's not a Regexp causes String#=~ to raise an exception. I take
> > this as an example of how tricky it can be to *safely* "extend" the
> > core modules. This doesn't mean you can't do it, just that you should
> > be *very* careful.
>
> That was just a toy extension to show the poster what Ruby can do.
> Then I found myself mildly attacked about the extension stuff, so I got
> mildly defensive. It is *not* fair to judge the whole idea of extending or
> not extending core objects by a simple example that was written to show the
> power of Ruby.

I understand that. I just thought it a good opportunity to point out
how easy it is to make mistakes like this. I make them all the time as
well. I was not judging the general idea of extending core objects --
I, like you, would be very put out if the power were removed. I like
to play/use it and I've seen some very nifty and useful things done
with it. I was just pointing out how easy it is to break other
people's code if you extend core classes without being careful.

And don't worry, we weren't attacking you or your coding ability by
pointing out the flaws in the code you proposed as an example. Just
trying to make sure the bases were covered. Please keep experimenting
and posting your ideas; and when you find flaws in my code, enlighten
me please! :)

Jacob F.