Negate a character sequence in a regular expression?

unknown · November 29, 2007, 11:32pm

For the following string:

‘cat sheep horse cat tac dog’

I would like to write a regular expression that matches any substring
that is prefixed by the word ‘cat’, is then followed by any characters
as long as those characters do not comprise the word ‘cat’, and then
finally suffixed by the string ‘dog’. Therefore, this expression
should match the substring ‘cat tac dog’ in the above string.

Obviously, if I write an expression like:

irb(main):002:0> /cat.*dog/.match(‘cat sheep horse cat tac dog’).to_s
=> “cat sheep horse cat tac dog”

it will match the entire string.

And the non-greedy Kleene doesn’t buy me anything either since the
expression matches the first cat found anyway:

irb(main):003:0> /cat.*?dog/.match(‘cat sheep horse cat tac dog’).to_s
=> “cat sheep horse cat tac dog”

What I think I want to do is to negate a sequence of characters,
rather than just a character class, but I have looked around and not
found anything quite right.

Of course, there are ways of hacking this out, e.g. I could reverse
the string first and match ‘god’ followed by the first instance of
‘tac’, but I am hoping there is a more elegant way to do this with a
single regular expression.

Thanks–

unknown · November 30, 2007, 12:09am

I think this gets you close to where you want to be:

irb(main):045:0> ‘cat sheep horse cat tac dog’ =~ /cat(?!.cat)(.)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

On Nov 29, 2007 2:07 PM, [email protected] wrote:

For the following string:

‘cat sheep horse cat tac dog’

I would like to write a regular expression that matches any substring
that is prefixed by the word ‘cat’, is then followed by any characters
as long as those characters do not comprise the word ‘cat’, and then
finally suffixed by the string ‘dog’. Therefore, this expression
should match the substring ‘cat tac dog’ in the above string.

I think this gets you closer to where you want to be:

irb(main):045:0> ‘cat sheep horse cat tac dog’ =~ /cat(?!.cat)(.)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

unknown · November 30, 2007, 1:30am

On Nov 29, 3:47 pm, [email protected] wrote:

Obviously, if I write an expression like:
=> “cat sheep horse cat tac dog”
Thanks–
If you just want the right-most match of the substring prefixed by
‘cat’ and suffixed by ‘dog’, this should work: /.*(cat.*dog)/.match(s)
[1].

Regards,
Jordan

unknown · November 30, 2007, 2:25am

For the following string:

‘cat sheep horse cat tac dog’

I would like to write a regular expression that matches any substring
that is prefixed by the word ‘cat’, is then followed by any characters
as long as those characters do not comprise the word ‘cat’, and then
finally suffixed by the string ‘dog’. Therefore, this expression
should match the substring ‘cat tac dog’ in the above string.

Working out negative regular expressions is normally best avoided.

One step is hard. Two steps is not:

x = ‘cat sheep horse cat tac dog’
/(cat.*?dog)/.match(x) && $1.sub(/.*cat/,‘cat’)

Or if you want multiple matches:

x = ‘cat sheep horse cat tac dog cat cat sheep dog’
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,‘cat’)}
=> [“cat tac dog”, “cat sheep dog”]

Dan.

unknown · November 30, 2007, 2:26am

On Nov 29, 5:06 pm, James M. [email protected] wrote:

should match the substring ‘cat tac dog’ in the above string.

I think this gets you closer to where you want to be:

irb(main):045:0> ‘cat sheep horse cat tac dog’ =~ /cat(?!.cat)(.)dog/
=> 16
irb(main):046:0> $1
=> " tac "
irb(main):047:0>

The (?! bit is a nonmatching lookahead.

Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

‘cat sheep horse cat tac dog’ =~ /(cat(?!.*cat).*dog)/
=> 16
$1
=> “cat tac dog”

unknown · December 1, 2007, 9:19am

yermej wrote:

Nonmatching (or negative) lookahead is what you want, and with some
adjustment of the capture you get:

‘cat sheep horse cat tac dog’ =~ /(cat(?!.*cat).*dog)/
=> 16
$1
=> “cat tac dog”

Negative lookaheads that contain ‘.*’ are hard to comprehend (at least
for me). It is enough to add a ‘cat’ at the end and the regexp does not
find any more the ‘cat tac dog’ that should be matched:

‘cat sheep horse cat tac dog lion cat’ =~ /(cat(?!.*cat).*dog)/
=> nil

Also notice that, when it works, it will always give the last expression
present:
‘cat sheep horse cat tac dog lion cat dog’ =~ /(cat(?!.*cat).*dog)/
=> 33
p $1 # => “cat dog”

Daniel S. wrote:

Working out negative regular expressions is normally best avoided.

I would say that it is true when they contain ‘.*’ type expressions;
else they can be extremely useful.

Or if you want multiple matches:

x = ‘cat sheep horse cat tac dog cat cat sheep dog’
x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,‘cat’)}
=> [“cat tac dog”, “cat sheep dog”]

Very ingenious…

Raul

unknown · December 1, 2007, 10:16am

On Dec 1, 2:19 am, Raul P. [email protected] wrote:

Negative lookaheads that contain ‘.*’ are hard to comprehend (at least
p $1 # => “cat dog”

x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,‘cat’)}
=> [“cat tac dog”, “cat sheep dog”]

Very ingenious…

Raul

Posted viahttp://www.ruby-forum.com/.

Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation. I guess
when I’m not completely sure about such things, I should start a new
thread, but attempting to answer questions here (some of which I do
get right) has been a great help to me in my own learning. I think
that starting new threads on all occasions would get to be too much
and I probably wouldn’t get many responses. In the future, I may just
keep quiet until I’m either sure I have the correct answer or at least
don’t understand why my answer isn’t correct.

unknown · December 1, 2007, 10:30am

On Dec 1, 3:10 am, yermej [email protected] wrote:

=> 16
Also notice that, when it works, it will always give the last expression

Posted viahttp://www.ruby-forum.com/.
don’t understand why my answer isn’t correct.
Not to blow my own horn, but I think the behavior requested by the OP
is still /.*(cat.*dog)/…(“a regular expression that matches any
substring
that is prefixed by the word ‘cat’, is then followed by any characters
as long as those characters do not comprise the word ‘cat’, and then
finally suffixed by the string ‘dog’”).

But don’t feel like you have to be 100% correct in order to help out.
My answers are often bass-ackwards (not really a bragging point, heh).
But the point is, like you say, to learn from each other and help
where we can. None of us knows it all; all we can do is our best, and
pray it might help someone here or there.

Regards,
Jordan

unknown · December 1, 2007, 10:35am

In article
[email protected],
[email protected] writes:

For the following string:

‘cat sheep horse cat tac dog’

I would like to write a regular expression that matches any substring
that is prefixed by the word ‘cat’, is then followed by any characters
as long as those characters do not comprise the word ‘cat’, and then
finally suffixed by the string ‘dog’. Therefore, this expression
should match the substring ‘cat tac dog’ in the above string.

% /usr/bin/ruby -e ‘p /cat((?!cat).)*dog/.match(“cat sheep horse cat tac
dog”).to_s’
“cat tac dog”

unknown · December 1, 2007, 9:22pm

yermej wrote:

Raul

Posted viahttp://www.ruby-forum.com/.

Thanks, Raul, for the clarification on that.

To the original poster, I apologize for the misinformation.
In the future, I may just keep quiet until I’m either sure I have
the correct answer or at least
don’t understand why my answer isn’t correct.

yermej,

I often follow your postings, that I find very interesting. But do not
take hard this; no one can be perfect all of the time.

In this case, it is very easy to fall in the alluring trap of the
negative lookahead with ‘.*’ (see how many people keep posting a variant
of that solution… ?!).

However, I totally share one’s feeling of dismay when we give an advice
that turns out not to be right (or just not completely right); we just
have to accept that we are fallible.

In this case hats off to Daniel S., who showed us how to do it (I
rewrite it to reinstate it, after all the emails that followed):

x = ‘cat sheep horse cat tac dog cat cat sheep dog cat’

x.scan(/cat.*?dog/).map {|x| x.sub(/.*cat/,‘cat’)}

=> [“cat tac dog”, “cat sheep dog”]

Absolutely brilliant

Raul

unknown · December 2, 2007, 10:15am

unknown wrote:

I also tweaked Tanaka’s solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences,
Great

and I added an alternate ‘dog’ to the negative assertion.
Just using a non-greedy expression removes this complication (as you
realized later), although it made the following even more interesting.

Here is my attempt at breaking it down, for anyone who is interested:
/cat(?:(?!cat|dog).)*dog/
1 2 3 4 5 6

The engine starts scanning through the string until matches ‘cat’.

At the first position…

This analysis was great (I had missed Tanaka’s solution).

…In any event, I think I still like Daniel’s solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.

Josh
I think both solutions are indestructible. I was curious to benchmark
them; here, they showed interesting differences:

a) for one-line sentences, where each substring cat…dog contains a few
words, Tanaka’s solution is approx 25% faster.

b) for long paragraphs where cat…dog are separated by (say) 2 dozen
words, Daniels’ solution becomes faster by 20-25% (for longer intervals,
the advantage can go up to 50% and more).

c) For long paragraphs, where cat and dog were separated by only a few
words, the performance is almost identical. So, it is not so much the
length of the string, but the frequency of appearance of the keywords
that is important.

[Note: you need to add the /m modifier if your strings are paragraphs.
In fact, it may be better to always write /m, so that you do not need to
touch the regexp].

The reason for the performance discrepancies is clear: for short gaps
between the keywords (ie cat, dog), the trio scan+map+gsub exacts a
toll. But for long paragraphs with many words between the keywords, it
is instead the negative lookahead on ‘cat’ (at every character) which
suffers.

You may want to take this in account, depending on the type of text
configurations you expect.

In any case, both methods seem technically perfect.

All the best, and thanks for that great summary,

Raul

unknown · December 2, 2007, 1:43am

Hello all - this was my first post, so thank you James, Jordan,
yermej, Daniel, Raul and Tanaka. I really appreciate the help. I’ve
learned quite a bit - especially about negative lookahead. Tricky
stuff.

As you might expect, I am working on a script to scan a big file for
multiple cat…dog pairs. Well not cats and dogs, more like time
stamps, etc.

I’ve tried all the solutions posted here and they work very nicely for
the string I posted. But I botched the original specification in at
least two ways:

I should have mentioned that there might be multiple cat…dog
pairs. Then James would probably have given me something looking like
Tanaka’s solution, which uses negative lookahead but only consumes one
character at a time.
I should have mentioned that the characters between cat and dog
should contain neither ‘cat’ nor ‘dog’. I’m really looking for the
‘innermost’ cat…dog pair.

So I tried Daniel’s solution and that works pretty well. If there is
a way to break it I did not find it.

I also tweaked Tanaka’s solution a little bit after a trip back to the
pickaxe book. I added the (?: to keep the group from generating
backreferences, and I added an alternate ‘dog’ to the negative
assertion. This is confusing but it seems to work:

irb(main):001:0> x = 'cat sheep horse cat tac dog cat cat sheep dog
dog ’
=> "cat sheep horse cat tac dog cat cat sheep dog dog "
irb(main):002:0> x.scan(/cat(?:(?!cat|dog).)*dog/)
=> [“cat tac dog”, “cat sheep dog”]

Here is my attempt at breaking it down, for anyone who is interested:
/cat(?:(?!cat|dog).)*dog/
1 2 3 4 5 6

The engine starts scanning through the string until it matches
‘cat’.
At the first position after the t in cat, the engine tries to
match this expression (?:(?!cat|dog).), zero or more times. The (?:
is nothing special. It is just a grouping that does not generate
backreferences.
The negative lookahead assertion. If the regular expression
following the ! matches the string, starting from the current position
of the engine, then the expression to which the assertion belongs will
fail. But since the negative assertion belongs to a grouping that can
match 0 or more times, the assertion can fail but the full regular
expression can succeed.
The hard part for me is that the negative lookahead assertion does
not consume any characters. It looks ahead, tries to match its
regular expression, and whether it returns ‘MATCH’ or ‘NO MATCH’, the
current position of the engine stays the same.
The regexp for the negative assertion. The stuff between each
cat…dog pair should contain neither ‘cat’ nor ‘dog’.
One character is consumed, the engine moves down the string one
position. Thanks to the negative lookahead, we know that that the
consumed character is not the start of a ‘cat’ or ‘dog’ substring.
Eventually, the (?:(?!cat|dog).)* fails when the current position
of the engine reaches the beginning of a ‘dog’ substring. But then
the last part of the regular expression (6) will match.

Also, another way of getting the ‘innermost’ cat…dog pair is to use
non-greedy kleene star(*?). This way, the engine will take the first
‘dog’ suffix that it finds, rather than the last possible:

irb(main):003:0> x.scan(/cat(?:(?!cat).)*?dog/)
=> [“cat tac dog”, “cat sheep dog”]

So there is more than one way to scan a cat! (Sorry!!!)

…In any event, I think I still like Daniel’s solution better,
because I can look at it and feel fairly certain that it will do
exactly what it should do.

Thanks–
Josh