Regular expression negate a word (not character)

winterheat · January 26, 2008, 2:20am

somebody who is a regular expression guru… how do you negate a word
and grep for all words that is

tire

but not

snow tire

or

snowtire

so for example, it will grep for

winter tire
tire
retire
tired

but will not grep for

snow tire
snow tire
some snowtires

need to do it in one regular expression

winterheat · January 26, 2008, 3:20am

On Jan 25, 5:16 pm, Summercool [email protected] wrote:

snowtire

i could think of something like

/[^s][^n][^o][^w]\s*tire/i

but what if it is not snow but some 20 character-word, then do we need
to do it 20 times to negate it? any shorter way?

winterheat · January 26, 2008, 3:35am

SpringFlowers AutumnMoon wrote:

On Jan 25, 5:16 pm, Summercool [email protected] wrote:

snowtire

i could think of something like

/[^s][^n][^o][^w]\s*tire/i

but what if it is not snow but some 20 character-word, then do we need
to do it 20 times to negate it? any shorter way?

I took a long look at this and I came up with a number of different
methods, including an idea like the one you have above. If you have a
set number of bad/undesirable words then everything falls apart. I
tried negative look behinds but those don’t work well with 0 or more
spaces because look-behinds have to have a fixed length. I really don’t
think that this could be done elegantly with a single regular expression
if you have multiple bad/undesirable words. However, if you split this
into two regular expressions then it becomes rather straightforward.

I really have spent the last 20 minutes trying out different
possibilities with a single regular expression but it just doesn’t seem
worth the difficulty =(

May I ask why there is the requirement for a single regular expression?

Joe P

winterheat · January 26, 2008, 3:44am

On Jan 25, 2008 6:19 PM, Summercool [email protected] wrote:

or

snowtire

i could think of something like

/[^s][^n][^o][^w]\s*tire/i

but what if it is not snow but some 20 character-word, then do we need
to do it 20 times to negate it? any shorter way?

(?!snow)(\S{4})\s*(tire)|^\S{0,3}\s*(tire)

I’m not thrilled with that, but without look-behind, it’s rough to do
what you’re asking.

Shameless pluggery: I used RegexpBench to do the experimentation to
find your answer.

Judson
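One way to see what Judson's pattern accepts is to run it over the thread's sample phrases. The sketch below uses Python's `re` module, which is an assumption, since the post does not say which engine the pattern was tested with. The helper name `judson_matches` and the `IGNORECASE` flag are also illustrative additions.

```python
import re

# Judson's pattern, copied as posted. IGNORECASE is an added assumption.
JUDSON = re.compile(r"(?!snow)(\S{4})\s*(tire)|^\S{0,3}\s*(tire)", re.IGNORECASE)

def judson_matches(phrase):
    """Return True if the pattern finds a 'tire' it considers acceptable."""
    return JUDSON.search(phrase) is not None

for phrase in ["winter tire", "tire", "retire", "tired",
               "snow tire", "snowtire", "some snowtires"]:
    print("%-16s %s" % (phrase, judson_matches(phrase)))
```

Note that the first branch insists on exactly four non-space characters directly before the optional whitespace, so the `{4}` is tied to the length of "snow", and a phrase such as "buy a tire" is not matched at all.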

winterheat · January 26, 2008, 4:30am

On Jan 25, 6:35 pm, Joseph P. [email protected] wrote:

I really have spent the last 20 minutes trying out different
possibilities with a single regular expression but it just doesn’t seem
worth the difficulty =(

May I ask why there is the requirement for a single regular expression?

Joe P

thanks for your post. a reason is that some text editor lets users
search all files using a regular expression… another reason is
that… if 2 lines are used to test… then what if that line actually
has tire and snowtire… then it may negate the whole line as a
result, even though we want to grep it due to the first word “tire”.

winterheat · January 26, 2008, 5:46am

“Summercool” [email protected] wrote in message
news:[email protected]…

or
but will not grep for

snow tire
snow tire
some snowtires

need to do it in one regular expression

What you want is a negative lookbehind assertion:

>>> re.search(r'(?<!snow)tire','snowtire') # no match
>>> re.search(r'(?<!snow)tire','baldtire')
<_sre.SRE_Match object at 0x00FCD608>

Unfortunately you want variable whitespace:

>>> re.search(r'(?<!snow\s*)tire','snow tire')
Traceback (most recent call last):
  File "", line 1, in
  File "C:\dev\python\lib\re.py", line 134, in search
    return _compile(pattern, flags).search(string)
  File "C:\dev\python\lib\re.py", line 233, in _compile
    raise error, v # invalid expression
error: look-behind requires fixed-width pattern

Python doesn’t support lookbehind assertions that can vary in size. This
doesn’t work either:

>>> re.search(r'(?<!snow)\s*tire','snow tire')
<_sre.SRE_Match object at 0x00F93480>
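The last attempt matches because `\s*` is allowed to match nothing: the engine can start right at the "t", where the four preceding characters are "now " rather than "snow". The Python 3 sketch below shows the match position, followed by one possible workaround that is not from the thread: chaining one fixed-width look-behind per allowed gap width. The limit of two whitespace characters is an arbitrary assumption.

```python
import re

# \s* may match zero characters, so the look-behind is evaluated right
# before "tire", where the preceding four characters are "now ".
m = re.search(r"(?<!snow)\s*tire", "snow tire")
print(m.start())  # offset of the "t", not of the space

# Assumed workaround: one fixed-width look-behind per allowed gap
# (here 0, 1 or 2 whitespace characters between "snow" and "tire").
BOUNDED = re.compile(r"(?<!snow)(?<!snow\s)(?<!snow\s\s)tire")

for phrase in ["snowtire", "snow tire", "snow  tire", "winter tire"]:
    print(repr(phrase), bool(BOUNDED.search(phrase)))
```

This only covers gaps up to the widths that are spelled out, so it is a bounded approximation rather than a true variable-width look-behind.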

Here’s some code (not heavily tested) that implements a variable
lookbehind assertion, and a function to mark matches in a string to
demonstrate it:

# BEGIN CODE

import re

def finditerexcept(pattern, notpattern, string):
    # Try the unwanted form first so it consumes text that would
    # otherwise match the wanted pattern, then discard those matches.
    for matchobj in re.finditer('(?:%s)|(?:%s)' % (notpattern, pattern), string):
        if not re.match(notpattern, matchobj.group()):
            yield matchobj

def markexcept(pattern, notpattern, string):
    substrings = []
    current = 0

    # Copy the text between matches and wrap each kept match in brackets.
    for matchobj in finditerexcept(pattern, notpattern, string):
        substrings.append(string[current:matchobj.start()])
        substrings.append('[' + matchobj.group() + ']')
        current = matchobj.end()

    substrings.append(string[current:])
    return ''.join(substrings)

# END CODE

>>> sample='''winter tire
... tire
... retire
... tired
... snow tire
... snow tire
... some snowtires
... '''
>>> print markexcept('tire','snow\s*tire',sample)
winter [tire]
[tire]
re[tire]
[tire]d
snow tire
snow tire
some snowtires

–Mark
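The same idea can be approximated with a single alternation in which the unwanted form is listed first and only the wanted form is captured. This is a minimal Python 3 sketch with the hypothetical name `find_wanted`. It still needs a line of code to check the capture group, so it would not help in an editor that only accepts a bare pattern.

```python
import re

# The unwanted form comes first so it consumes "snow tire" before the
# bare "tire" branch can see it. Only the second branch captures.
PATTERN = re.compile(r"snow\s*tire|(tire)")

def find_wanted(text):
    """Return start offsets of every 'tire' not preceded by 'snow' + spaces."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1) is not None]

print(find_wanted("snow tire and regular tire"))
print(find_wanted("some snowtires"))
```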

winterheat · January 26, 2008, 5:44am

SpringFlowers AutumnMoon wrote:

May I ask why there is the requirement for a single regular expression?

thanks for your post. a reason is that some text editor lets users
search all files using a regular expression… another reason is
that… if 2 lines are used to test… then what if that line actually
has tire and snowtire… then it may negate the whole line as a
result, even though we want to grep it due to the first word “tire”.

This is rather interesting to me. I recently (Dec-Jan) wrote a little
find/replace Ruby script that can deal with multiple files. I call the
utility rr.

What you’re suggesting is a pretty cool idea and opens a number of
possible improvements that I did not think about. I can extend rr to
take multiple regular expressions, and allow the user to say yes match
this regex and No do not match this regular expression. I could also
simply add an option to print out only the files where the regular
expression has a match, without performing the find/replace.

I will have to think this through, especially this Sunday when I have
more time.

I am sorry that this doesn’t help you with your search for a
single-regular-expression solution, but I want to repeat that this seems so
much easier using two regular expressions that I think developing such a
utility would be worthwhile. I am really looking forward to
implementing these new ideas. For that I thank you!

I’m a rather intermediate Ruby programmer but if anyone would like to
check out rr they can at my blog. Here is a link to the most recent
article:

winterheat · January 26, 2008, 5:59am

Mark Tolonen, those were the exact Ruby negative look-behinds that I
used. It’s good to see that we had the same idea!

winterheat · January 26, 2008, 6:59am

I just wrote up a quick script to do what I was thinking. I decided to
make a different utility only because of the complications that would
arise with tons of switches on the command line if I were to add it to a
find/replace utility. (The user would have to say which regex they
wanted for the actual replacement, and other inherent problems… moving
on)

So without further ado, here is my example

joe[~/code/script]$ cat > input
winter tire
tire
retire
tired
snow tire
snow tire
some snowtires

joe[~/code/script]$ grepall -2 tire --neg snow input
input [1]: winter tire
input [2]: tire
input [3]: retire
input [4]: tired

joe[~/code/script]$ grepall
usage: grepall [-#] ( [-n] regex ) [filenames]

# - the number of regular expressions, defaults to 1
regex - regular expressions to be checked on the line
filenames - names of the input files to be parsed, if blank uses STDIN

options:
--neg or -n do not match this regular expression

special note:
When using bash, if you want backslashes in the replace portion make
sure to use the multiple argument usage with single quotes for the
replacement.

The utility is hopefully easy to understand, although the usage is
tough to present:

line by line processing
in the above example the -2 says there will be two regular expressions
the first is /tire/ and that needs to match
the second is /snow/ and that is negated because of the --neg (or just -n) option
the last argument is the filename

The output needs to be tweaked, maybe so it’s more like grep. Right now
it allows for multiple files so it prints the filename, [line number],
and the line where there was a full match for all the regular
expressions as correctly matched (negated where necessary). Obviously
this is very simple at the moment and it doesn’t cover the specific
situation you mentioned where there was the word tire and snowtire on
the same line.
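The line-by-line logic described above can be sketched roughly as follows in Python 3. `grep_all` is a hypothetical stand-in, not the actual grepall script (whose source is not shown in the thread), and the sample list is made up from the thread's test phrases.

```python
import re

def grep_all(lines, required, forbidden):
    """Yield (line_number, line) for lines that match every pattern in
    `required` and no pattern in `forbidden` (hypothetical helper)."""
    for number, line in enumerate(lines, start=1):
        has_all = all(re.search(p, line) for p in required)
        has_bad = any(re.search(p, line) for p in forbidden)
        if has_all and not has_bad:
            yield number, line

sample = ["winter tire", "tire", "retire", "tired",
          "snow tire", "snow  tire", "some snowtires",
          "snow tire and regular tire"]

for number, line in grep_all(sample, [r"tire"], [r"snow"]):
    print("[%d]: %s" % (number, line))
```

The last sample line is rejected even though it contains a plain "tire", which is exactly the same-line limitation mentioned above.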

However if that is an issue you can:

find and replace all words SNOW with SPECIAL_STRING in all files
do what you have to do…
turn all SPECIAL_STRINGs back into SNOW in all files

That can be done rather easily. You will have lost the case sensitivity
in the word SNOW, but you can get around that by making your
SPECIAL_STRING something like XsXnXoXwX based on the original case
values of snow. I hope that made sense.

Well I better get to bed, you made my night interesting!

winterheat · January 26, 2008, 10:55am

to add to the test cases, the regular expression must be able to grep

snowbird tire
tired on a snow day
snow tire and regular tire

winterheat · January 26, 2008, 11:50am

Summercool:

to add to the test cases, the regular expression must be able to grep
snow tire and regular tire

I presume there only the second tire has to be found.

This is my first try:

text = """
tire
word tire word
word retire word
word tired word
snowbird tire word
tired on a snow day word
snow tire and regular tire word
word snow tire word
word snow tire word
word some snowtires word
"""

import re

def finder(text):
    # Capture the word (if any) in front of "tire", then reject the
    # match in Python when that word ends with "snow".
    patt = re.compile(r"\b (\w*) \s* (tire)", re.VERBOSE)
    for mo in patt.finditer(text):
        if not mo.group(1).endswith("snow"):
            yield mo.start(2)

for start in finder(text):
    print(start)

The (lazy) output is the starting point of the “tire” that match:

1
11
28
43
63
73
120

Bye,
bearophile

winterheat · January 26, 2008, 12:35pm

On Jan 26, 1:16 am, Summercool [email protected] wrote:

snow tire
snow tire
some snowtires

need to do it in one regular expression

Try the answer here:
[Tutor] Regex [negative lookbehind / use HTMLParser to parse HTML]

winterheat · January 26, 2008, 12:56pm

Paddy:

Try the answer here:
[Tutor] Regex [negative lookbehind / use HTMLParser to parse HTML]

But in the OP problem there can be variable-sized spaces in the
middle…

Bye,
bearophile

winterheat · January 26, 2008, 10:40pm

[A complimentary Cc of this posting was sent to
Summercool
[email protected]], who wrote in article
[email protected]:

snow tire
some snowtires

This does not describe the problem completely. What about

thisnow tire
snow; tire

etc? Anyway, one of the obvious modifications of

(^ | \b(?!snow) \w+ ) \W* tire

should work.

Hope this helps,
Ilya
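To check which of the thread's phrases Ilya's pattern accepts as posted, it can be run in Python 3 with `re.VERBOSE` standing in for the `/x` flag the spacing implies. The choice of Python and the helper name `ilya_matches` are assumptions made for illustration.

```python
import re

# Ilya's pattern as posted. VERBOSE makes the spaces insignificant.
ILYA = re.compile(r"(^ | \b(?!snow) \w+ ) \W* tire", re.VERBOSE)

def ilya_matches(phrase):
    """Return True if the pattern matches anywhere in the phrase."""
    return ILYA.search(phrase) is not None

for phrase in ["winter tire", "tire", "retire", "tired",
               "snow tire", "snowtire", "some snowtires",
               "snowbird tire", "snow tire and regular tire",
               "thisnow tire", "snow; tire"]:
    print("%-28s %s" % (phrase, ilya_matches(phrase)))
```

As written, the look-ahead rejects any word that merely begins with "snow", so "snowbird tire" is not matched. That is presumably the kind of case the suggested modifications would need to address.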

winterheat · January 28, 2008, 9:15pm

Greg Bacon schreef:

my @tests = (
    [ "some snowtires" => NO_MATCH ],
);
[…]

I negated the test, to make the regex simpler:

my $snow_tire = qr/
    snow [[:blank:]]* tire (?!.*tire)
/x;

my $fail;
for (@tests) {
    my($str,$want) = @$_;
    my $got = $str !~ /$snow_tire/;
    my $pass = !!$want == !!$got;

    print "$str: ", ($pass ? "PASS" : "FAIL"), "\n";

    ++$fail unless $pass;
}

print "\n", (!$fail ? "PASS" : "FAIL"), "\n";

__END__

–
Affijn, Ruud

“Gewoon is een tijger.”

winterheat · January 28, 2008, 7:56pm

The code below at least passes your tests.

Hope it helps,
Greg

#! /usr/bin/perl

use warnings;
use strict;

use constant {
    MATCH    => 1,
    NO_MATCH => 0,
};

my @tests = (
    [ "winter tire", => MATCH ],
    [ "tire", => MATCH ],
    [ "retire", => MATCH ],
    [ "tired", => MATCH ],
    [ "snowbird tire", => MATCH ],
    [ "tired on a snow day", => MATCH ],
    [ "snow tire and regular tire", => MATCH ],
    [ " tire" => MATCH ],
    [ "snow tire" => NO_MATCH ],
    [ "snow tire" => NO_MATCH ],
    [ "some snowtires" => NO_MATCH ],
);

my $not_snow_tire = qr/
    ^ \s* tire |
    ([^w\s]|[^o]w|[^n]ow|[^s]now)\s*tire
/xi;

my $fail;
for (@tests) {
    my($str,$want) = @$_;
    my $got = $str =~ /$not_snow_tire/;
    my $pass = !!$want == !!$got;

    print "$str: ", ($pass ? "PASS" : "FAIL"), "\n";

    ++$fail unless $pass;
}

print "\n", (!$fail ? "PASS" : "FAIL"), "\n";

__END__
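On the original question of whether a 20-character word means writing the negation 20 times by hand: the alternation in Greg's pattern is regular enough to generate. Below is a rough Python 3 sketch that builds the same style of pattern for any word. The function name `not_preceded_by` is made up for illustration and is not part of Greg's post.

```python
import re

def not_preceded_by(word, target):
    """Build a pattern matching `target` unless `word` (plus optional
    whitespace) comes right before it, in the style of Greg's regex."""
    alternatives = []
    for i in range(len(word) - 1, -1, -1):
        ch = re.escape(word[i])
        rest = re.escape(word[i + 1:])
        if rest:
            # e.g. [^n]ow : the tail "ow" is present but the "n" is not
            alternatives.append("[^%s]%s" % (ch, rest))
        else:
            # last letter: also exclude whitespace so \s* cannot skip it
            alternatives.append(r"[^%s\s]" % ch)
    tgt = re.escape(target)
    return r"^\s*%s|(?:%s)\s*%s" % (tgt, "|".join(alternatives), tgt)

pattern = re.compile(not_preceded_by("snow", "tire"), re.IGNORECASE)

for phrase in ["winter tire", "retire", "snowbird tire",
               "snow tire and regular tire", "snow tire", "some snowtires"]:
    print("%-28s %s" % (phrase, bool(pattern.search(phrase))))
```

The generated pattern inherits the edge cases of the hand-written one. For instance, a line that begins with a tail of the word, such as "now tire", is not matched because every alternative needs one character in front of that tail.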

winterheat · January 28, 2008, 10:41pm

On Jan 25, 7:16 pm, Summercool [email protected] wrote:

snowtire

Too bad pyparsing’s not an option. Here’s what it would look like:

data = """
Match:

winter tire
tire
retire
tired

But not match:

snow tire
snow tire
some snowtires

snowbird tire
tired on a snow day
snow tire and regular tire

"""

from pyparsing import CaselessLiteral,Literal,line

# caseless wasn't really necessary but you never know
# when you'll run into a "Snow tire"
snow = CaselessLiteral("snow")
tire = Literal("tire")
tire.ignore(snow + tire)

for matchTokens,matchStart,matchEnd in tire.scanString(data):
    print(line(matchStart, data))

Prints:

winter tire
tire
retire
tired
snowbird tire
tired on a snow day
snow tire and regular tire

– Paul

winterheat · January 29, 2008, 6:15pm

In article [email protected],
Dr.Ruud [email protected] wrote:

: I negated the test, to make the regex simpler: […]

Yes, your approach is simpler. I assumed from the “need it all
in one pattern” constraint that the OP is feeding the regular
expression to some other program that is looking for matches.

I dunno. Maybe it was the familiar compulsion with Perl to
attempt to cram everything into a single pattern.

Greg

winterheat · January 30, 2008, 12:35am

Since Ruby does not have a negative look behind operator, I just used
the negative look ahead in a backwards way, et voilà!

puts a.reverse.gsub(/erit(?!.*wons)/, '>>>\&<<<').reverse
somebody who is a regular expression guru… how do you negate a word
and grep for all words that is

<<<tire>>>

but not

snow tire

or

snowtire

so for example, it will grep for

winter <<<tire>>>
<<<tire>>>
re<<<tire>>>
<<<tire>>>d

but will not grep for

snow tire
snow tire
some snowtires

need to do it in one regular expression
=> nil
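The reverse-the-string trick carries over to other engines whose look-behind must be fixed width. Here is a small Python 3 sketch of the same idea, with two deliberate differences from the Ruby one-liner that should be read as assumptions: it uses `\s*` instead of `.*`, so only "snow" directly before the whitespace counts, and it marks matches with square brackets. The name `mark_tire` is hypothetical.

```python
import re

def mark_tire(text):
    """Bracket every 'tire' not preceded by 'snow' plus optional whitespace."""
    reversed_text = text[::-1]
    # In the reversed text a variable-width look-behind becomes an
    # ordinary look-ahead: "erit" must not be followed by \s* "wons".
    marked = re.sub(r"erit(?!\s*wons)",
                    lambda m: "]" + m.group() + "[",
                    reversed_text)
    return marked[::-1]

for phrase in ["winter tire", "retire", "snow   tire",
               "some snowtires", "snow tire and regular tire"]:
    print(mark_tire(phrase))
```

With `.*` as in the Ruby version, any "tire" that has "snow" somewhere earlier on the same line is skipped, so "snowbird tire" would not be marked. The `\s*` variant marks it.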

winterheat · January 30, 2008, 2:33am

I think I have a solution that matches the OP’s request

tests = ["winter tire", "tire", "retire", "tired", "snowbird tire",
         "tired on a snow day", "snow tire and regular tire", " tire",
         "snow tire", "snow tire", "some snowtires"]
m,nm = tests.partition{ |str| str =~ /\A(?>snow *tire|.)*tire/ }
p m
=> ["winter tire", "tire", "retire", "tired", "snowbird tire", "tired on a snow day", "snow tire and regular tire", " tire"]
p nm
=> ["snow tire", "snow tire", "some snowtires"]

How is that?

Daniel
