Regular expression match and exclude

I am parsing a web page full of image links that also contain links to
the thumbnails for those images.

Here is my test data…

galleries/image1.jpg
galleries/image1thumb.jpg
galleries/image2.jpg
galleries/image2thumb.jpg
galleries/image3.jpg
galleries/image3thumb.jpg

If I use this expression…
/galleries.*(?=thumb).*jpg/

The results are just those lines containing the word thumb.

What I want to do is invert this, though, and return only the lines that
DON'T contain the word thumb.

On Sun, Sep 7, 2008 at 3:09 PM, Azalar — [email protected] wrote:

galleries/image3thumb.jpg

If I use this expression…
/galleries.*(?=thumb).*jpg/

The results are just those lines containing the word thumb.

What I want to do is invert this, though, and return only the lines that
DON'T contain the word thumb.

If you have all the links in an array, I would do the opposite: match
the ones that contain thumb and reject those from the array.

irb(main):001:0> links = %w{galleries/image1.jpg
galleries/image1thumb.jpg galleries/image2.jpg
galleries/image2thumb.jpg galleries/image3.jpg
galleries/image3thumb.jpg}
=> ["galleries/image1.jpg", "galleries/image1thumb.jpg",
"galleries/image2.jpg", "galleries/image2thumb.jpg",
"galleries/image3.jpg", "galleries/image3thumb.jpg"]

irb(main):007:0> links.reject {|l| l =~ /thumb/}
=> ["galleries/image1.jpg", "galleries/image2.jpg",
"galleries/image3.jpg"]

You can be as specific as you need with the regexp:

irb(main):008:0> links.reject {|l| l =~ /galleries\/.*thumb.*jpg/}
=> ["galleries/image1.jpg", "galleries/image2.jpg",
"galleries/image3.jpg"]
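
If the thumbnails are needed later as well, one pass over the array can separate the two groups instead of throwing one away. This is only a minimal sketch, assuming every thumbnail name ends in thumb.jpg; the variable names are made up for illustration.

```ruby
links = %w{galleries/image1.jpg galleries/image1thumb.jpg
           galleries/image2.jpg galleries/image2thumb.jpg}

# partition returns two arrays: the elements for which the block is
# true, followed by all the others
thumbs, full_size = links.partition { |l| l =~ /thumb\.jpg\z/ }

p full_size
p thumbs
```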

Hope this helps,

Jesus.

The array is created when I use the scan method, so if I used reject it
would become a two-pass process, which is what I am doing anyway. I was
wondering if regular expressions had built-in support for this.

2008/9/7 Azalar — [email protected]:

The array is created when I use the scan method, so if I used reject it
would become a two-pass process, which is what I am doing anyway. I was
wondering if regular expressions had built-in support for this.

Will be hard. It’s easier if you do

%r{galleries/image\d+\.jpg\z}

Kind regards

robert

I should mention that image isn’t a fixed word in this case; I just used
that as an example.
It represents whatever the name of the image is.

So the test data could be…

galleries/blah.jpg
galleries/blahthumb.jpg
galleries/landscape.jpg
galleries/landscapethumb.jpg
galleries/foo.jpg
galleries/foothumb.jpg

-------- Original Message --------

Date: Mon, 8 Sep 2008 02:46:12 +0900
From: Azalar — [email protected]
To: [email protected]
Subject: Re: regular expression match and exclude

Dear Azalar,

to search for a text pattern that is not followed by something else,
you need regular expressions with negative lookahead.

As far as I know, this is not supported by Ruby’s 1.8.x regexps, but
there is a special library, Oniguruma,
with Ruby bindings available for this.

http://oniguruma.rubyforge.org/

Best regards,

Axel

2008/9/7 Axel E. [email protected]:

to achieve searching for some text pattern that is not followed by something else, you need
regular expressions with negative lookahead

Regex Tutorial - Lookahead and Lookbehind Zero-Length Assertions

As far as I know, this is not supported by Ruby’s 1.8.x regexps,

Negative lookahead is supported by that (and earlier) version.

The problem with negative lookahead is that, with the test data given,
it is difficult to get right:

irb(main):016:0> %w{foo.jpg foothumb.jpg}.each do |s|
irb(main):017:1* p [s, /\A\w+(?!thumb)\.jpg\z/ =~ s]
irb(main):018:1> end
["foo.jpg", 0]
["foothumb.jpg", 0]
=> ["foo.jpg", "foothumb.jpg"]

I am not saying that it won’t work but off the top of my head I do not
have a solution that works. In any case, it’s easier to exclude
matches outside the regular expression engine. One way would be to
make thumb a capturing group and check whether the group is present,
like

irb(main):022:0> %w{foo.jpg foothumb.jpg}.each do |s|
irb(main):023:1* p [s, /\A\w+?(thumb)?\.jpg\z/ =~ s, $1]
irb(main):024:1> end
["foo.jpg", 0, nil]
["foothumb.jpg", 0, "thumb"]
=> ["foo.jpg", "foothumb.jpg"]

and use that criterion for exclusion.
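
The exclusion step could look roughly like this. It is only a sketch of the capturing-group idea, assuming bare file names as in the irb session; the extra sample names and the variable names are invented for illustration.

```ruby
names = %w{foo.jpg foothumb.jpg landscape.jpg landscapethumb.jpg}

pattern = /\A\w+?(thumb)?\.jpg\z/

# keep a name only if the pattern matches and the optional (thumb)
# group did not take part in the match
full_size = names.select do |s|
  m = pattern.match(s)
  m && m[1].nil?
end

p full_size
```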

Now you can meditate on why the negative lookahead did not work. :-)
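
For anyone who wants to check their answer: \w+ consumes the whole name including "thumb", so the lookahead is tested right in front of ".jpg", where "thumb" never follows. The sketch below contrasts it with a negative lookbehind, which looks at the characters before ".jpg" instead. Note the assumption: lookbehind needs the Oniguruma-based engine (Ruby 1.9 and later), it is not available in the stock 1.8 regexps.

```ruby
names = %w{foo.jpg foothumb.jpg}

# tested after \w+ has eaten the name: what follows is always ".jpg"
lookahead  = /\A\w+(?!thumb)\.jpg\z/
# tested at the same spot, but against the five characters before it
lookbehind = /\A\w+(?<!thumb)\.jpg\z/

results = names.map { |s| [s, lookahead =~ s, lookbehind =~ s] }
results.each { |r| p r }
```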

Kind regards

robert

Maybe you can write the regex pattern like this:

irb(main):022:0> imgs.each do |s|
irb(main):023:1* p [s, /^galleries\/((?!thumb).)+\.jpg$/ =~ s]
irb(main):024:1> end
["galleries/blah.jpg", 0]
["galleries/blahthumb.jpg", nil]
["galleries/landscape.jpg", 0]
["galleries/landscapethumb.jpg", nil]
["galleries/foo.jpg", 0]
["galleries/foothumb.jpg", nil]
=> ["galleries/blah.jpg", "galleries/blahthumb.jpg",
"galleries/landscape.jpg", "galleries/landscapethumb.jpg",
"galleries/foo.jpg", "galleries/foothumb.jpg"]

From: Azalar — [mailto:[email protected]]

The array is created when i use the scan method so if i used reject it
would become a two pass process which is what i am doing anyway so I was
wondering if regular expressions had built in support for this.

Well, if it’s thumb for thumb’s sake, we can be stubborn about it :-)

irb(main):036:0> re2
=> /galleries.*?([^t][^h][^u][^m][^b]).jpg/
irb(main):038:0> g.select{|x| x=~re2 }
=> ["galleries/image1.jpg", "galleries/image2.jpg",
"galleries/image3.jpg"]

kind regards -botp

On 9/7/08, Azalar — [email protected] wrote:

galleries/image3thumb.jpg

If I use this expression…
/galleries.*(?=thumb).*jpg/

The results are just those lines containing the word thumb.

What I want to do is invert this, though, and return only the lines that
DON'T contain the word thumb.

Using Reg, my DSL for declarative programming in ruby, you can use the
logical operators (& | ~) to combine regexps. I call this ‘match
arithmetic’. The resulting objects implement === and =~, so they can
be used with Array#grep.

require 'rubygems'
require 'reg'
images = ["galleries/foo.jpg", "galleries/foothumb.jpg"]
images.grep(/galleries\/.*\.jpg$/ & ~/thumb\.jpg$/)
# in galleries and not thumb

There might well be a pure regexp way, but I’ve been unable to come up
with a clean one…

2008/9/8 Caleb C. [email protected]:

/^galleries.*([^t]humb|[^h]umb|[^u]mb|[^m]b|[^b])\.jpg$/ #bleah
There. That’s a pretty nice pure regexp.
Now just make the group non-capturing and you’re a tad more efficient.
:-)

%r{\Agalleries/(?:(?!thumb\.jpg\z).)+\.jpg\z}

Cheers

robert

On 9/7/08, Peña, Botp [email protected] wrote:

irb(main):036:0> re2
=> /galleries.*?([^t][^h][^u][^m][^b]).jpg/
irb(main):038:0> g.select{|x| x=~re2 }
=> ["galleries/image1.jpg", "galleries/image2.jpg", "galleries/image3.jpg"]

Unfortunately this way will fail to match (for instance) "bomb.jpg".
If you want to go this route, you need to do something like this:
(untested!)

/^galleries.*([^t]humb|[^h]umb|[^u]mb|[^m]b|[^b])\.jpg$/ #bleah
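
A quick way to try that pattern out, written here with the dot before jpg escaped. This is just a sketch against three made-up names, including the "bomb.jpg" case mentioned above; it is not an exhaustive test.

```ruby
re = /^galleries.*([^t]humb|[^h]umb|[^u]mb|[^m]b|[^b])\.jpg$/

names = %w{galleries/foo.jpg galleries/foothumb.jpg galleries/bomb.jpg}

# each alternative accepts a name ending that stops being "thumb"
# at a different character, counted from the right
results = names.map { |s| [s, re =~ s] }
results.each { |r| p r }
```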

Patrick He wrote:

/^galleries\/((?!thumb).)+\.jpg$/

Ah, brilliant! But it doesn’t quite work. It fails to match
"thumb_foo.jpg", which probably should match here. A simple
modification should fix it, though:

/^galleries\/((?!thumb\.jpg$).)+\.jpg$/

There. That’s a pretty nice pure regexp.
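
Since the links come out of String#scan in the first place, the same lookahead idea can go straight into the scan pattern, so no second pass is needed. A rough sketch only: the sample markup is invented, and the [^"\s] class assumes each link sits inside double quotes on the page. The ^ and $ anchors are dropped because the links are embedded in other text.

```ruby
page = <<HTML
<a href="galleries/blah.jpg"><img src="galleries/blahthumb.jpg"></a>
<a href="galleries/foo.jpg"><img src="galleries/foothumb.jpg"></a>
HTML

# the group is non-capturing, so scan returns the whole matches
# rather than the contents of the group
full_size = page.scan(%r{galleries/(?:(?!thumb\.jpg)[^"\s])+\.jpg})

p full_size
```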

2008/9/8 Patrick He [email protected]:

["galleries/foothumb.jpg", nil]
=> ["galleries/blah.jpg", "galleries/blahthumb.jpg",
"galleries/landscape.jpg", "galleries/landscapethumb.jpg",
"galleries/foo.jpg", "galleries/foothumb.jpg"]

Good idea! But

irb(main):001:0> /^galleries\/((?!thumb).)+\.jpg$/ =~
"galleries/athumbb.jpg"
=> nil

And this should not be the case. I believe this one is better:

irb(main):002:0> /^galleries\/((?!thumb\.jpg).)+\.jpg$/ =~
"galleries/athumbb.jpg"
=> 0

Also, you do not need the capturing group, so

irb(main):003:0> /^galleries\/(?:(?!thumb\.jpg).)+\.jpg$/ =~
"galleries/athumbb.jpg"
=> 0

Or even

irb(main):004:0> /^galleries\/(?:(?!thumb\.).)+\.jpg$/ =~
"galleries/athumbb.jpg"
=> 0
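
To see how the variants differ, they can be run side by side over a few names. Only a sketch: the labels are mine, and the list mixes names from this thread with "thumb_foo.jpg" from Caleb's reply.

```ruby
names = %w{galleries/foo.jpg galleries/foothumb.jpg
           galleries/athumbb.jpg galleries/thumb_foo.jpg}

patterns = {
  "no thumb anywhere"  => %r{^galleries/(?:(?!thumb).)+\.jpg$},
  "no thumb.jpg"       => %r{^galleries/(?:(?!thumb\.jpg).)+\.jpg$},
  "no thumb plus dot"  => %r{^galleries/(?:(?!thumb\.).)+\.jpg$}
}

# print, for each pattern, the names it accepts
patterns.each do |label, re|
  matched = names.select { |s| s =~ re }
  puts "#{label}: #{matched.join(', ')}"
end
```

The first pattern rejects any name with "thumb" in it, while the other two only reject names that end in "thumb.jpg".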

Kind regards

robert