Non-greediness in a regex - need some help verifying syntax

All,

Need some medium-level regex help.

Here’s my regex: /~\^LNK:[\t\r\n]+?\^~/m

I’m trying to find all substrings of my big string that lie between the
~^LNK: and ^~ delimiter sequences and that contain at least one tab,
carriage return, or newline character between those two delimiters. I use
the multiline option so that I can match on the newlines.

What I’m seeing is that the string consumed by this regex spans many,
many ~^LNK: ... ^~ pairs, so I am removing a bunch of tabs, newlines, etc.
that I don’t want to remove.

I understand the concept of greediness in regexes, so I put the ? after
the [\t\r\n] sequence.

Why is the match spanning so many pairs of the delimiter sequences? Why
doesn’t the regex engine stop attempting to match when it sees that first
^~ after the ~^LNK:?

Any help is appreciated.

Thanks,
Wes

On 8/3/06, Wes G. [email protected] wrote:

> All,
>
> Need some medium-level regex help.
>
> Here’s my regex: /~\^LNK:[\t\r\n]+?\^~/m

Hmmm, I fail to reproduce the problem. Is there nothing missing between
“[\t\r\n]+?” and “\^”?
If so, the missing piece is probably what’s consuming all your data.

> I’m trying to find all occurrences of strings in my big string that are
> between ~^LNK: and ^~ sequences of characters that have at least one tab,
> carriage return, or newline character between those two characters.

And something else, as I said above, no?

Robert


Two things are infinite: the universe and human stupidity; and as for
the universe, I have not yet acquired absolute certainty.

  • Albert Einstein

I realized I made an error when I did the original post.

Now my problem is that it won’t find any of these occurrences.

So @bigstring.scan(/~\^LNK:[\t\r\n]+?\^~/m) isn’t returning anything.
My guess is that this is because no tab, carriage return, or newline
character occurs immediately after ~^LNK:.

If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
correct?

Wes

I’m not exactly an expert on regexes, to say the least, but I
think .*? always matches an empty string and is therefore useless. I
would try something like

@bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)

I have not tested this.

Regards, Morton

Morton G. wrote:

> I’m not exactly an expert on regexes, to say the least, but I think .*?
> always matches an empty string and is therefore useless. I would try
> something like
>
> @bigstring.scan(/~\^LNK:[^\t\r\n]*[\t\r\n]+?[^\t\r\n]*\^~/m)
>
> I have not tested this.
>
> Regards, Morton

No, that’s not true. When something follows it, .*? will match whatever
it needs to in order to get to the next item:

irb(main):001:0> "asidjoaisdj".match(/.*?j/)[0]
=> "asidj"
irb(main):002:0> "asidjoaisdj".match(/.*?d/)[0]
=> "asid"
irb(main):003:0> "asidjoaisdj".match(/.*?sdj/)[0]
=> "asidjoaisdj"

Therefore, it should work as Wes expects.

-Justin
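
To make the greedy versus lazy distinction in the exchange above concrete,
here is a minimal Ruby sketch using the same test string as the irb session.
The variable names are only illustrative. It also shows the one case where
Morton's point holds: a lazy .*? with nothing after it settles for the empty
string.

```ruby
s = "asidjoaisdj"

# Greedy: grabs as much as it can, then backs off to the LAST "j".
greedy = s[/.*j/]
# Lazy: grows one character at a time and stops at the FIRST "j".
lazy = s[/.*?j/]
# Lazy with nothing following it: the shortest possible match is empty.
empty = s[/.*?/]

p greedy  # => "asidjoaisdj"
p lazy    # => "asidj"
p empty   # => ""
```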

Daniel,

Currently, I have this working using the .*? to match everything since I
am just passing the results into a block that then does a gsub on the
offending characters. Slightly inefficient, but as you pointed out,
much more readable.
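
For anyone reading along, one way that approach could look is sketched
below. This is an assumption about the shape of the code, not the actual
code in use: the method name clean_lnk_spans and the sample string are made
up, and gsub with a block is used here so that only the text inside each
delimiter pair is touched.

```ruby
# Hypothetical helper: removes tabs, carriage returns and newlines, but only
# inside each ~^LNK: ... ^~ span. Text outside the spans is left alone.
def clean_lnk_spans(text)
  # /m lets "." match newlines; the lazy .*? stops at the first ^~.
  text.gsub(/~\^LNK:.*?\^~/m) { |span| span.gsub(/[\t\r\n]/, "") }
end

sample = "keep\tthis\n~^LNK:fo\to\n^~\nand\tthis"
p clean_lnk_spans(sample)
# => "keep\tthis\n~^LNK:foo^~\nand\tthis"
```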

Thanks for the thorough regex analysis though.

Wes

Wes G. [email protected] writes:

> If I do /~\^LNK:.*?[\t\r\n]+?.*?\^~/m - that should pick up what I want,
> correct?

Almost.

The problem is that with this text:

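a = "~^LNK:foo^~\n\n~^LNK:bar^~"

You get a match of the whole text:

irb(main):009:0> a.scan(/~\^LNK:.*?[\t\r\n]+?.*?\^~/m)
=> ["~^LNK:foo^~\n\n~^LNK:bar^~"]

Where you obviously wanted to get no matches, since neither pair contains
a tab, carriage return, or newline; the first .*? simply runs past the
first ^~ to reach the newlines between the pairs.

So, here’s what I suggest:

/~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m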

Read that as:

‘~^LNK:’ followed by zero or more of:
    Some character that isn’t \t, \r, \n, or ‘^’, OR
    A ‘^’ character that isn’t followed by a ‘~’
Then a \t, \r, or \n character.
Then whatever is the minimum other characters necessary to get to ^~.
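
A quick way to check that reading is to run the pattern against a few
strings. The sketch below assumes the ^ delimiters are escaped as \^ in the
regex; the method name and the sample strings are made up for illustration.

```ruby
# Hypothetical helper wrapping the suggested pattern.
def lnk_spans_with_whitespace(text)
  text.scan(/~\^LNK:(?:[^\t\r\n^]|\^(?!~))*[\t\r\n].*?\^~/m)
end

clean = "~^LNK:foo^~\n\n~^LNK:bar^~"       # whitespace only BETWEEN the pairs
dirty = "~^LNK:foo^~\n~^LNK:ba\tr^~ tail"  # second pair contains a tab
caret = "~^LNK:a^b\nc^~"                    # a lone "^" inside the pair is allowed

p lnk_spans_with_whitespace(clean)  # => []
p lnk_spans_with_whitespace(dirty)  # => ["~^LNK:ba\tr^~"]
p lnk_spans_with_whitespace(caret)  # => ["~^LNK:a^b\nc^~"]
```

The first case is the one that went wrong with the .*? version: here the
"none of what we want" part cannot step over ^~, so the match cannot leak
into the next pair.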

For these “containing at least one of” type problems, I often find it
useful to write the regular expression as:

begin sequence ( ~\^LNK: )
zero or more characters with none of what we want
    ( (?:[^\t\r\n^]|\^(?!~))* )
one of what we want
    ( [\t\r\n] )
.*? ( .*? )
end sequence ( \^~ )

For the related “at least n of” problem, (where n > 1), I do this:

begin sequence
(?:
    zero or more characters with none of what we want
    one of what we want
){n}
.*?
end sequence
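
As a rough illustration, the template above can be assembled in Ruby by
interpolating the pieces into one regex. The method name lnk_with_at_least
is invented for this sketch, and it assumes the same ~^LNK: and ^~
delimiters as the rest of the thread.

```ruby
# Hypothetical helper: builds the "at least n of" pattern from the template.
def lnk_with_at_least(n)
  count   = Integer(n)
  none_of = /(?:[^\t\r\n^]|\^(?!~))*/  # zero or more chars with none of what we want
  one_of  = /[\t\r\n]/                 # one of what we want
  /~\^LNK:(?:#{none_of}#{one_of}){#{count}}.*?\^~/m
end

two  = lnk_with_at_least(2)
text = "~^LNK:a\tb^~ ~^LNK:a\tb\nc^~"  # first pair has one hit, second has two

p text.scan(two)  # => ["~^LNK:a\tb\nc^~"]
```

Note that each tab, carriage return, or newline counts separately here, so
a single "\r\n" line ending already satisfies n = 2.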

The only tricky part is inside the “none of what we want” chunk, where
you have to take care that the “none of what we want” chunk can’t
swallow up your end sequence. (Depending on what you want and what
your end sequence is, you also need to be careful that the “one of
what we want” part can’t swallow part of your end sequence.)

Sometimes it’s easier to just write a regular expression that gets
more matches than you want, and then throw away excess matches in
code:

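lnk_regex = /~\^LNK:.*?\^~/m
text.scan(lnk_regex) { |m|
  # With no groups in the regex, scan yields each match as a String,
  # so test m itself (m[0] would only be its first character).
  next unless m =~ /[\t\r\n]/
  # ... work with m here ...
}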

That can often be more readable too. Depending on your data, however,
it may be much, much slower than using a regular expression that finds
only what you need to begin with.
