Case against ".*" regexp


#1

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(
Does anyone remember that?

Thanks a lot,
Ed

Check out “Tu psicópata favorito”: http://tuxmaniac.blogspot.com

The future is not what it used to be.
– Paul Valéry


#2

I didn’t see that, but the .* regex can easily misfire. There’s a
great book on this called “Mastering Regular Expressions” from
O’Reilly. The author’s name I think is Jeff Friedl.


Giles B.
http://www.gilesgoatboy.org


http://giles.tumblr.com/


#3

On 16.03.2007 20:58, Edgardo H. wrote:

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(
Does anyone remember that?

Not exactly. But you should not use “.*” nested in another starred
expression because that will usually lead to massive backtracking.
Another problem is that “.*” matches the empty string, which is often
not what you want.
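
To illustrate the empty-string point, here is a minimal sketch. The strings and variable names are made up for the example; only the `.*` behaviour comes from the post above.

```ruby
# `.*` means "zero or more characters", so it succeeds on an empty string.
matches_empty = "".match?(/.*/)

# A field pattern built on `.*` therefore accepts a missing value...
accepts_blank = "name=".match?(/\Aname=.*\z/)

# ...whereas `.+` insists on at least one character.
rejects_blank = !"name=".match?(/\Aname=.+\z/)

# The same effect shows up with scan: after consuming "abc", the pattern
# matches once more, with zero characters, at the end of the string.
pieces = "abc".scan(/.*/)

p matches_empty, accepts_blank, rejects_blank, pieces
```

If an empty match is not acceptable, `.+` or a more specific character class is usually the safer choice.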

Kind regards

robert


#4

On Sat, Mar 17, 2007 at 05:10:08AM +0900, Robert K. wrote:

On 16.03.2007 20:58, Edgardo H. wrote:

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(
Does anyone remember that?

Not exactly. But you should not use “.*” nested in another starred
expression because that will usually lead to massive backtracking.
Another problem is that “.*” matches the empty string, which is often
not what you want.

“.*” also eats as much of the string as it can without causing the regexp
to fail; .*? is the non-greedy form.
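
A small sketch of the difference, using a made-up sample string:

```ruby
# Hypothetical input with two quoted words.
text = 'say "hello" and "goodbye"'

# Greedy: `.*` runs to the end of the string, then backs off only as far
# as needed to find a closing quote, so it spans both quoted words.
greedy = text[/".*"/]

# Non-greedy: `.*?` stops at the first closing quote it can reach.
lazy = text[/".*?"/]

puts greedy
puts lazy
```

The greedy form returns everything from the first quote to the last one; the non-greedy form returns only the first quoted word.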


#5

Edgardo H. wrote:

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(

Did you perhaps mean this thread?
http://www.ruby-forum.com/topic/100364

regards
Jan


#6

On 3/16/07, Jan F. removed_email_address@domain.invalid wrote:

Edgardo H. wrote:

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(

Did you perhaps mean this thread?
http://www.ruby-forum.com/topic/100364

Hmm, I am not sure this one was what the OP meant. Was there not a
problem with the greedy match eating up the “<”s in XML/HTML?

If I remember correctly, OPITOP (OP in the other post ;) asked why
matching %r{(<.*>)} against a string of three tags made
Regexp.last_match.captures return a single capture spanning the whole
string instead of three captures, one per tag.
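
A sketch of that situation, assuming a hypothetical input of three adjacent tags (the sample string here is invented for illustration):

```ruby
# Hypothetical input: three adjacent tags.
html = "<a><b><c>"

# Greedy: `.*` swallows the inner ">" and "<" characters too, so the single
# capture group covers the whole string.
greedy_captures = html.match(%r{(<.*>)}).captures

# To get one result per tag, exclude ">" from the repeated part and use
# scan, which collects every non-overlapping match.
per_tag = html.scan(/<[^>]*>/)

# The non-greedy form gives the same result for this input.
per_tag_lazy = html.scan(/<.*?>/)

p greedy_captures, per_tag, per_tag_lazy
```

Note that a single match only ever yields one capture per group, so collecting several tags needs `scan` (or a loop) regardless of greediness.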

regards
Jan

HTH
Robert


#7

“Edgardo H.” removed_email_address@domain.invalid writes:

Some time ago I remember reading a post in this list against using
“.*” regexp but I cannot find it – I guess I don’t remember the right
keywords :(
Does anyone remember that?

Thanks a lot,
Ed

I don’t remember seeing anything that explicitly states that you should
not use .* - however, many people do use it when they shouldn’t. If we
assume the regexp engine is correct (and there have been some posts
regarding the correctness of Ruby’s regexp in version 1.8), there
shouldn’t be any issue with using .*, but there are some points to
consider. Many of these relate to the more general application of
regular expressions. Possibly the main issue relates to how the RE is
anchored.

If you just have a RE of .*, well, you’re pretty much matching against
the whole string and I guess you would say the match is pretty
pointless. More often you will use .* with some other constructs. In
this situation, you do need to be careful to ensure the RE is anchored
in some way. If not adequately anchored, your RE match can involve a lot
of backtracking. The .* is greedy and will attempt to match the biggest
string possible, then the next biggest, then the next after that, and so
on. If you don’t have adequate anchoring, in the worst case it will
backtrack to the very first character - for a long string, this could be
very inefficient. In fact, I remember seeing a post from someone in the
Perl group years ago who thought they’d found a bug in Perl that caused
it to go into an infinite loop. It turns out it wasn’t infinite, just a
poorly anchored RE which was taking so long to do all the backtracking
that it gave the appearance of an infinite loop - if the user had waited
long enough, the program would have terminated eventually.
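
For anyone who wants to measure this on their own setup, here is a rough sketch. The input and both patterns are invented for the example, and no particular timings are claimed: results depend on the Ruby version, and the engine may optimize some of these cases away.

```ruby
require "benchmark"

# Hypothetical input: a long line with no "=" in it, so both patterns
# below must eventually give up.
line = "x" * 1_000

loose    = /.*=.*/          # nothing ties the match to a position
anchored = /\A[^=]*=.*\z/   # starts at the beginning, stops at the first "="

loose_time    = Benchmark.realtime { 20.times { line =~ loose } }
anchored_time = Benchmark.realtime { 20.times { line =~ anchored } }

puts format("loose: %.5fs  anchored: %.5fs", loose_time, anchored_time)

# Both patterns agree on the results; only the work done may differ.
loose_miss    = (line =~ loose).nil?
anchored_miss = (line =~ anchored).nil?
loose_hit     = loose.match?("key=value")
anchored_hit  = anchored.match?("key=value")
```

The anchored pattern can only be attempted at the start of the string, and `[^=]*` cannot run past an "=", so there is far less for the engine to retry when the match fails.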

Often, people use .* rather than spending the time to analyse the real
patterns and strings they are processing. Anchoring to the beginning/end
of the string is often sufficient to drastically improve performance -
but using non-greedy modifiers where appropriate and anchoring to larger
distinct patterns in your string will help. If you know all your strings
will have certain characteristics, like a sequence of 4 digits, then
incorporate that information into your RE - put \d\d\d\d (or whatever)
rather than .* to represent that pattern. Think about ways to give the
regexp engine as many clues or as much information as possible and
you’re unlikely to get poor performance or unexpected results.
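
As a sketch of the four-digit advice, assuming a made-up log line where the interesting part is a year:

```ruby
# Hypothetical input containing two records on one line.
log = "order 2007 shipped, order 2008 shipped"

# With `.*` the capture runs on to the last " shipped" in the line.
vague = log[/order (.*) shipped/, 1]

# Saying what is actually expected (four digits) pins the match down.
precise = log[/order (\d{4}) shipped/, 1]

puts vague
puts precise
```

The vague pattern captures everything between the first "order " and the last " shipped"; the precise one captures just the first four-digit number.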

There is a very good book from O’Reilly called “Mastering Regular
Expressions” - can’t remember the author’s name, but I recommend it as a
read. It’s not a thick book and has some really interesting background
and explanation of different regexp engines/approaches.

Tim