Slow regular expressions :(

rhaus · July 27, 2006, 6:40pm

I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
print “Matching #{s}\n”;
if(s =~ /^(\sfoo\s)*$/)
print “YES!\n”;
else
print “NO!\n”;
end
}

and compare it with this perl program:
my $str1 = ("foo " x 70) . "foo ";
my $str2 = ("foo " x 70) . "fo ";

my @strings = ( $str1, $str2 );

foreach $s ( @strings ) {
print “Matching $s\n”;
if($s =~ /^(\sfoo\s)*$/) {
print “YES!\n”;
} else {
print “NO!\n”;
}
}

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

rhaus · July 27, 2006, 6:52pm

On 2006-07-27, at 13:41 , Roman H. wrote:

I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Is this being taken care of for the next release?

Ruby 1.9 uses a different regex engine, oniguruma. Not sure if it
addresses this particular problem.

rhaus · July 27, 2006, 7:36pm

On Fri, Jul 28, 2006 at 01:41:05AM +0900, Roman H. wrote:

strings.each { |s|
print “Matching #{s}\n”;
if(s =~ /^(\sfoo\s)*$/)
print “YES!\n”;
else
print “NO!\n”;
end
}

Ah, now while I’m not saying that Ruby’s regex engine is slow–it
is–I think it’s more likely here that you hit a pathological edge
case in how it works, specifically the ‘\s*’ on each side of ‘foo’.

When the code is changed to

strings.each do |s|
  print "Matching #{s}\n"
  if s =~ /^(foo\s*)*$/
    print "Yes!\n"
  else
    print "No!\n"
  end
end

The problem disappears and the Ruby and Perl versions of that code
benchmark similarly.

Patterns in the form (xyx*)* have a habit of acting pathologically and
there’s almost always a better and clearer way of writing them, mainly
because they have a habit of causing a nasty amount of backtracking.
This applies to a good number of regex engines, and not just Ruby’s.

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

My suspicion is that Ruby’s regex package doesn’t account for this
pathological case, where as Perl’s–which, to be frank, is pretty much
the best regex engine out there bar none except for its support for
Unicode properties–does.

Not a clue as to whether it’ll be taken care of, though. JARH.

K.

rhaus · July 27, 2006, 8:22pm

Caleb C. wrote:

strings = str1, str2;

imagine that you didn’t know that you’d created a backtracking
I’m afraid I’m not explaining very well why this Regexp is a
monster… maybe someone else will take a stab.

It would take too long and I’m tired. Jeffrey Friedl’s “Mastering
Regular
Expressions” is the definitive source on the subject. That being said,
I think
the OP needs a lesson on greedy vs non-greedy regular expressions.

Also note that Perl must have optimized for this sort of backtracking
after
5.005_03, because it hangs as well. I’d be curious as to their reasons
for
doing this, but I have my suspicions.

Regards,

Dan

This communication is the property of Qwest and may contain confidential
or
privileged information. Unauthorized use of this communication is
strictly
prohibited and may be unlawful. If you have received this communication
in error, please immediately notify the sender by reply e-mail and
destroy
all copies of the communication and any attachments.

rhaus · July 27, 2006, 7:43pm

On 7/27/06, Roman H. [email protected] wrote:

strings.each { |s|
print “Matching #{s}\n”;
if(s =~ /^(\sfoo\s)*$/)
print “YES!\n”;
else
print “NO!\n”;
end
}

If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn’t needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it…

Yes, perl optimizes this case, but the fact is that you’re asking for
a lot of silly extra backtracking, and that is what ruby gives you. I
imagine that you didn’t know that you’d created a backtracking
monster…

Basically, as long as everything is matching, you’re fine and matches
will be nice and zippy. But once that ‘fo’ causes a mismatch, the
engine has to go back and check every possible way that the spaces in
the string could be matched by either of the 2 whitespace matchers.
Since there are ~70 spaces in your string and each could be matched by
either \s*, there are about 2**70 possibilites to explore, total. You
can see how that might take a little time.

I’m afraid I’m not explaining very well why this Regexp is a
monster… maybe someone else will take a stab.

rhaus · July 27, 2006, 10:51pm

Roman H. [email protected] writes:

I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just a note that there are some patterns where I found (in my tests,
on a cygwin environment I can’t recreate anymore) that ruby beat perl.

http://snowplow.org/martin/rebench/

I should note that the people at perlmonks.org disbelieve my results,
or rather, believe that my results are being skewed by some strange
cygwin issue. I haven’t gotten off my butt and redone the same tests
on linux.

rhaus · July 28, 2006, 12:13am

Caleb C. wrote:

If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn’t needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it…

No. I am not looking for help on regular expressions here and I know
that this particular expression could be optimized. I also know that
some regular expressions with backtracking can get exponentially slow.

However, regular expression can get generated automatically or there
could be other reasons why an optimization a la perl is helpful.

The example simply points out that ruby uses a strategy that does not do
the optimization that is present in perl. My question was, whether Ruby
is going to fix this.

If you can answer that question, I would apreciate it. If not, maybe
somebody
else can.

rhaus · July 28, 2006, 2:17am

Roman H. wrote:

Caleb C. wrote:

If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn’t needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it…

No.

I.e., “No, I won’t take out the first \s* (or move it outside the
parens). No, I won’t tune the regular expression. No, this isn’t
a case of pathological backtracking. No, I won’t stop holding
my breath until I turn blue.”

I am not looking for help on regular expressions here and I know
that this particular expression could be optimized.

No. This regular expression has been pessimized.

              I also know that
some regular expressions with backtracking can get exponentially slow.

However, regular expression can get generated automatically

Stop generating crappy regular expressions.

                   or there
could be other reasons why an optimization a la perl is helpful.

The example simply points out that ruby uses a strategy that does not do
the optimization that is present in perl. My question was, whether Ruby
is going to fix this.

The arrogance. He refuses to listen to any advice about cleaning
up his crap and then demands to know when someone is going
to “fix” Ruby.

I hope that Ruby doesn’t start pandering to those who obdurately
refuse to clean up their feculent waste.

rhaus · July 28, 2006, 2:20am

Daniel B. wrote:

Also note that Perl must have optimized for this sort of backtracking after
5.005_03, because it hangs as well. I’d be curious as to their reasons for
doing this, but I have my suspicions.

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00585.html

http://perlmonks.org/index.pl?node_id=502408

rhaus · July 28, 2006, 2:57am

Gene T. wrote:

Daniel B. wrote:

Also note that Perl must have optimized for this sort of backtracking after
5.005_03, because it hangs as well. I’d be curious as to their reasons for
doing this, but I have my suspicions.

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00585.html

http://perlmonks.org/index.pl?node_id=502408

(give credt where credit due) this blog pointed to Ilya Z explanation

http://cwilliams.textdriven.com/articles/2005/11/21/ruby-can-learn-from-perl

rhaus · July 27, 2006, 9:15pm

Roman H. wrote:

my $str1 = ("foo " x 70) . "foo ";
}
}

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

–
Posted via http://www.ruby-forum.com/.

Here’s the way to do it.

strings = " foo " * 70 + "foo ",
"foo " * 70 + "foo ",
" foo " * 70 + "fo ",
“”

strings.each { |s|
puts “Matching #{s}”
if s =~ /^ \s* # May start with whitespace.
# The rest of the string must be 0 or more
# occurrences of “foo” plus whitespace.
(?: foo \s+ ) *
$/x
puts “YES!”
else
puts “NO!”
end
}

rhaus · July 28, 2006, 5:02am

“William J.” [email protected] wrote in message
news:[email protected]…

No.

I.e., “No, I won’t take out the first \s* (or move it outside the
parens). No, I won’t tune the regular expression. No, this isn’t
a case of pathological backtracking. No, I won’t stop holding
my breath until I turn blue.”

This attitude is noted, below...

However, regular expression can get generated automatically

Stop generating crappy regular expressions.

This is easier said than done.  It can be very hard to write code

that
can specifically avoid generating pathological regexes, especially if
you
don’t understand how regexes work. Indeed, it would be nice if the
regexp
engine can do this for you. After all, it is the job of the programming
language to make programming easier for us, is it not?

I hope that Ruby doesn’t start pandering to those who obdurately
refuse to clean up their feculent waste.

While Roman H. obviously created this thread with a somewhat

beligerent attitude, it is rarely wise to reply in kind.
There is much merit in his suggestion to include this PERL regexp
optimisation feature in Ruby…

rhaus · July 28, 2006, 7:57am

“Bill K.” [email protected] wrote in message
news:003901c6b202$d8a10db0$6442a8c0@musicbox…

don’t understand how regexes work. Indeed, it would be nice if the

the solution doesn’t cover all regexps, so some regexps
can still produce exponential backtracking.

?

The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn’t seem 100% clear cut to me.

It is not clear cut to me, either, and yes, it is worth noting that

there are performance trade-offs. However, the poster to whom I was
responding dismissed it, out of hand, by blaming the programmer, much
like
the 60’s auto industry blaming accidents on drivers as an excuse for not
bothering with safety features…

If the performance hit is great and the regex pattern uncommon, it

may
very well not be worth it. This is, of course, debatable and that’s my
point. Personally, I don’t use Ruby to write fast programs, I use it to
write (correct) programs fast.
At the very least, perhaps the PERL Regexp engine can be written as
a
Ruby extension and used as another Regexp class that we “require” in our
code when we feel it’s more appropriate? I’m a big believer in the Best
of
Both Worlds solution…

rhaus · July 28, 2006, 7:03am

From: “Just Another Victim of the Ambient M.”
[email protected]

engine can do this for you. After all, it is the job of the programming
language to make programming easier for us, is it not?

Is it worth noting that the article at
http://perlmonks.org/index.pl?node_id=502408
indicates:

a performance penalty is incurred for keeping track of the
“have we been here before” state;
the solution doesn’t cover all regexps, so some regexps
can still produce exponential backtracking.

?

The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn’t seem 100% clear cut to me.

Regards,

Bill

rhaus · July 28, 2006, 8:40am

Bill K. wrote:

engine can do this for you. After all, it is the job of the programming
language to make programming easier for us, is it not?

Is it worth noting that the article at
http://perlmonks.org/index.pl?node_id=502408
indicates:

a performance penalty is incurred for keeping track of the
“have we been here before” state;

the solution doesn’t cover all regexps, so some regexps
can still produce exponential backtracking.

?

The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn’t seem 100% clear cut to me.

The important point is that the performance penalty is linear: those
matches that get slower will get slower by a constant factor. But: this
penalty will prevent an exponential slowdown in many, typical cases. I
have not waited how long Ruby would have taken in the simple example
case I gave, but the slowdown was already more than 2000-fold when I
killed it – much much more than the performance penalty would ever be.

Regarding the example: yes this is an extremely easy example meant for
illustration only. However, with complex hand-crafted regular
expressions or with automatically generated ones the same effect can
arise when it is much more complicated or even impossible to find an
equivalent regular expression that does not show the exponential
slowdown.

Exponentional slowdown can be an extremely bad thing, rendering an
application practically unusable. So I would say, the trade-off is
definitely worth it. Even if it cannot prevent all pathological cases
(my feeling is that most pathological cases that occur in practice can
be prevented though).

rhaus · July 28, 2006, 9:16am

write (correct) programs fast.
At the very least, perhaps the PERL Regexp engine can be written as a
Ruby extension and used as another Regexp class that we “require” in our
code when we feel it’s more appropriate? I’m a big believer in the Best of
Both Worlds solution…
No offence, but who stops you from using Perl ?!?
Just use perl when you’re too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn’t this be “the best of both worlds”… ?

Of course you and all the other wise “haha I found one more bug in Ruby”
guys could actually follow the very good advice of reading ‘Mastering
Regular Expressions’, and stop dumping crap on this list, but hey I know
it’s not a perfect world

This is just disgusting, to take proud in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It’s a little
harder than whinning on a list, but hey, you’re smart guys, after all
you
“found where Ruby sucks” and others do a great job… and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart…

I’d say “Good luck” but you’re probably just too smart to need it
anyway…

To all the other decent readers on the list:
I appologise for my post, I may be breaking the netiquette here, but
this was just too much…
I’m a Ruby nuby too, I am not the Grand Master of regular expressions,
but I love Ruby just the way it is, and it really hurts me to see this
crap attitude about Ruby’s “limitations” coming from people not able
to understand their own limitations…

All the best to all the decent people,
Alex

rhaus · July 28, 2006, 8:58am

On Fri, Jul 28, 2006 at 02:55:05PM +0900, Just Another Victim of the
Ambient M. wrote:

At the very least, perhaps the PERL Regexp engine can be written as a

Just a quick pseudo-cultural note . . .

The language is Perl. The parser/compiler/interpreter/blah is perl.
The term PERL is something used to identify texts one shouldn’t bother
reading.

rhaus · July 28, 2006, 10:07am

Alexandru E. Ungur wrote:

No offence, but who stops you from using Perl ?!?
Just use perl when you’re too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn’t this be “the best of both worlds”… ?

Of course you and all the other wise “haha I found one more bug in Ruby”
guys could actually follow the very good advice of reading ‘Mastering
Regular Expressions’, and stop dumping crap on this list, but hey I know
it’s not a perfect world

This is just disgusting, to take proud in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It’s a little
harder than whinning on a list, but hey, you’re smart guys, after all
you
“found where Ruby sucks” and others do a great job… and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart…

I’d say “Good luck” but you’re probably just too smart to need it
anyway…
How often do you want to repeat embarrassing yourself here by showing
a total lack of understanding the actual issue here?
Why do you join into a discussion with your childish fanboyism when
you actually have no idea what you are talking about?
As I have pointed out the example is illustrative – in the general case
it is much harder and sometimes impossible to simply manually correct
a regular expression that shows exponential backtracking behavior.
It is in many cases possible however to prevent that exponential
backtracking behavior with a nifty optimization of the matching engine
(though it comes at a penalty and does not work for all cases, as has
been
pointed out).
Also, the obvious reason for not using perl in the first place is that
Ruby has a lot of very fine things to offer that many prefer over perl.
My experience with Ruby is exactly that – I love the beauty of the
design and many details about the language, but there are still flaws.
Apart from the one I pointed out here, my main other concern is the lack
of proper Unicode support – though I hear that this is being worked on
for the next release.

So could I humbly ask you to try to understand these points (and the
issues with complex regular expressions, for that matter) and either
contribute something useful finally to this thread or else go somewhere
else to live out your childishness?

rhaus · July 28, 2006, 10:41am

“Alexandru E. Ungur” [email protected] wrote in message
news:[email protected]…

sender: “Just Another Victim of the Ambient M.” date: “Fri, Jul
28, 2006 at 02:55:05PM +0900” <<<EOQ
At the very least, perhaps the PERL Regexp engine can be written as a
Ruby extension and used as another Regexp class that we “require” in our
code when we feel it’s more appropriate? I’m a big believer in the Best
of
Both Worlds solution…
No offence, but who stops you from using Perl ?!?

Your claim of not wanting to offend is questionable; see below...

Just use perl when you’re too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn’t this be “the best of both worlds”… ?

Except that, in my personal opinion, writing the rest of the script

in
Perl (does this capitalization follow protocol?) is not the better part
of
the two worlds. I just suggested that the robustness of Perl’s rexexp
engine might be beneficial. Indeed, I have never encountered this
pathalogical case but I’m glad I know of it, now!

Of course you and all the other wise “haha I found one more bug in Ruby”
guys could actually follow the very good advice of reading ‘Mastering
Regular Expressions’, and stop dumping crap on this list, but hey I know
it’s not a perfect world

Me and all the others?  I suggest you read my posts again because

you’re
lumping me into a group to which I don’t belong. All the happy faces in
the
world won’t hide your passive aggresive attitude towards honest
suggestions
and criticisms.

This is just disgusting, to take proud in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It’s a little
harder than whinning on a list, but hey, you’re smart guys, after all you
“found where Ruby sucks” and others do a great job… and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart…

Honestly, who's taking pride in anything?
The idea that not having intimate knowledge of regular expressions

is
stupid and audacious is pretentious, arrogant and elitist. Even if you
knew
what conditions cause the exponential regex growth, that doesn’t mean
you
will know how to remove them if you generated the regexes
programmatically.
It is useful for the regexp implementation to reduce this for us.

I’d say “Good luck” but you’re probably just too smart to need it
anyway…

Good luck with what?  To whom do you think you're speaking?  With

what
do you think I need luck? What do you think the issue is? In short:
what
are you talking about?

To all the other decent readers on the list:
I appologise for my post, I may be breaking the netiquette here, but
this was just too much…
I’m a Ruby nuby too, I am not the Grand Master of regular expressions,
but I love Ruby just the way it is, and it really hurts me to see this
crap attitude about Ruby’s “limitations” coming from people not able
to understand their own limitations…

Ironically, you speak of people who are "not able to understand

their
own limitations" but, perhaps, they do understand their own limitations
and
that’s why they want a more robust regexp engine to protect themselves
form
their own mistakes?

You are far too defensive.  Attempts to improve the language are not

attacks on the language. Take a look at this article, titled: How Ruby
Sucks…

http://www.rubyist.net/~matz/slides/rc2003/mgp00003.html

...and you'll never guess who wrote it.

I can tell you're a newbie, too, because most people here actually

read
what’s posted and respond with reasoned answers. In particular, they
are
not defensive about ideas, suggestions, or even criticisms of the
language.
How do you think a language improves? How do you think Ruby became as
good
as it is, now?

Here's the only insult I am going to throw.
I don't understand how a newbie can be such a fanboy...

rhaus · July 28, 2006, 12:33pm

Alexandru E. Ungur wrote:

write (correct) programs fast.
Of course you and all the other wise “haha I found one more bug in Ruby”
from all the people on the list discovered this! Gosh, you must be
I’m a Ruby nuby too, I am not the Grand Master of regular expressions,
but I love Ruby just the way it is, and it really hurts me to see this
crap attitude about Ruby’s “limitations” coming from people not able
to understand their own limitations…

What’s wrong with people wanting to improve a language? Fortran
definitely has changed over its 50+ years, why can’t Ruby or any other
language? Languages (or operating systems, or any other program)
benefit from the input of many people. No one can create a “perfect”
language on their first try. Ruby obviously wasn’t or there wouldn’t be
all the work on new versions.
I have been programming for over 25 years in Business Basic and find
that “old” language has many features that I would love to see in Ruby,
but from the looks of it probably won’t see. Some of them I could
probably write methods to duplicate but they won’t be the same and I
will continue to think first of the way I used to write a program before
trying to figure out the Ruby way. My results probably won’t be optimal
and it will take a long time to fix some of the problems.
The other posters were showing that the Ruby way of doing some things
wasn’t the best way to do them and asking why there couldn’t be changes
made to improve the current way. I see no problem in them preferring
this other method and maybe it would benefit more people than just them.