Questionable regex performance when using lazy matching and inverted character classes

tikhon_b · February 21, 2013, 12:12pm

Hi All,

I am trying to figure out a performance question with the ruby regex
matcher. Running the following code takes excessively longer than
expected.

Benchmark.measure {(“a” * 10_000).match(/a.*?b/)}
=> 1.140000 0.010000 1.150000 ( 1.195161)
Benchmark.measure {(“a” * 10_000).match(/a[^b]*b/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)

Thinking the parser was doing some unnecessary backtracking I also tried
atomic grouping, and other simpler expressions with very limited
success:

Benchmark.measure {(“a” * 10_000).match(/(?>a[^b]*b)/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)
Benchmark.measure {(“a” * 10_000).match(/aa*b/)}
=> 1.270000 0.000000 1.270000 ( 1.331137)
Benchmark.measure{val.match(/a.*b/)}
=> 0.460000 0.000000 0.460000 ( 0.487267)

Simplifying the expression further produces the expected results.

Benchmark.measure {(“a” * 10_000).match(/a*b/)}
=> 0.000000 0.000000 0.000000 ( 0.000093)

As best as I can tell this is caused by the repetition of the first
character in the expression in the string:

Benchmark.measure{(“a” + “c” * 9999).match(/ac*b/)}
=> 0.000000 0.000000 0.000000 ( 0.000372)

This may be a very simple vector for a DDOS attack against a server
using a similar expression to verify input data. What more, since valid
input data is not likely to resemble scenarios 1-6 the issue may go
undetected by a developers and testers not aware of this problem.

Conversely, other developers may be turned off from using a regex in a
place where it would perform well due to simple performance testing as
above.

This sort of performance hit is in neither the NodeJS nor Python regex
engines. Writing this message I think I got a better grasp of the cause
of the issue, but I still feel that it needs to be addressed for the
sake of security.

tikhon_b · February 21, 2013, 12:34pm

On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. [email protected]
wrote:

Hi All,

I am trying to figure out a performance question with the ruby regex
matcher. Running the following code takes excessively longer than
expected.

You are doing the measurement wrong. You do not only measure matching
but also creation of the String. Given that fact, all your results
are moot.

Benchmark.measure {(“a” * 10_000).match(/(?>a[^b]*b)/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)

Benchmark.measure {(“a” * 10_000).match(/aa*b/)}
=> 1.270000 0.000000 1.270000 ( 1.331137)

Benchmark.measure{val.match(/a.*b/)}
=> 0.460000 0.000000 0.460000 ( 0.487267)

Where does ‘val’ come from and what does it reference?

engines. Writing this message I think I got a better grasp of the cause
of the issue, but I still feel that it needs to be addressed for the
sake of security.

Before we jump to conclusions please provide the proper and complete
test. Note also that it is a common fact that NFA’s can suffer from
backtracking in certain conditions. So I am not too surprised nor
worried. I did not look closely at all your test cases but often
backtracking issues can be remedied by using greediness modifiers
(either “greedy” or “reluctant”).

Kind regards

robert

tikhon_b · February 22, 2013, 1:21am

Robert K. wrote in post #1098199:

On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. [email protected]
wrote:

Hi All,

I am trying to figure out a performance question with the ruby regex
matcher. Running the following code takes excessively longer than
expected.

You are doing the measurement wrong. You do not only measure matching
but also creation of the String. Given that fact, all your results
are moot.

I actively considered this matter when submitting the post, but in the
end I included the string creating in the test to allow anyone to try
out the example by copying one line. The time necessary for the actual
creation process is negligible; I need to run the string creation ten
thousand times before the result is substantial enough to significantly
affect the numbers I’m getting.

Benchmark.measure {10_000.times {“a” * 10_000}}
=> 0.100000 0.000000 0.100000 ( 0.105442)

Case in point:

val = “a” * 10_000
3a. Benchmark.measure {val.match(/(?>a[^b]*b)/)}
=> 1.430000 0.000000 1.430000 ( 1.490874)

4a. Benchmark.measure {val.match(/aa*b/)}
=> 1.310000 0.000000 1.310000 ( 1.366131)

Benchmark.measure {(“a” * 10_000).match(/(?>a[^b]*b)/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)

Benchmark.measure {(“a” * 10_000).match(/aa*b/)}
=> 1.270000 0.000000 1.270000 ( 1.331137)

Benchmark.measure{val.match(/a.*b/)}
=> 0.460000 0.000000 0.460000 ( 0.487267)

Where does ‘val’ come from and what does it reference?

As I just mentioned, I included the string creating as part of the
benchmark for simple reproduction. While I was figuring out the answer
myself, I assigned val = "a" * 10_000 for ease of use.

I added that final example shortly before submitting, and simply did not
notice that I would need to re-write it to

engines. Writing this message I think I got a better grasp of the cause
of the issue, but I still feel that it needs to be addressed for the
sake of security.

Before we jump to conclusions please provide the proper and complete
test. Note also that it is a common fact that NFA’s can suffer from
backtracking in certain conditions. So I am not too surprised nor
worried. I did not look closely at all your test cases but often
backtracking issues can be remedied by using greediness modifiers
(either “greedy” or “reluctant”).

Kind regards

robert

I am not sure what sort of proper and complete tests you are looking
for. Regarding the actual slowdown, the issue comes down to the engine
repeating the match from the start for every single “a” in the string,
then failing and trying again with the next letter. I do not see how a
greedy modifier can help in this situation, and I do not know enough
about the ruby regex engine to comment on the underlying design.

tikhon_b · February 22, 2013, 3:32pm

On Fri, Feb 22, 2013 at 1:21 AM, Tikhon B. [email protected] wrote:

Robert K. wrote in post #1098199:

On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. [email protected]
wrote:

You are doing the measurement wrong. You do not only measure matching
but also creation of the String. Given that fact, all your results
are moot.

I actively considered this matter when submitting the post, but in the
end I included the string creating in the test to allow anyone to try
out the example by copying one line.

Having a second line which defines a variable or constant isn’t really
that more inconvenient.

The time necessary for the actual
creation process is negligible; I need to run the string creation ten
thousand times before the result is substantial enough to significantly
affect the numbers I’m getting.

Still you are measuring one thing but reasoning about another.

I am not sure what sort of proper and complete tests you are looking
for.

Tests which measure exactly the functionality you want to scrutinize
not something else.

Regarding the actual slowdown, the issue comes down to the engine
repeating the match from the start for every single “a” in the string,
then failing and trying again with the next letter.

Backtracking.

I do not see how a
greedy modifier can help in this situation, and I do not know enough
about the ruby regex engine to comment on the underlying design.

Mine was a general statement about issues that can be avoided with
greediness modifiers. I didn’t say it’s the solution to all
performance issues of this kind.

I had a closer look at your expressions. They are of course
artificial to demonstrate the effects of backtracking. I would assume
that one would typically write different expressions in real
applications, for example, using anchoring which dramatically improves
the situation here.

$ ruby rx-bm.rb
user system total real
0 /(?>a[^b]b)/ 11.732000 0.000000 11.732000 ( 11.737671)
1 /(?>a[^b]b)/ 0.015000 0.000000 0.015000 ( 0.003000)
0 /\A(?>a[^b]b)/ 0.000000 0.000000 0.000000 ( 0.002000)
1 /\A(?>a[^b]b)/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /(?<!a)(?>a[^b]b)/ 0.016000 0.000000 0.016000 ( 0.012001)
1 /(?<!a)(?>a[^b]b)/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /aab/ 11.700000 0.000000 11.700000 ( 11.703670)
1 /aab/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /\Aaab/ 0.000000 0.000000 0.000000 ( 0.002000)
1 /\Aaab/ 0.015000 0.000000 0.015000 ( 0.002000)
0 /(?<!a)aab/ 0.000000 0.000000 0.000000 ( 0.013001)
1 /(?<!a)aab/ 0.016000 0.000000 0.016000 ( 0.002000)
0 /a+b/ 11.248000 0.000000 11.248000 ( 11.252644)
1 /a+b/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.003000)
1 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /(?<!a)a+b/ 0.015000 0.000000 0.015000 ( 0.012001)
1 /(?<!a)a+b/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /a.*b/ 4.649000 0.000000 4.649000 ( 4.641265)
1 /a.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
0 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
1 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
0 /(?<!a)a.*b/ 0.016000 0.000000 0.016000 ( 0.011001)
1 /(?<!a)a.*b/ 0.000000 0.000000 0.000000 ( 0.000000)
$ cat -n rx-bm.rb
1
2 require ‘benchmark’
3
4 strings = [
5 (“a” * 10_000).freeze,
6 (“a” * 10_000 + “b”).freeze
7 ]
8
9 data = [
10 /(?>a[^b]b)/,
11 /\A(?>a[^b]b)/,
12 /(?<!a)(?>a[^b]b)/,
13
14 /aab/,
15 /\Aaab/,
16 /(?<!a)aab/,
17
18 /a+b/,
19 /\Aa+b/,
20 /(?<!a)a+b/,
21
22 /a.*b/,
23 /\Aa.*b/,
24 /(?<!a)a.*b/,
25 ]
26
27 REP = 20
28
29 Benchmark.bm 25 do |x|
30 data.each do |rx|
31 strings.each_with_index do |s, i|
32 x.report sprintf(“%2d %p”, i, rx) do
33 REP.times do
34 rx.match s
35 end
36 end
37 end
38 end
39 end
$

So, yes, there is a cost for backtracking involved, but it only shows
only in particular situations:

large inputs
specific expressions (absence of anchoring, using unlimited
repetition operators vs. limited (e.g. /a{0,10}/).

The phenomenon is well known with regular expression engines I
believe. The cases you mention seem to be rather the exception than
the common state of affairs. Which doesn’t mean that Ruby’s engine
cannot be improved. I just don’t think it’s not that dramatic an
issue.

Kind regards

robert

tikhon_b · February 22, 2013, 5:09pm

On Fri, Feb 22, 2013 at 4:24 PM, Tikhon B. [email protected] wrote:

I took another look at the anchors ruby provides, and I found that the
\G anchor actually solves the issue outlined in this particular example
as it applies in my real use case to my satisfaction.

Good!

I’m still a bit unsettled that there is such a caveat to consider when
writing a relatively benign expression.

I actually believe this is just one more item to watch out for when
creating regular expressions. Many expressions I see posted (here and
elsewhere) are done a bit sloppy and will match properly with the few
inputs the user is considering right now but break with other inputs.
I believe you can say the art (?) of creating regular expressions is
not well developed on average.

However, given what the symptoms
of the problem imply about the engine design, I suspect fixing the issue
would take quite a bit of either hackery, or a fair amount of re-design.

Have you read “Mastering Regular Expressions”? If not, it’s a good
read about the matter. Backtracking occurs in every regexp engine
which is implemented with a NFA and so every of those engines is in
principle vulnerable here. It’s different for DFA based engines which
do not need backtracking. But today’s regexp engines usually use a
NFA (with sed being a prominent exception) because of the greater
flexibility. It’s a trade off and you always pay a price.

Cheers

robert

tikhon_b · February 22, 2013, 7:42pm

On 02/22/2013 07:24 AM, Tikhon B. wrote:

cannot be improved. I just don’t think it’s not that dramatic an
I’m still a bit unsettled that there is such a caveat to consider when

You might be interested in the following pages:

-Justin

tikhon_b · February 22, 2013, 4:24pm

Robert K. wrote in post #1098439:

So, yes, there is a cost for backtracking involved, but it only shows
only in particular situations:

large inputs

specific expressions (absence of anchoring, using unlimited
repetition operators vs. limited (e.g. /a{0,10}/).

The phenomenon is well known with regular expression engines I
believe. The cases you mention seem to be rather the exception than
the common state of affairs. Which doesn’t mean that Ruby’s engine
cannot be improved. I just don’t think it’s not that dramatic an
issue.

Kind regards

robert

I took another look at the anchors ruby provides, and I found that the
\G anchor actually solves the issue outlined in this particular example
as it applies in my real use case to my satisfaction.

I’m still a bit unsettled that there is such a caveat to consider when
writing a relatively benign expression. However, given what the symptoms
of the problem imply about the engine design, I suspect fixing the issue
would take quite a bit of either hackery, or a fair amount of re-design.

I’ll chalk it up to yet another one of those strange landmines you
encounter in ruby every once in a while.

Thanks,

Tikhon