Forum: Ruby Questionable regex performance when using lazy matching and inverted character classes

Posted by Tikhon B. (tikhon_b)
on 2013-02-21 12:12
Hi All,

I am trying to figure out a performance question with the ruby regex
matcher. Running the following code takes excessively longer than
expected.

1. Benchmark.measure {("a" * 10_000).match(/a.*?b/)}
=>   1.140000   0.010000   1.150000 (  1.195161)

2. Benchmark.measure {("a" * 10_000).match(/a[^b]*b/)}
=>   1.400000   0.000000   1.400000 (  1.462178)

Thinking the parser was doing some unnecessary backtracking I also tried
atomic grouping, and other simpler expressions with very limited
success:

3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)}
=>   1.400000   0.000000   1.400000 (  1.462178)

4. Benchmark.measure {("a" * 10_000).match(/aa*b/)}
=>   1.270000   0.000000   1.270000 (  1.331137)

5. Benchmark.measure{val.match(/a.*b/)}
=>   0.460000   0.000000   0.460000 (  0.487267)

Simplifying the expression further produces the expected results.

6. Benchmark.measure {("a" * 10_000).match(/a*b/)}
=>   0.000000   0.000000   0.000000 (  0.000093)

As best as I can tell this is caused by the repetition of the first
character in the expression in the string:

7. Benchmark.measure{("a" + "c" * 9999).match(/ac*b/)}
=>   0.000000   0.000000   0.000000 (  0.000372)

This may be a very simple vector for a DDOS attack against a server
using a similar expression to verify input data. What more, since valid
input data is not likely to resemble scenarios 1-6 the issue may go
undetected by a developers and testers not aware of this problem.

Conversely, other developers may be turned off from using a regex in a
place where it would perform well due to simple performance testing as
above.

This sort of performance hit is in neither the NodeJS nor Python regex
engines. Writing this message I think I got a better grasp of the cause
of the issue, but I still feel that it needs to be addressed for the
sake of security.
Posted by Robert Klemme (robert_k78)
on 2013-02-21 12:34
(Received via mailing list)
On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com> 
wrote:
> Hi All,
>
> I am trying to figure out a performance question with the ruby regex
> matcher. Running the following code takes excessively longer than
> expected.

You are doing the measurement wrong.  You do not only measure matching
but also creation of the String.  Given that fact, all your results
are moot.

> 3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)}
> =>   1.400000   0.000000   1.400000 (  1.462178)
>
> 4. Benchmark.measure {("a" * 10_000).match(/aa*b/)}
> =>   1.270000   0.000000   1.270000 (  1.331137)
>
> 5. Benchmark.measure{val.match(/a.*b/)}
> =>   0.460000   0.000000   0.460000 (  0.487267)

Where does 'val' come from and what does it reference?

> engines. Writing this message I think I got a better grasp of the cause
> of the issue, but I still feel that it needs to be addressed for the
> sake of security.

Before we jump to conclusions please provide the proper and complete
test.  Note also that it is a common fact that NFA's can suffer from
backtracking in certain conditions.  So I am not too surprised nor
worried.  I did not look closely at all your test cases but often
backtracking issues can be remedied by using greediness modifiers
(either "greedy" or "reluctant").

Kind regards

robert
Posted by Tikhon B. (tikhon_b)
on 2013-02-22 01:21
Robert Klemme wrote in post #1098199:
> On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com>
> wrote:
>> Hi All,
>>
>> I am trying to figure out a performance question with the ruby regex
>> matcher. Running the following code takes excessively longer than
>> expected.
>
> You are doing the measurement wrong.  You do not only measure matching
> but also creation of the String.  Given that fact, all your results
> are moot.

I actively considered this matter when submitting the post, but in the
end I included the string creating in the test to allow anyone to try
out the example by copying one line. The time necessary for the actual
creation process is negligible; I need to run the string creation ten
thousand times before the result is substantial enough to significantly
affect the numbers I'm getting.

Benchmark.measure {10_000.times {"a" * 10_000}}
=>   0.100000   0.000000   0.100000 (  0.105442)

Case in point:

val = "a" * 10_000
3a. Benchmark.measure {val.match(/(?>a[^b]*b)/)}
=>   1.430000   0.000000   1.430000 (  1.490874)

4a. Benchmark.measure {val.match(/aa*b/)}
=>   1.310000   0.000000   1.310000 (  1.366131)

>> 3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)}
>> =>   1.400000   0.000000   1.400000 (  1.462178)
>>
>> 4. Benchmark.measure {("a" * 10_000).match(/aa*b/)}
>> =>   1.270000   0.000000   1.270000 (  1.331137)
>>
>> 5. Benchmark.measure{val.match(/a.*b/)}
>> =>   0.460000   0.000000   0.460000 (  0.487267)
>
> Where does 'val' come from and what does it reference?

As I just mentioned, I included the string creating as part of the
benchmark for simple reproduction. While I was figuring out the answer
myself, I assigned `val = "a" * 10_000` for ease of use.

I added that final example shortly before submitting, and simply did not
notice that I would need to re-write it to

>> engines. Writing this message I think I got a better grasp of the cause
>> of the issue, but I still feel that it needs to be addressed for the
>> sake of security.
>
> Before we jump to conclusions please provide the proper and complete
> test.  Note also that it is a common fact that NFA's can suffer from
> backtracking in certain conditions.  So I am not too surprised nor
> worried.  I did not look closely at all your test cases but often
> backtracking issues can be remedied by using greediness modifiers
> (either "greedy" or "reluctant").
>
> Kind regards
>
> robert

I am not sure what sort of proper and complete tests you are looking
for. Regarding the actual slowdown, the issue comes down to the engine
repeating the match from the start for every single "a" in the string,
then failing and trying again with the next letter. I do not see how a
greedy modifier can help in this situation, and I do not know enough
about the ruby regex engine to comment on the underlying design.
Posted by Robert Klemme (robert_k78)
on 2013-02-22 15:32
(Received via mailing list)
On Fri, Feb 22, 2013 at 1:21 AM, Tikhon B. <lists@ruby-forum.com> wrote:
> Robert Klemme wrote in post #1098199:
>> On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com>
>> wrote:

>> You are doing the measurement wrong.  You do not only measure matching
>> but also creation of the String.  Given that fact, all your results
>> are moot.
>
> I actively considered this matter when submitting the post, but in the
> end I included the string creating in the test to allow anyone to try
> out the example by copying one line.

Having a second line which defines a variable or constant isn't really
that more inconvenient.

> The time necessary for the actual
> creation process is negligible; I need to run the string creation ten
> thousand times before the result is substantial enough to significantly
> affect the numbers I'm getting.

Still you are measuring one thing but reasoning about another.

> I am not sure what sort of proper and complete tests you are looking
> for.

Tests which measure exactly the functionality you want to scrutinize
not something else.

> Regarding the actual slowdown, the issue comes down to the engine
> repeating the match from the start for every single "a" in the string,
> then failing and trying again with the next letter.

Backtracking.

> I do not see how a
> greedy modifier can help in this situation, and I do not know enough
> about the ruby regex engine to comment on the underlying design.

Mine was a general statement about issues that can be avoided with
greediness modifiers.  I didn't say it's the solution to all
performance issues of this kind.

I had a closer look at your expressions.  They are of course
artificial to demonstrate the effects of backtracking.  I would assume
that one would typically write different expressions in real
applications, for example, using anchoring which dramatically improves
the situation here.

$ ruby rx-bm.rb
                                user     system      total        real
 0 /(?>a[^b]*b)/           11.732000   0.000000  11.732000 ( 11.737671)
 1 /(?>a[^b]*b)/            0.015000   0.000000   0.015000 (  0.003000)
 0 /\A(?>a[^b]*b)/          0.000000   0.000000   0.000000 (  0.002000)
 1 /\A(?>a[^b]*b)/          0.000000   0.000000   0.000000 (  0.003000)
 0 /(?<!a)(?>a[^b]*b)/      0.016000   0.000000   0.016000 (  0.012001)
 1 /(?<!a)(?>a[^b]*b)/      0.000000   0.000000   0.000000 (  0.003000)
 0 /aa*b/                  11.700000   0.000000  11.700000 ( 11.703670)
 1 /aa*b/                   0.000000   0.000000   0.000000 (  0.002000)
 0 /\Aaa*b/                 0.000000   0.000000   0.000000 (  0.002000)
 1 /\Aaa*b/                 0.015000   0.000000   0.015000 (  0.002000)
 0 /(?<!a)aa*b/             0.000000   0.000000   0.000000 (  0.013001)
 1 /(?<!a)aa*b/             0.016000   0.000000   0.016000 (  0.002000)
 0 /a+b/                   11.248000   0.000000  11.248000 ( 11.252644)
 1 /a+b/                    0.000000   0.000000   0.000000 (  0.002000)
 0 /\Aa+b/                  0.000000   0.000000   0.000000 (  0.003000)
 1 /\Aa+b/                  0.000000   0.000000   0.000000 (  0.002000)
 0 /(?<!a)a+b/              0.015000   0.000000   0.015000 (  0.012001)
 1 /(?<!a)a+b/              0.000000   0.000000   0.000000 (  0.003000)
 0 /a.*b/                   4.649000   0.000000   4.649000 (  4.641265)
 1 /a.*b/                   0.000000   0.000000   0.000000 (  0.001000)
 0 /\Aa.*b/                 0.000000   0.000000   0.000000 (  0.001000)
 1 /\Aa.*b/                 0.000000   0.000000   0.000000 (  0.001000)
 0 /(?<!a)a.*b/             0.016000   0.000000   0.016000 (  0.011001)
 1 /(?<!a)a.*b/             0.000000   0.000000   0.000000 (  0.000000)
$ cat -n rx-bm.rb
     1
     2  require 'benchmark'
     3
     4  strings = [
     5    ("a" * 10_000).freeze,
     6    ("a" * 10_000 + "b").freeze
     7  ]
     8
     9  data = [
    10          /(?>a[^b]*b)/,
    11          /\A(?>a[^b]*b)/,
    12          /(?<!a)(?>a[^b]*b)/,
    13
    14          /aa*b/,
    15          /\Aaa*b/,
    16          /(?<!a)aa*b/,
    17
    18          /a+b/,
    19          /\Aa+b/,
    20          /(?<!a)a+b/,
    21
    22          /a.*b/,
    23          /\Aa.*b/,
    24          /(?<!a)a.*b/,
    25  ]
    26
    27  REP = 20
    28
    29  Benchmark.bm 25 do |x|
    30    data.each do |rx|
    31      strings.each_with_index do |s, i|
    32        x.report sprintf("%2d %p", i, rx) do
    33          REP.times do
    34            rx.match s
    35          end
    36        end
    37      end
    38    end
    39  end
$

So, yes, there is a cost for backtracking involved, but it only shows
only in particular situations:
- large inputs
- specific expressions (absence of anchoring, using unlimited
repetition operators vs. limited (e.g. /a{0,10}/).

The phenomenon is well known with regular expression engines I
believe.  The cases you mention seem to be rather the exception than
the common state of affairs.  Which doesn't mean that Ruby's engine
cannot be improved.  I just don't think it's not that dramatic an
issue.

Kind regards

robert
Posted by Tikhon B. (tikhon_b)
on 2013-02-22 16:24
Robert Klemme wrote in post #1098439:
> So, yes, there is a cost for backtracking involved, but it only shows
> only in particular situations:
> - large inputs
> - specific expressions (absence of anchoring, using unlimited
> repetition operators vs. limited (e.g. /a{0,10}/).
>
> The phenomenon is well known with regular expression engines I
> believe.  The cases you mention seem to be rather the exception than
> the common state of affairs.  Which doesn't mean that Ruby's engine
> cannot be improved.  I just don't think it's not that dramatic an
> issue.
>
> Kind regards
>
> robert

I took another look at the anchors ruby provides, and I found that the
\G anchor actually solves the issue outlined in this particular example
as it applies in my real use case to my satisfaction.

I'm still a bit unsettled that there is such a caveat to consider when
writing a relatively benign expression. However, given what the symptoms
of the problem imply about the engine design, I suspect fixing the issue
would take quite a bit of either hackery, or a fair amount of re-design.

I'll chalk it up to yet another one of those strange landmines you
encounter in ruby every once in a while.

Thanks,

Tikhon
Posted by Robert Klemme (robert_k78)
on 2013-02-22 17:09
(Received via mailing list)
On Fri, Feb 22, 2013 at 4:24 PM, Tikhon B. <lists@ruby-forum.com> wrote:

> I took another look at the anchors ruby provides, and I found that the
> \G anchor actually solves the issue outlined in this particular example
> as it applies in my real use case to my satisfaction.

Good!

> I'm still a bit unsettled that there is such a caveat to consider when
> writing a relatively benign expression.

I actually believe this is just one more item to watch out for when
creating regular expressions.  Many expressions I see posted (here and
elsewhere) are done a bit sloppy and will match properly with the few
inputs the user is considering right now but break with other inputs.
I believe you can say the art (?) of creating regular expressions is
not well developed on average.

> However, given what the symptoms
> of the problem imply about the engine design, I suspect fixing the issue
> would take quite a bit of either hackery, or a fair amount of re-design.

Have you read "Mastering Regular Expressions"?  If not, it's a good
read about the matter.  Backtracking occurs in every regexp engine
which is implemented with a NFA and so every of those engines is in
principle vulnerable here.  It's different for DFA based engines which
do not need backtracking.  But today's regexp engines usually use a
NFA (with sed being a prominent exception) because of the greater
flexibility.  It's a trade off and you always pay a price. :-)

Cheers

robert
Posted by Justin Collins (Guest)
on 2013-02-22 19:42
(Received via mailing list)
On 02/22/2013 07:24 AM, Tikhon B. wrote:
>> cannot be improved.  I just don't think it's not that dramatic an
> I'm still a bit unsettled that there is such a caveat to consider when
>
You might be interested in the following pages:

https://en.wikipedia.org/wiki/ReDoS
http://www.regular-expressions.info/catastrophic.html


-Justin
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.