Hi All,
I am trying to figure out a performance question with the ruby regex
matcher. Running the following code takes excessively longer than
expected.
1. Benchmark.measure {("a" * 10_000).match(/a.*?b/)}
=> 1.140000 0.010000 1.150000 ( 1.195161)
2. Benchmark.measure {("a" * 10_000).match(/a[^b]*b/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)
Thinking the parser was doing some unnecessary backtracking I also tried
atomic grouping, and other simpler expressions with very limited
success:
3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)}
=> 1.400000 0.000000 1.400000 ( 1.462178)
4. Benchmark.measure {("a" * 10_000).match(/aa*b/)}
=> 1.270000 0.000000 1.270000 ( 1.331137)
5. Benchmark.measure{val.match(/a.*b/)}
=> 0.460000 0.000000 0.460000 ( 0.487267)
Simplifying the expression further produces the expected results.
6. Benchmark.measure {("a" * 10_000).match(/a*b/)}
=> 0.000000 0.000000 0.000000 ( 0.000093)
As best as I can tell this is caused by the repetition of the first
character in the expression in the string:
7. Benchmark.measure{("a" + "c" * 9999).match(/ac*b/)}
=> 0.000000 0.000000 0.000000 ( 0.000372)
This may be a very simple vector for a DDOS attack against a server
using a similar expression to verify input data. What more, since valid
input data is not likely to resemble scenarios 1-6 the issue may go
undetected by a developers and testers not aware of this problem.
Conversely, other developers may be turned off from using a regex in a
place where it would perform well due to simple performance testing as
above.
This sort of performance hit is in neither the NodeJS nor Python regex
engines. Writing this message I think I got a better grasp of the cause
of the issue, but I still feel that it needs to be addressed for the
sake of security.
on 2013-02-21 12:12
on 2013-02-21 12:34
On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com> wrote: > Hi All, > > I am trying to figure out a performance question with the ruby regex > matcher. Running the following code takes excessively longer than > expected. You are doing the measurement wrong. You do not only measure matching but also creation of the String. Given that fact, all your results are moot. > 3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)} > => 1.400000 0.000000 1.400000 ( 1.462178) > > 4. Benchmark.measure {("a" * 10_000).match(/aa*b/)} > => 1.270000 0.000000 1.270000 ( 1.331137) > > 5. Benchmark.measure{val.match(/a.*b/)} > => 0.460000 0.000000 0.460000 ( 0.487267) Where does 'val' come from and what does it reference? > engines. Writing this message I think I got a better grasp of the cause > of the issue, but I still feel that it needs to be addressed for the > sake of security. Before we jump to conclusions please provide the proper and complete test. Note also that it is a common fact that NFA's can suffer from backtracking in certain conditions. So I am not too surprised nor worried. I did not look closely at all your test cases but often backtracking issues can be remedied by using greediness modifiers (either "greedy" or "reluctant"). Kind regards robert
on 2013-02-22 01:21
Robert Klemme wrote in post #1098199: > On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com> > wrote: >> Hi All, >> >> I am trying to figure out a performance question with the ruby regex >> matcher. Running the following code takes excessively longer than >> expected. > > You are doing the measurement wrong. You do not only measure matching > but also creation of the String. Given that fact, all your results > are moot. I actively considered this matter when submitting the post, but in the end I included the string creating in the test to allow anyone to try out the example by copying one line. The time necessary for the actual creation process is negligible; I need to run the string creation ten thousand times before the result is substantial enough to significantly affect the numbers I'm getting. Benchmark.measure {10_000.times {"a" * 10_000}} => 0.100000 0.000000 0.100000 ( 0.105442) Case in point: val = "a" * 10_000 3a. Benchmark.measure {val.match(/(?>a[^b]*b)/)} => 1.430000 0.000000 1.430000 ( 1.490874) 4a. Benchmark.measure {val.match(/aa*b/)} => 1.310000 0.000000 1.310000 ( 1.366131) >> 3. Benchmark.measure {("a" * 10_000).match(/(?>a[^b]*b)/)} >> => 1.400000 0.000000 1.400000 ( 1.462178) >> >> 4. Benchmark.measure {("a" * 10_000).match(/aa*b/)} >> => 1.270000 0.000000 1.270000 ( 1.331137) >> >> 5. Benchmark.measure{val.match(/a.*b/)} >> => 0.460000 0.000000 0.460000 ( 0.487267) > > Where does 'val' come from and what does it reference? As I just mentioned, I included the string creating as part of the benchmark for simple reproduction. While I was figuring out the answer myself, I assigned `val = "a" * 10_000` for ease of use. I added that final example shortly before submitting, and simply did not notice that I would need to re-write it to >> engines. Writing this message I think I got a better grasp of the cause >> of the issue, but I still feel that it needs to be addressed for the >> sake of security. > > Before we jump to conclusions please provide the proper and complete > test. Note also that it is a common fact that NFA's can suffer from > backtracking in certain conditions. So I am not too surprised nor > worried. I did not look closely at all your test cases but often > backtracking issues can be remedied by using greediness modifiers > (either "greedy" or "reluctant"). > > Kind regards > > robert I am not sure what sort of proper and complete tests you are looking for. Regarding the actual slowdown, the issue comes down to the engine repeating the match from the start for every single "a" in the string, then failing and trying again with the next letter. I do not see how a greedy modifier can help in this situation, and I do not know enough about the ruby regex engine to comment on the underlying design.
on 2013-02-22 15:32
On Fri, Feb 22, 2013 at 1:21 AM, Tikhon B. <lists@ruby-forum.com> wrote: > Robert Klemme wrote in post #1098199: >> On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <lists@ruby-forum.com> >> wrote: >> You are doing the measurement wrong. You do not only measure matching >> but also creation of the String. Given that fact, all your results >> are moot. > > I actively considered this matter when submitting the post, but in the > end I included the string creating in the test to allow anyone to try > out the example by copying one line. Having a second line which defines a variable or constant isn't really that more inconvenient. > The time necessary for the actual > creation process is negligible; I need to run the string creation ten > thousand times before the result is substantial enough to significantly > affect the numbers I'm getting. Still you are measuring one thing but reasoning about another. > I am not sure what sort of proper and complete tests you are looking > for. Tests which measure exactly the functionality you want to scrutinize not something else. > Regarding the actual slowdown, the issue comes down to the engine > repeating the match from the start for every single "a" in the string, > then failing and trying again with the next letter. Backtracking. > I do not see how a > greedy modifier can help in this situation, and I do not know enough > about the ruby regex engine to comment on the underlying design. Mine was a general statement about issues that can be avoided with greediness modifiers. I didn't say it's the solution to all performance issues of this kind. I had a closer look at your expressions. They are of course artificial to demonstrate the effects of backtracking. I would assume that one would typically write different expressions in real applications, for example, using anchoring which dramatically improves the situation here. $ ruby rx-bm.rb user system total real 0 /(?>a[^b]*b)/ 11.732000 0.000000 11.732000 ( 11.737671) 1 /(?>a[^b]*b)/ 0.015000 0.000000 0.015000 ( 0.003000) 0 /\A(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.002000) 1 /\A(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.003000) 0 /(?<!a)(?>a[^b]*b)/ 0.016000 0.000000 0.016000 ( 0.012001) 1 /(?<!a)(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.003000) 0 /aa*b/ 11.700000 0.000000 11.700000 ( 11.703670) 1 /aa*b/ 0.000000 0.000000 0.000000 ( 0.002000) 0 /\Aaa*b/ 0.000000 0.000000 0.000000 ( 0.002000) 1 /\Aaa*b/ 0.015000 0.000000 0.015000 ( 0.002000) 0 /(?<!a)aa*b/ 0.000000 0.000000 0.000000 ( 0.013001) 1 /(?<!a)aa*b/ 0.016000 0.000000 0.016000 ( 0.002000) 0 /a+b/ 11.248000 0.000000 11.248000 ( 11.252644) 1 /a+b/ 0.000000 0.000000 0.000000 ( 0.002000) 0 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.003000) 1 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.002000) 0 /(?<!a)a+b/ 0.015000 0.000000 0.015000 ( 0.012001) 1 /(?<!a)a+b/ 0.000000 0.000000 0.000000 ( 0.003000) 0 /a.*b/ 4.649000 0.000000 4.649000 ( 4.641265) 1 /a.*b/ 0.000000 0.000000 0.000000 ( 0.001000) 0 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000) 1 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000) 0 /(?<!a)a.*b/ 0.016000 0.000000 0.016000 ( 0.011001) 1 /(?<!a)a.*b/ 0.000000 0.000000 0.000000 ( 0.000000) $ cat -n rx-bm.rb 1 2 require 'benchmark' 3 4 strings = [ 5 ("a" * 10_000).freeze, 6 ("a" * 10_000 + "b").freeze 7 ] 8 9 data = [ 10 /(?>a[^b]*b)/, 11 /\A(?>a[^b]*b)/, 12 /(?<!a)(?>a[^b]*b)/, 13 14 /aa*b/, 15 /\Aaa*b/, 16 /(?<!a)aa*b/, 17 18 /a+b/, 19 /\Aa+b/, 20 /(?<!a)a+b/, 21 22 /a.*b/, 23 /\Aa.*b/, 24 /(?<!a)a.*b/, 25 ] 26 27 REP = 20 28 29 Benchmark.bm 25 do |x| 30 data.each do |rx| 31 strings.each_with_index do |s, i| 32 x.report sprintf("%2d %p", i, rx) do 33 REP.times do 34 rx.match s 35 end 36 end 37 end 38 end 39 end $ So, yes, there is a cost for backtracking involved, but it only shows only in particular situations: - large inputs - specific expressions (absence of anchoring, using unlimited repetition operators vs. limited (e.g. /a{0,10}/). The phenomenon is well known with regular expression engines I believe. The cases you mention seem to be rather the exception than the common state of affairs. Which doesn't mean that Ruby's engine cannot be improved. I just don't think it's not that dramatic an issue. Kind regards robert
on 2013-02-22 16:24
Robert Klemme wrote in post #1098439: > So, yes, there is a cost for backtracking involved, but it only shows > only in particular situations: > - large inputs > - specific expressions (absence of anchoring, using unlimited > repetition operators vs. limited (e.g. /a{0,10}/). > > The phenomenon is well known with regular expression engines I > believe. The cases you mention seem to be rather the exception than > the common state of affairs. Which doesn't mean that Ruby's engine > cannot be improved. I just don't think it's not that dramatic an > issue. > > Kind regards > > robert I took another look at the anchors ruby provides, and I found that the \G anchor actually solves the issue outlined in this particular example as it applies in my real use case to my satisfaction. I'm still a bit unsettled that there is such a caveat to consider when writing a relatively benign expression. However, given what the symptoms of the problem imply about the engine design, I suspect fixing the issue would take quite a bit of either hackery, or a fair amount of re-design. I'll chalk it up to yet another one of those strange landmines you encounter in ruby every once in a while. Thanks, Tikhon
on 2013-02-22 17:09
On Fri, Feb 22, 2013 at 4:24 PM, Tikhon B. <lists@ruby-forum.com> wrote: > I took another look at the anchors ruby provides, and I found that the > \G anchor actually solves the issue outlined in this particular example > as it applies in my real use case to my satisfaction. Good! > I'm still a bit unsettled that there is such a caveat to consider when > writing a relatively benign expression. I actually believe this is just one more item to watch out for when creating regular expressions. Many expressions I see posted (here and elsewhere) are done a bit sloppy and will match properly with the few inputs the user is considering right now but break with other inputs. I believe you can say the art (?) of creating regular expressions is not well developed on average. > However, given what the symptoms > of the problem imply about the engine design, I suspect fixing the issue > would take quite a bit of either hackery, or a fair amount of re-design. Have you read "Mastering Regular Expressions"? If not, it's a good read about the matter. Backtracking occurs in every regexp engine which is implemented with a NFA and so every of those engines is in principle vulnerable here. It's different for DFA based engines which do not need backtracking. But today's regexp engines usually use a NFA (with sed being a prominent exception) because of the greater flexibility. It's a trade off and you always pay a price. :-) Cheers robert
on 2013-02-22 19:42
On 02/22/2013 07:24 AM, Tikhon B. wrote: >> cannot be improved. I just don't think it's not that dramatic an > I'm still a bit unsettled that there is such a caveat to consider when > You might be interested in the following pages: https://en.wikipedia.org/wiki/ReDoS http://www.regular-expressions.info/catastrophic.html -Justin
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.