Slow ruby regexes


#1

Hello i’ve been reading this article, wich has a few benchmarks
(inclding a ruby related one) and many theorical explanations of regex
matching algorithms:

http://swtch.com/~rsc/regexp/regexp1.html

I became a little worried since i’m making hevy use of regexes in a
program of mine that shortly i’ll have to run on each of thousands of
text files.

  • Do anyone now if what is said relates to ruby 1.8.6
  • If so, is it going to change ?

Thank you!


#2

On Wed, 11 Apr 2007 02:12:09 +0900, Emmanuel removed_email_address@domain.invalid
wrote:

I became a little worried since i’m making hevy use of regexes in a
program of mine that shortly i’ll have to run on each of thousands of
text files.

Ruby’s regexp implementation is a backtracking matcher similar to Perl’s
– while the article’s criticism is valid, you should be fine if you
write your regular expressions to minimize the amount of backtracking
required.

(The author’s point is that such care is unnecessary with an NFA-based
regexp implementation.)

-mental


#3

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

For example, a regex was shown in this list lately that tells you if a
number is prime. That needs backtracking.

Aur


#4

On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit removed_email_address@domain.invalid
wrote:

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

It’s unfortunate, though, that an NFA-based approach isn’t selectively
used for those expressions which don’t use features like backreferences
(i.e. the majority of regular expressions people write). The
performance is so much better it’s not funny.

-mental


#5

On Wed, 11 Apr 2007 03:56:28 +0900, “Robert D.”
removed_email_address@domain.invalid wrote:

The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

No, the bad news is that the bad design of most “regex” implementations
will not let you escape with using “regexes” as if they were true
regular expressions; knowing a lot about the implementation of the regex
engine is a necessary thing to use them well.

But you are right of course that Ruby’s current Regex implementation
is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but
in other respects we’ve actually regressed since then.

I’m not suggesting we abandon backreferences and so forth (which are
useful sometimes), but rather that NFAs should be used for evaluating
the majority of “regexes” which don’t use non-regular features.

-mental


#6

On Tuesday 10 April 2007 19:14, MenTaLguY wrote:

On Wed, 11 Apr 2007 03:56:28 +0900, “Robert D.” removed_email_address@domain.invalid
wrote:

is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in
other respects we’ve actually regressed since then.

I’m not suggesting we abandon backreferences and so forth (which are useful
sometimes), but rather that NFAs should be used for evaluating the majority
of “regexes” which don’t use non-regular features.

-mental

Well topic is rather interesting, i’ll try to check it soon, and i’ll
try to
write ruby interface for c library that author of article reference, i
was
looking for good case to test including c into ruby.
I’m not promising anything tho :slight_smile:

regexps are great tool for text processing, but serious log analyzers
(like
one my company use) are written in c and base on lots of low-level
string
manipulation, raw binary access to DB et cecera, not to mention that
ruby is
much much slower then c anyway.

Greets
Marcin R.


#7

On 4/10/07, MenTaLguY removed_email_address@domain.invalid wrote:

On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit removed_email_address@domain.invalid wrote:

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

It’s unfortunate, though, that an NFA-based approach isn’t selectively used for those expressions which don’t use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it’s not funny.

Well the good news is that Ruby 2 Regexs will be much better
performing, I do not know if they will be NFA, though.
The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

But you are right of course that Ruby’s current Regex implementation
is not state of the art, useful, very useful nevertheless.

Cheers
Robert


#8

oops wrong button here :frowning:

On 4/11/07, Robert D. removed_email_address@domain.invalid wrote:

On 4/10/07, MenTaLguY removed_email_address@domain.invalid wrote:

On Wed, 11 Apr 2007 03:56:28 +0900, “Robert D.” removed_email_address@domain.invalid wrote:

The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

No, the bad news is that the bad design of most “regex” implementations will not let you escape with using “regexes” as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.
That might be, I am not qualified to discuss this issue.
What I actually meant is that given the need for backtracking
abilities it will just not be possible to “guess” what the user meant
when he/she uses too general an expression and I am talking about NFA
here.
Well just wanted to clear that point up.

But you are right of course that Ruby’s current Regex implementation
is not state of the art, useful, very useful nevertheless.

Cheers


#9

On Wed, 11 Apr 2007 16:53:26 +0900, “Robert D.”
removed_email_address@domain.invalid wrote:

What I actually meant is that given the need for backtracking
abilities it will just not be possible to “guess” what the user meant
when he/she uses too general an expression and I am talking about NFA
here.

I’m a little confused – the regexp features which require backtracking
correspond to well-defined elements of the syntax. There isn’t any need
to guess whether the user requires backtracking or not, one can simply
see whether such expressions are present in a particular regexp.

For “pure” regular expressions (which can be matched by NFAs),
backtracking and NFA-based evaluation yield equivalent results, except
that the NFA approach is O(n) in the number of characters, rather than
backtracking’s worst-case exponential complexity.

-mental


#10

On 4/11/07, MenTaLguY removed_email_address@domain.invalid wrote:

On Wed, 11 Apr 2007 16:53:26 +0900, “Robert D.” removed_email_address@domain.invalid wrote:

What I actually meant is that given the need for backtracking
abilities it will just not be possible to “guess” what the user meant
when he/she uses too general an expression and I am talking about NFA
here.

I’m a little confused – the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn’t any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For “pure” regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking’s worst-case exponential complexity.

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match, let me see the NFA should be something like
X start second
a start term
. second second

how could backtracking avoid when no a is found in the state?
The DFA of course could never backtrack but is there a DFA for this
regex?

Hope to learn a little bit more about what I seem to have forgotten :wink:

Cheers
Robert
%r


#11

stupid me make this simply /a*.*a/ sorry


#12

Robert D. wrote:

stupid me make this simply /a*.*a/ sorry
Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It’s quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That’s the
main problem that is currently solved with backtracking implementations.

What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don’t use
backreferences so often, you could have Thompson NFA for those cases
where it’s fast, and switch to a backtracking engine only when you need
it.

Cheers


Ola B. (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

“Yields falsehood when quined” yields falsehood when quined.


#13

On 4/10/07, MenTaLguY removed_email_address@domain.invalid wrote:


#14

On 4/11/07, Ola B. removed_email_address@domain.invalid wrote:

Robert D. wrote:

stupid me make this simply /a*.*a/ sorry
Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It’s quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That’s the
main problem that is currently solved with backtracking implementations.

Thx Ola, I’ll look it up.

What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don’t use
backreferences so often, you could have Thompson NFA for those cases
where it’s fast, and switch to a backtracking engine only when you need it.
Yeah I was not concerned with the backreferences, just trying to
understand the basic stuff :wink:

Cheers


Robert


#15

On 4/11/07, Robert D. removed_email_address@domain.invalid wrote:

On 4/11/07, Ola B. removed_email_address@domain.invalid wrote:

Thx Ola, I’ll look it up.
Done :wink:
just wanted to share the link
http://swtch.com/~rsc/regexp/regexp1.html
Robert


#16

On 11.04.2007 22:51, MenTaLguY wrote:

final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter’s current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

While this is true in theory there is a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now for pratical purposes most modern RX engines are NFA or
hybrids, sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see “Mastering Regular Expressions” for details).

Kind regards

robert


#17

On Thu, 12 Apr 2007 03:27:04 +0900, “Robert D.”
removed_email_address@domain.invalid wrote:

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this
(after collapsing most of the epsilon transitions introduced by
Thompson’s Algorithm):

start: (’’ ->a_star | ‘’ ->any_star),
a_star: (‘a’ ->a_star | any ->any_star),
any_star: (any ->any_star | ‘a’ =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which
sets a counter (initialized to -1) to the current input position if that
is larger than the counter’s current value. When we have explored all
possible matches, the value of the counter will be the position of the
last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA
corresponds to a superposition of states in the original NFA (again in
pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a,
b, and c):

{start,a_star,any_star}: (
‘a’ =>{a_star,any_star,final} |
(any-‘a’) ->{any_star}
),

{any_star}: (
‘a’ =>{any_star,final} |
(any-‘a’) ->{any_star}
),

{a_star,any_star,final}: (
‘a’ =>{a_star,any_star,final} |
(any-‘a’) ->{any_star,final}
),

{any_star,final}: (
‘a’ =>{any_star,final} |
(any-‘a’) ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and
{any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional
counters and labeled transitions with priorities to resolve
greedy/non-greedy ambiguities (as is often done in Ragel to resolve
ambiguities). Prioritized transitions are also necessary for regexps
which are not anchored at the beginning.

-mental


#18

On 10.04.2007 21:14, MenTaLguY wrote:

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we’ve actually regressed since then.

I’m not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of “regexes” which don’t use non-regular features.

I think you got NFA and DFA mixed up: DFA’s are the ones that do not
suffer backtracking issues.

robert


#19

On 4/11/07, MenTaLguY removed_email_address@domain.invalid wrote:

final: (any ->final)
),

Thanx for the explanation :), I guess I learnt this algorithm once
upon a time.:(.
Anyway relearning is learning too.
Now there does not seem to be any practical need to do this
transformation though as the Thompson NFA algorithm kinda does the
same thing in runtime, not creating states that are sets of states but
just keeping a set of states.
Thx again for your time and patience, and yes I have to change my
initial statement

there are no bad news :)) [ still ignoring backreferences ]

Cheers
Robert


#20

“Robert D.” removed_email_address@domain.invalid writes:

For “pure” regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking’s worst-case exponential complexity.

Guys, I think your possibly mixing up backtracking and backreferences.
It is
backreferences (referencing a earlier match, often to use as part of the
definition of a latter match) that NFA is not able to do. both
approaches use
backtracking. The NFA approach is able to parallelise the choices,
reducing the
overheads of backtracking. So, given n different possible paths to
follow,
instead of, in the worst case, trying n-1 paths and having to backtrack
after
each one, you can try all n paths at once. At least, I htink thats what
the
article was saying form a very quick read and from what I can remember
from my
computability course and finite automata (oh so many years ago now!).

Tim