Slow ruby regexes

On 4/12/07, Tim X [email protected] wrote:

“Robert D.” [email protected] writes:

> > Guys, I think you’re possibly mixing up backtracking and backreferences. It is
> > backreferences (referencing an earlier match, often to use as part of the
> > definition of a later match) that NFA is not able to do.

No Tim I was not mixing it up, I said that I was not interested in the
backreferencing history. You see I am just forgetting the basic stuff but Ola
and Mental kindly gave me a refresher.

> > both approaches use backtracking.

The DFA cannot use backtracking, surely I am missing something *again*!

> > The NFA approach is able to parallelise the choices, reducing the
> > overheads of backtracking. So, given n different possible paths to follow,
> > instead of, in the worst case, trying n-1 paths and having to backtrack after
> > each one, you can try all n paths at once. At least, I think that’s what the
> > article was saying from a very quick read and from what I can remember from my
> > computability course and finite automata (oh so many years ago now!).

For me it is 25 years and for you?

Yeah I understood that in the same way, actually it is quite simple :)

> Tim
> --
> tcross (at) rapttech dot com dot au

Cheers
Robert
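(For concreteness, here is what a backreference looks like in Ruby - \1 must re-match exactly the text that capture group 1 already matched, which is the feature a plain finite automaton cannot express; the strings are just made-up examples:)

    # Matches only a doubled word, because \1 repeats group 1's captured text.
    "hello hello world" =~ /\b(\w+) \1\b/   # => 0, with $1 == "hello"
    "hello world"       =~ /\b(\w+) \1\b/   # => nil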

Emmanuel [email protected] writes:

Regular Expression Matching Can Be Simple And Fast

I became a little worried since I’m making heavy use of regexes in a program
of mine that shortly I’ll have to run on each of thousands of text files.

I don’t know about proposed plans for the regexp engine for ruby, but I would
say not to be overly concerned at this stage.

From reading that article, one thing that I noted was that even the author
acknowledged that the performance of regular expressions is significantly
affected by how the regexps are defined. If you keep the basic concepts in
mind and create your regexps accordingly, you will likely get performance
differences that outweigh the differences between the two approaches outlined
in that article - or, putting it another way, a poorly specified RE will
perform badly regardless of the algorithm used. What is important is to do
things like anchoring, using as precise a specification as possible and taking
advantage of any knowledge regarding the data you are processing.
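(To make “anchoring and a precise specification” concrete, a contrived Ruby example - the log format here is invented:)

    line = "ERROR [1234] disk full"

    # Loosely specified: the leading and trailing .* give the engine many ways
    # to split the line, which is especially costly on input that does not match.
    line =~ /.*ERROR.*\[(\d+)\].*/

    # Tighter: anchored at the start and matching only what we know about the
    # format, so failures are detected almost immediately.
    line =~ /\AERROR \[(\d+)\]/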

I’ve not done large (i.e. gigabytes of data) processing with REs under ruby,
but I have done so with Perl and the performance was quite acceptable.

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your task
may fall well within your performance requirements.
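(Checking is cheap in Ruby with the standard Benchmark library; a minimal sketch - the file name and pattern are placeholders:)

    require 'benchmark'

    text    = File.read("sample.txt")      # one representative input file
    pattern = /\AERROR \[(\d+)\]/          # whatever pattern the program uses

    puts Benchmark.measure {
      text.each_line { |line| line =~ pattern }
    }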

The other way to look at this is to consider what you would do/use as an
alternative. I’ve also used a lot of Tcl and, according to that article, Tcl
uses the faster algorithm, yet I’ve never really noticed any great performance
difference between using Perl or Tcl. So, your choice is to continue and see
if there is a problem, dealing with it if/when it occurs, or to jump now and
start writing your program in Tcl, awk or grep (maybe even call grep from
within ruby, but I suspect all performance gains would be lost in passing the
data between ruby and the grep process).

I’ve seen many people make a decision regarding the choice of technology
because they have read somewhere that x is faster than y. Often, I’ve then
seen something created which is flakey, takes 10x as long to develop or simply
doesn’t work when, in reality, the aspect they were concerned about wasn’t
even relevant to the situation they were working in. Recently, I had an
argument with one of our sys admins who wanted to use ReiserFS rather than
Ext3 as the file system on a new server. His argument was that ReiserFS had
better performance characteristics and the system would perform better. My
argument was that file system performance was not a significant bottleneck for
the server and we would be better off sticking with a file system that had a
better track record, more robust tools and represented a technology more sys
admins were familiar with. I lost the argument, initially. The server was
configured with ReiserFS and after a few months, we came in one morning to
find massive disk corruption problems. The server was changed to Ext3 and has
not missed a beat since. More to the point, the performance using Ext3 is
still well within acceptable performance metrics. My point isn’t that ReiserFS
may not be a good file system - it probably is and possibly is even “better”
than Ext3. My point is that speed is not the only issue to consider.

Something which the article doesn’t address is complexity and correctness. The
algorithm used by Perl et al. may not be fast compared to the alternative, but
it is relatively simple. Just as important as performance is correctness. An
extremely fast RE engine is not a better solution if it is only correct 95% of
the time, or is so complex that it is difficult to maintain without bugs
creeping in after version updates etc.

Tim

Robert K. [email protected] writes:

start: ('' ->a_star | '' ->any_star),
While this is true in theory, there are a number of modern RX features that are
extremely difficult or impossible to do with a DFA. So as it stands now, for
practical purposes, most modern RX engines are NFA or hybrids, sometimes (Perl)
with heavy optimizations to prevent certain ugly effects (see “Mastering
Regular Expressions” for details).
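(A couple of the features usually meant here, in Ruby/Oniguruma syntax; backreferences genuinely go beyond regular languages, while lookaround is regular in theory but awkward to compile into a DFA:)

    /(\w+)=\1/   =~ "size=size"    # backreference: must recall the captured text
    /(?<=\$)\d+/ =~ "price: $42"   # lookbehind (Ruby 1.9+/Oniguruma)
    /\d+(?=px)/  =~ "12px wide"    # lookahead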

Kind regards

To me, this is possibly a much more important point than just being concerned
over raw performance. While an NFA may perform a lot better in the majority of
cases, if we want to add in the extra power of backreferences and other ‘nice’
(possibly considered advanced) features, things begin to get a lot more
complicated. Unfortunately, with complexity, we usually find higher levels of
bugs in the final implementation.

I’m not saying that things should be kept super simple just to ensure
correctness at the cost of performance. However, we should consider complexity
and correctness/maintainability of the implementation as well as just pure
performance. For me, and possibly for many others, I find this particularly
important as the performance of regular expressions has never been a critical
issue in any of the work I’ve had to use them for (to what extent this is just
a reflection of the types of problems I deal with, I don’t know). In fact, for
everyone I’ve ever helped to sort out performance problems with their regexps,
it has turned out to be badly defined regexps that were the issue rather than
poor performance of the engine/underlying algorithm.

Maybe the real question we should be asking is "If ruby had a regexp engine
which was 10, 100, 1000, 1000000 times faster, would that allow us to solve
problems which we cannot solve now in any practical way?"

In principle, I think having a regexp engine which adopts the DFA/NFA
algorithm when backreferences are not required and uses a slower backtracking
approach when they are is a good approach, but only if we can ensure a stable
and correct implementation.

Tim


“Robert D.” [email protected] writes:

> No Tim I was not mixing it up, I said that I was not interested in the
> backreferencing history. You see I am just forgetting the basic stuff but
> Ola and Mental kindly gave me a refresher.

> > both approaches use backtracking.

> The DFA cannot use backtracking, surely I am missing something again!

Doh!

> > The NFA approach is able to parallelise the choices, reducing the
> > overheads of backtracking. So, given n different possible paths to follow,
> > instead of, in the worst case, trying n-1 paths and having to backtrack after
> > each one, you can try all n paths at once. At least, I think that’s what the
> > article was saying from a very quick read and from what I can remember from my
> > computability course and finite automata (oh so many years ago now!).

> For me it is 25 years and for you?

Nearly the same - the computability course was 82/83, but I did go back and do
some postgrad in the early 90s, so my knowledge cache should have been
fresher! Guess it’s that bloody LRU memory cache algorithm trying to make room
again!

> Yeah I understood that in the same way, actually it is quite simple :)

It’s funny, back in ’80-’83 the courses I enjoyed the most were computability,
parallel processing, compilers and assembly. Now, when I think about anything
in those areas, I just have vague recollections and no real facts. Since those
days, work has been about databases, dull user interface programming, 4GLs,
web technology etc. Up until 4 months ago, it was an increasing deluge of
project management methodologies, bullshit bingo management speak, HR, budgets
and sales. Now I’m getting back to bare-boned technology and it’s scary how
much I’ve forgotten! Still, at least I’m now looking at Ruby (amongst many
other things!).

Tim

On 4/12/07, Tim X [email protected] wrote:

and sales. Now I’m getting back to bare-boned technology and it’s scary how
much I’ve forgotten! Still, at least I’m now looking at Ruby (amongst many
other things!).

Amen, but I think one should be an optimist ;-)

Tim


tcross (at) rapttech dot com dot au

Robert

On Thu, 12 Apr 2007 16:10:06 +0900, Robert K.
[email protected] wrote:

Every NFA can be rewritten as a DFA. Each state in the resulting DFA
corresponds to a superposition of states in the original NFA

While this is true in theory, there are a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now, for practical purposes, most modern RX engines are NFA or
hybrids,

It’s important not to confuse NFAs with backtracking. For instance,
Thompson’s algorithm uses NFAs, but it evaluates by rewriting them
on-the-fly as DFAs, not by performing a backtracking search. Again,
NFAs and DFAs are equivalent, even if the naive algorithm (backtracking)
for evaluating an NFA is less efficient than the naive algorithm for
evaluating a DFA.
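(A toy Ruby sketch of the “advance a set of states in lockstep” idea - just the simulation step, without the lazy DFA caching Cox describes; the NFA table is hand-written for /a(b|c)*d/ and is purely illustrative:)

    # Hand-built NFA for /a(b|c)*d/: state => { character => [next states] }.
    NFA = {
      0 => { "a" => [1] },
      1 => { "b" => [1], "c" => [1], "d" => [2] },
      2 => {}                      # accepting state
    }
    ACCEPTING = [2]

    # Thompson-style simulation: keep the whole set of live states and advance
    # them together, one input character at a time. No backtracking is needed.
    def nfa_match?(input)
      states = [0]
      input.each_char do |ch|
        states = states.flat_map { |s| NFA[s].fetch(ch, []) }.uniq
        return false if states.empty?
      end
      !(states & ACCEPTING).empty?
    end

    nfa_match?("abccbd")   # => true
    nfa_match?("abx")      # => false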

On the other hand, those modern features which really do require
backtracking (surprisingly few) also exceed the capabilities of finite
automata (i.e. those features correspond to non-regular languages), so
at that point the regular expression engines cannot be said to be using
FAs at all, but a more general type of automaton.

Again, though we must obviously keep backtracking in reserve for those
patterns that need it, it is very wasteful not to use a pure FA approach
for those patterns where non-regular features aren’t being used.

sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see “Mastering Regular Expressions” for details).

“Mastering Regular Expressions” is very good for learning the pragmatics
of working with contemporary regular expression implementations, but it
is not a good source for theory.

-mental

On Thu, 12 Apr 2007 19:55:09 +0900, Tim X [email protected] wrote:

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your
task may fall well within your performance requirements.

This is very, very true – don’t optimize until you measure. For the
sake of the original poster, I want to underscore this point. Chances
are the Ruby regexp engine is way more than adequate for his purposes.

The algorithm used by Perl et al. may not be fast compared to the
alternative, but it is relatively simple. Just as important as performance is correctness.

Perl’s implementation is not very simple, however…

-mental

On Thu, 12 Apr 2007 16:59:47 +0900, “Robert D.”
[email protected] wrote:

Now there does not seem to be any practical need to do this
transformation though as the Thompson NFA algorithm kinda does the
same thing in runtime, not creating states that are sets of states but
just keeping a set of states.

The Thompson algorithm essentially just does the NFA → DFA conversion
lazily.

-mental

On 4/13/07, MenTaLguY [email protected] wrote:

On Thu, 12 Apr 2007 16:59:47 +0900, “Robert D.” [email protected] wrote:

Now there does not seem to be any practical need to do this
transformation though as the Thompson NFA algorithm kinda does the
same thing in runtime, not creating states that are sets of states but
just keeping a set of states.

The Thompson algorithm essentially just does the NFA → DFA conversion lazily.

This would be true if the Thompson algorithm was caching the states as
it went. If I recall correctly (it’s been a while since I read the
article) it doesn’t cache the states, and Russ Cox’s code in the
article definitely doesn’t. So basically, the difference between
Thompson’s parallel NFA evaluation and the “naive” backtracking NFA
evaluation is breadth-first search vs depth-first search. The reason
breadth-first search is so much faster on this problem is that you
avoid repeatedly going down the same search path. Given an unlimited
amount of memory and using dynamic programming techniques you could
presumably make the backtracking NFA evaluation run as fast as
Thompson’s algorithm. In fact I read somewhere that the Perl regex
engine has started to use dynamic programming, obviously without great
success.

So there are going to be some situations where a DFA will be a lot
faster than a Thompson NFA. But, having said that, I think it is
important to note (and I don’t think anyone else has mentioned this)
that it is not always better to compile a regex into a DFA, even if it
is possible. In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation. I don’t have any hard data to show this but I’m pretty
sure there will be some cases where Perl’s and Ruby’s regex engines
will easily beat the awk and grep engines.

By the way, I highly recommend people check out Ville Laurikari’s
regex engine TRE;

http://laurikari.net/tre/

While the predictable worst-case performance is nice, the really cool
thing that I like about this project is the approximate matching. This
is something I’d really like to have in Ruby.

On Fri, 13 Apr 2007 02:47:34 +0900, MenTaLguY [email protected] wrote:

In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation.

Namely, when most of the states in the NFA will not be visited for a
particular input. That is why the DFA construction is done lazily.

Sorry, I meant to say that it’s when most of the states in the DFA
will not be visited for a particular input.

Generally speaking, you can visit all the states in the NFA without
necessarily visiting all the states in the DFA.

-mental

On Fri, 13 Apr 2007 02:25:15 +0900, “David B.”
[email protected] wrote:

This would be true if the Thompson algorithm was caching the states as
it went. If I recall correctly (it’s been a while since I read the
article) it doesn’t cache the states and Russ Cox’s code in the
article definitely doesn’t.

There is a section in his article entitled “Caching the NFA to build a
DFA” with code doing exactly that.

In fact I read somewhere that the Perl regex engine has started to use dynamic
programming, obviously without great success.

It’s mentioned in his article as well.

In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation.

Namely, when most of the states in the NFA will not be visited for a
particular input. That is why the DFA construction is done lazily.

By the way, I highly recommend people check out Ville Laurikari’s
regex engine TRE;

http://laurikari.net/tre/

While the predictable worst-case performance is nice, the really cool
thing that I like about this project is the approximate matching. This
is something I’d really like to have in Ruby.

That’s really cool.

-mental

On 4/13/07, MenTaLguY [email protected] wrote:

It’s mentioned in his article as well.

Right you are. Sorry, I read the article a couple of months back when
it was posted and I thought I remembered it well enough to comment
without reading it again. I guess not. :(

In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation.

Namely, when most of the states in the NFA will not be visited for a particular input. That is why the DFA construction is done lazily.

True, lazy construction of the DFA definitely seems like a very good
solution in most cases. I guess what I was trying to say, though, is
that the seemingly naive backtracking NFA algorithm will actually be
the fastest solution in some cases, as there may be little or no
backtracking, and even simply caching the DFA states could be
unnecessary overhead. Therefore, there is no one solution that beats
them all. When matching a simple string like /supercalifragilistic/,
for instance, I believe the simple backtracking algorithm would be
faster than Thompson’s algorithm. In fact, there are algorithms with
sub-linear expected time for matching simple strings like this. I’m
currently reading a good book about this called “Flexible Pattern
Matching in Strings” by Gonzalo Navarro and Mathieu Raffinot. Worth a
read if you can get your hands on it.
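(For a pure literal like that, Ruby makes it easy to compare the regexp engine against plain substring search; which one wins depends on the implementation and the input, so measure rather than assume:)

    require 'benchmark'

    text    = ("x" * 1_000_000) << "supercalifragilistic"
    pattern = /supercalifragilistic/

    Benchmark.bm(8) do |b|
      b.report("regexp") { 1_000.times { pattern =~ text } }
      b.report("string") { 1_000.times { text.include?("supercalifragilistic") } }
    end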

On Fri, 13 Apr 2007 02:55:20 +0900, Jeremy H.
[email protected] wrote:

I think this (and much of the rest of the discussion) rather misses
the point. The DFA algorithm performs poorly in some circumstances
because it may backtrack to the same state many times.

DFA evaluation does not backtrack. Assuming epsilon transitions are
collapsed when constructing the DFA, for a given input string there will
be exactly n (or less, in the case of failure) transitions taken, where
n is the number of characters in the input string.

Perhaps you meant that the same NFA state is potentially represented in
exponentially many different DFA states? That would indeed be a
problem, except…

This is because there are eg. 3 ways of matching (a?)(a?)(a?) against “aa”
(depending on which of the three a’s you consider to be optional)

In the case of regular expressions, this ambiguity is resolved by the
usual “greedy” versus “non-greedy” distinctions, which can be expressed
by assigning weights to transitions in the NFA (thereby pruning the
number of states in the resulting DFA). This is pretty much the same
way you have to deal with ambiguities when constructing Ragel grammars.
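(In Ruby terms, that disambiguation is exactly what greedy vs. non-greedy quantifiers give you:)

    "aa" =~ /(a?)(a?)(a?)/     # greedy: $1 == "a", $2 == "a", $3 == ""
    "aa" =~ /(a??)(a??)(a??)/  # non-greedy: all three groups prefer to stay
                               # empty, so $1 == $2 == $3 == ""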

-mental

On 2007-04-12, Tim X [email protected] wrote:

The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to
follow, instead of, in the worst case, trying n-1 paths and having
to backtrack after each one, you can try all n paths at once.

I think this (and much of the rest of the discussion) rather misses
the point. The DFA algorithm performs poorly in some circumstances
because it may backtrack to the same state many times. The NFA
algorithm wins in such cases because it generates a list of all
feasible states before attempting to progress any of them, and it is
therefore easy to optimise by eliminating duplicates from the list.

Here’s a concrete example. Suppose we are trying to match a regular
expression containing (a?)(a?)(a?)…(a?). If n is the number of
repetitions of (a?), then in the worst case where the match fails the
DFA explores 2^n possibilities before failure. But in the course of
doing that it only enters n(n+1)/2 distinct states. This is because
there are eg. 3 ways of matching (a?)(a?)(a?) against “aa” (depending
on which of the three a’s you consider to be optional) and the DFA
attempts to complete the match for each of those possibilities, even
though the calculation is identical in each case. The only thing that
matters is how many (a?)s you have matched so far, and how many "a"s
you have matched them against, but the DFA isn’t smart enough to see
that it is repeating the same calculation 3 times.
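(A quick Ruby sketch of that blow-up; Ruby’s engine is a backtracking one, so on older interpreters the time roughly doubles with each extra (a?) - note that recent Rubies (3.2+) memoize and largely defuse this particular case:)

    require 'benchmark'

    [10, 20, 25].each do |n|
      re  = Regexp.new("a?" * n + "a" * n)   # e.g. /a?a?a?aaa/ for n = 3
      str = "a" * n
      puts format("n=%2d  %8.4fs", n, Benchmark.realtime { re =~ str })
    end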

The NFA wins because it does not search all 2^n combinations of
choices in the regexp, it searches the n(n+1)/2 possible states of the
regexp engine. The three different ways of matching (a?)(a?)(a?)
against “aa” all map to the same state of the regexp engine, so the
NFA checks this possibility only once.

So, the reason for the NFA’s success is not that it “tries all
possibilities at once” or that it “avoids backtracking”. Rather, it
reformulates the problem so that the search space is much smaller.
Equivalently, it implicitly caches the results of previous attempts so
that it can say “hmm, I’ve been here before and it didn’t work then,
so I won’t bother trying again”. If you accept the maxim that madness
is trying the same thing while hoping for different results, the DFA
loses because it is insane! ;-)

Regards,

Jeremy H.

On Fri, 13 Apr 2007 09:30:07 +0900, Tim X [email protected] wrote:

True, but I think that’s due to all the additional add-ons. The basic
underlying algorithm is/was fairly straightforward.

I was thinking in terms of the optimizations applied to make the
backtracking algorithm faster, though that’s a factor too…

However, I think even the Perl developers have realised that Perl’s RE support
was getting out of hand. Although I’ve not looked at it, I believe there are
significant changes in Perl 6 in this area. I seem to remember reading
something from Larry, where he said that Perl’s functionality in this area has
extended to the point that, in reality, you probably couldn’t call it regexps
anymore, but as that was the terminology people were used to, that’s what it
would be known as. Perl has gone past what was traditionally defined as
regexps.

The Perl 6 solution was to promote grammars and rules into first-class
abstractions in the language, on a level with classes and methods. So
it’s basically a general-purpose parsing framework now. I’m not sure if
that counts as less out-of-hand or not…

-mental

MenTaLguY [email protected] writes:

Perl’s implementation is not very simple, however…

True, but I think that’s due to all the additional add-ons. The basic
underlying algorithm is/was fairly straightforward. However, I think even the
Perl developers have realised that Perl’s RE support was getting out of hand.
Although I’ve not looked at it, I believe there are significant changes in
Perl 6 in this area. I seem to remember reading something from Larry, where he
said that Perl’s functionality in this area has extended to the point that, in
reality, you probably couldn’t call it regexps anymore, but as that was the
terminology people were used to, that’s what it would be known as. Perl has
gone past what was traditionally defined as regexps.

Tim

On 12.04.2007 18:31, MenTaLguY wrote:

On the other hand, those modern features which really do require backtracking (surprisingly few) also exceed the capabilities of finite automata (i.e. those features correspond to non-regular languages), so at that point the regular expression engines cannot be said to be using FAs at all, but a more general type of automaton.
Strictly speaking, yes. But AFAIK the basis is usually an NFA.

Btw, because of the nature of the engine there are some more differences
between DFA and NFA; for example, the order of expressions in an
alternation typically matters with an NFA engine. I often prefer that,
because it allows you to control the way the RX engine matches to a
certain extent, i.e. you can prioritize alternatives, which can be pretty
handy at times.
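(A small Ruby illustration of that ordering effect - with Ruby’s engine the leftmost alternative is tried first:)

    "foobar"[/foo|foobar/]   # => "foo"     (first alternative wins)
    "foobar"[/foobar|foo/]   # => "foobar"  (reordering changes which one matches)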

Again, though we must obviously keep backtracking in reserve for those patterns that need it, it is very wasteful not to use a pure FA approach for those patterns where non-regular features aren’t being used.

sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see “Mastering Regular Expressions” for details).

“Mastering Regular Expressions” is very good for learning the pragmatics of working with contemporary regular expression implementations, but it is not a good source for theory.

Since I had the theory at university, and at the moment I’m more
interested in using them than writing them, this is good enough for me.
:)

Kind regards

robert

On Fri, 2007-04-13 at 16:00 +0900, Robert K. wrote:

Btw, because of the nature of the engine there are some more differences
between DFA and NFA, for example typically order of expressions in an
alternative matters with a NFA engine. I prefer that often, because it
allows to control the way the RX engine matches to a certain extent,
i.e. you can prioritize alternatives which can be pretty handy at times.

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs really are
equivalent in expressiveness.

-mental

On 13.04.2007 22:02, MenTaLguY wrote:

On Fri, 2007-04-13 at 16:00 +0900, Robert K. wrote:

Btw, because of the nature of the engine there are some more differences
between DFA and NFA; for example, the order of expressions in an
alternation typically matters with an NFA engine. I often prefer that,
because it allows you to control the way the RX engine matches to a
certain extent, i.e. you can prioritize alternatives, which can be pretty
handy at times.

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs really are
equivalent in expressiveness.

How do you do that?

Kind regards

robert

On Sat, 2007-04-14 at 06:50 +0900, Robert K. wrote:

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches. NFAs and DFAs really are
equivalent in expressiveness.

How do you do that?

Let’s use a real simple example: /^(?:(ab)|(..))$/

Here, we want capturing subexpression 1 to have priority over capturing
subexpression 2. This time, I’ll use a [] notation to annotate
transitions with capture boundaries and priorities.

First, the NFA:

start: ('' [start $1]->group_1 | '' [start $2]->group_2),
group_1: ('a' ->group_1b),
group_1b: ('b' [end $1,priority 1]->final),
group_2: (any ->group_2b),
group_2b: (any [end $2,priority 0]->final)

Now, the equivalent DFA:

{start}: ('' [start $1,start $2]->{group_1,group_2}),
{group_1,group_2}: (
  'a' ->{group_1b,group_2b} |
  (any-'a') ->{group_2b}
),
{group_1b,group_2b}: (
  'b' [end $1]->{final} |
  (any-'b') [end $2]->{final}
)
{group_2b}: (any [end $2]->{final})

See how that works?
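(In Ruby, the observable effect of that priority is simply which capture group ends up filled:)

    "ab" =~ /^(?:(ab)|(..))$/   # $1 == "ab", $2 == nil  (group 1 wins)
    "xy" =~ /^(?:(ab)|(..))$/   # $1 == nil,  $2 == "xy"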

More complex examples can be attacked using the same principles, though
it gets painful to work by hand pretty fast.

-mental