Slow ruby regexes

emmanuel · April 16, 2007, 12:56pm

On 14.04.2007 15:23, MenTaLguY wrote:

The same approach works if you construct the DFA in advance or via
memoization of the Thompson NFA evaluation (which is probably the better
of the two for reasons already discussed).

Hmm… I was more after how you would notate the RX for a DFA based
engine to do the prioritization. I do not know a DFA based engine that
would actually allow to write a RX pattern to do that.

Of course, when building engines manually you can do a lot - and
probably a lot more than what actual RX engines allow. But since I’m
interested in using them vs. writing RX engines your solution was not
exactly what I expected.

Kind regards

robert

emmanuel · April 14, 2007, 3:23pm

To be fair, I oversimplified a bit because I couldn’t think of a better
example until just now. Though you can prune the DFA with priorities in
the NFA, an expression like /^(?:(abc)|(a)).*$/ still requires more
sophisticated handling, explicitly tracking alternative matches.

Here’s an implementation of a Thompson NFA which does Perl-style
capture:

http://swtch.com/~rsc/regexp/nfa-perl.y.txt

The same approach works if you construct the DFA in advance or via
memoization of the Thompson NFA evaluation (which is probably the better
of the two for reasons already discussed).

-mental

emmanuel · April 16, 2007, 8:55pm

On Mon, 16 Apr 2007 19:55:05 +0900, Robert K.
[email protected] wrote:

Here’s an implementation of a Thompson NFA which does Perl-style
capture:

http://swtch.com/~rsc/regexp/nfa-perl.y.txt

Hmm… I was more after how you would notate the RX for a DFA based
engine to do the prioritization.

We note choice points in our tags, and also tag those subsequent
transitions which would disprove a particular choice (for instance, for
/^(?:(abc)|(a)).*$/, the second transition in “axy” disproves the first
alternative, and the third transition in “abc” disproves the second
alternative).

Referring to those tags as we encounter them during FA evaluation, we
can then accumulate sets of plausible captures and cull those which are
disproven as we go.

I do not know a DFA based engine that would actually allow to write a RX
pattern to do that.

A non-backtracking NFA simulation (i.e. Thompson NFA) operates by
constructing and evaluating the corresponding DFA on-the-fly (each
unique set of NFA states corresponds to a DFA state). If you memoize
the construction of DFA states, then you’ve got a straight DFA engine
(the DFA is constructed lazily). Some RX engines have historically
built the entire DFA up front, but as previously discussed, that
approach is wasteful (many of the DFA states so constructed may never
get visited).

The same techniques (using tagged transitions to drive other code as the
FA is evaluated) are applicable in all three cases (non-backtracking NFA
simulation, dynamic DFA construction, and static DFA construction).

Of course, when building engines manually you can do a lot - and
probably a lot more than what actual RX engines allow.

Right, so I thought it appropriate to give an example of an actual RX
engine which allowed the same (and more) as my hand-worked examples by
extending the same class of techniques I had illustrated so far.

But since I’m interested in using them vs. writing RX engines your
solution was not exactly what I expected.

I’ll admit to being a little confused – you’re interested in using a RX
engine rather than implementing one, but you’d rather see a description
and/or hand-worked example of an implementation technique rather than a
link to a finished implementation using the same techniques under
discussion?

-mental

emmanuel · April 17, 2007, 4:10pm

On 16.04.2007 20:54, MenTaLguY wrote:

On Mon, 16 Apr 2007 19:55:05 +0900, Robert K. [email protected] wrote:

But since I’m interested in using them vs. writing RX engines your
solution was not exactly what I expected.

I’ll admit to being a little confused – you’re interested in using a RX engine rather than implementing one, but you’d rather see a description and/or hand-worked example of an implementation technique rather than a link to a finished implementation using the same techniques under discussion?

LOL - now I’m confused, too. I was looking for something like (made
up fake): “To prioritize branches of an alternative in sed you do sed -e
‘s/foo(!2bar|!1baz)/yummy\1/g’.” Of course this does not work with sed
(at least no implementation of sed that I know). Are there other tools
that allow to embed something into a RX that will do this
prioritization?

Kind regards

robert

emmanuel · April 17, 2007, 5:44pm

On Tue, 17 Apr 2007 23:10:11 +0900, Robert K.
[email protected] wrote:

LOL - now I’m confused, too. I was looking for something like (made
up fake): “To prioritize branches of an alternative in sed you do sed -e
‘s/foo(!2bar|!1baz)/yummy\1/g’.” Of course this does not work with sed
(at least no implementation of sed that I know). Are there other tools
that allow to embed something into a RX that will do this prioritization?

OHHHHHHHH.

| is a deterministic choice operator, so the branches are prioritized by default – earlier alternatives take precedence over later ones. Just write your RX alternatives from highest to lowest priority.

Does that answer your question?

-mental

emmanuel · April 17, 2007, 8:19pm

On Wed, 18 Apr 2007 01:10:07 +0900, Robert K.
[email protected] wrote:

I was under the impression that this precisely is a point of difference
between NFA and DFA based engines: sed is DFA based and for all I know
order in an alternative does not matter while it does matter for the
typical NFA engine. Did I get this wrong?

Well, for historical reasons, sed implements only “basic” (versus
“extended”) regular expressions, so it does not have explicit
alternation via the | operator. However, we can still look at it for an
example of how a DFA-based engine can handle choosing between
alternatives.

Consider:

echo “aaaaa” | sed -e ‘s/^(a*)(a*)$/\1,\2/’

There are six possible ways to match this regular expression:

aaaaa,
aaaa,a
aaa,aa
aa,aaa
a,aaaa
,aaaaa

sed always chooses the first alternative. Choice between alternatives
for explicit alternation would be handled the same way.

-mental

emmanuel · April 18, 2007, 9:56am

On 17.04.2007 20:17, MenTaLguY wrote:

On Wed, 18 Apr 2007 01:10:07 +0900, Robert K. [email protected] wrote:

I was under the impression that this precisely is a point of difference
between NFA and DFA based engines: sed is DFA based and for all I know
order in an alternative does not matter while it does matter for the
typical NFA engine. Did I get this wrong?

Well, for historical reasons, sed implements only “basic” (versus “extended”) regular expressions, so it does not have explicit alternation via the | operator. However, we can still look at it for an example of how a DFA-based engine can handle choosing between alternatives.

Ah! I reckon this a specialty of GNU sed:

09:45:46 [~]: sed -ne ‘s#a|b#A#gp’ <<X

aaaa
bbbb
X
AAAA
AAAA
09:46:02 [~]:

On an old Solaris box I get

bash-2.03# sed -ne ‘s#a|b#A#gp’ <<X

aaaaa
bbbbb
X
bash-2.03#

Of course GNU sed is more capable than “standard” sed…

a,aaaa
,aaaaa

sed always chooses the first alternative. Choice between alternatives for explicit alternation would be handled the same way.

Kind regards

robert

emmanuel · April 17, 2007, 6:10pm

On 17.04.2007 17:44, MenTaLguY wrote:

On Tue, 17 Apr 2007 23:10:11 +0900, Robert K. [email protected] wrote:

LOL - now I’m confused, too. I was looking for something like (made
up fake): “To prioritize branches of an alternative in sed you do sed -e
‘s/foo(!2bar|!1baz)/yummy\1/g’.” Of course this does not work with sed
(at least no implementation of sed that I know). Are there other tools
that allow to embed something into a RX that will do this prioritization?

OHHHHHHHH.

I hope it doesn’t hurt too much.

| is a deterministic choice operator, so the branches are prioritized by default – earlier alternatives take precedence over later ones. Just write your RX alternatives from highest to lowest priority.

Does that answer your question?

Maybe.

I was under the impression that this precisely is a point of difference
between NFA and DFA based engines: sed is DFA based and for all I know
order in an alternative does not matter while it does matter for the
typical NFA engine. Did I get this wrong?

Kind regards

robert