Forum: Ruby - Slow ruby regexes

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Emmanuel (Guest)
on 2007-04-10 19:12
Hello, I've been reading this article, which has a few benchmarks
(including a Ruby-related one) and many theoretical explanations of
regex-matching algorithms:

http://swtch.com/~rsc/regexp/regexp1.html

I became a little worried, since I'm making heavy use of regexes in a
program of mine that I'll shortly have to run on each of thousands of
text files.

* Does anyone know if what is said applies to Ruby 1.8.6?
* If so, is it going to change?

Thank you!
MenTaLguY (Guest)
on 2007-04-10 19:57
(Received via mailing list)
On Wed, 11 Apr 2007 02:12:09 +0900, Emmanuel <emmanuel@lijasdoblea.com>
wrote:
> I became a little worried, since I'm making heavy use of regexes in a
> program of mine that I'll shortly have to run on each of thousands of
> text files.

Ruby's regexp implementation is a backtracking matcher similar to Perl's
-- while the article's criticism is valid, you should be fine if you
write your regular expressions to minimize the amount of backtracking
required.

(The author's point is that such care is unnecessary with an NFA-based
regexp implementation.)

-mental
SonOfLilit (Guest)
on 2007-04-10 19:59
(Received via mailing list)
Read Wikipedia on regexes. It explains better than I can why one is used
over the other (more power).

For example, a regex was shown in this list lately that tells you if a
number is prime. That needs backtracking.
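For reference, the prime test in question is usually written like this (reconstructed here from the well-known trick, since the original post isn't quoted); note that it relies on a backreference (\1), i.e. exactly the non-regular feature under discussion:

```ruby
# A unary number (a string of n "1"s) matches iff n is 0, 1, or
# composite: the group captures a candidate divisor, and \1+ requires
# the rest of the string to consist of copies of it.
def prime?(n)
  ("1" * n) !~ /\A1?\z|\A(11+?)\1+\z/
end

prime?(7)  # => true
prime?(9)  # => false (3 * 3)
```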


Aur
MenTaLguY (Guest)
on 2007-04-10 20:04
(Received via mailing list)
On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com>
wrote:
> Read wikipedia on Regex. It explains better than I can why one is used
> over the other (more power).

It's unfortunate, though, that an NFA-based approach isn't selectively
used for those expressions which don't use features like backreferences
(i.e. the majority of regular expressions people write).  The
performance is so much better it's not funny.

-mental
Robert Dober (Guest)
on 2007-04-10 20:58
(Received via mailing list)
On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 02:59:29 +0900, SonOfLilit <sonoflilit@gmail.com> wrote:
> > Read wikipedia on Regex. It explains better than I can why one is used
> > over the other (more power).
>
> It's unfortunate, though, that an NFA-based approach isn't selectively used for those
> expressions which don't use features like backreferences (i.e. the majority of regular
> expressions people write).  The performance is so much better it's not funny.
>

Well, the good news is that Ruby 2 regexes will be much better
performing; I do not know if they will be NFA-based, though.
The bad news is that the power of regexes will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

But you are right, of course, that Ruby's current regex implementation
is not state of the art; useful, very useful nevertheless.

Cheers
Robert
MenTaLguY (Guest)
on 2007-04-10 21:16
(Received via mailing list)
On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober"
<robert.dober@gmail.com> wrote:

> The bad news is that the power of regexs will just not let you escape
> with bad design all the time, so knowing a lot about regexes is a
> necessary thing to use them well.

No, the bad news is that the bad design of most "regex" implementations
will not let you escape with using "regexes" as if they were true
regular expressions; knowing a lot about the implementation of the regex
engine is a necessary thing to use them well.

> But you are right of course that Ruby's current Regex implementation
> is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s.  Backreferences are nice, but
in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are
useful sometimes), but rather that NFAs should be used for evaluating
the majority of "regexes" which don't use non-regular features.

-mental
Marcin Raczkowski (Guest)
on 2007-04-11 09:44
(Received via mailing list)
On Tuesday 10 April 2007 19:14, MenTaLguY wrote:
> On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> > is not state of the art, useful, very useful nevertheless.
>
> NFAs were state-of-the-art in the 1960s.  Backreferences are nice, but in
> other respects we've actually regressed since then.
>
> I'm not suggesting we abandon backreferences and so forth (which are useful
> sometimes), but rather that NFAs should be used for evaluating the majority
> of "regexes" which don't use non-regular features.
>
> -mental

Well, the topic is rather interesting; I'll try to check it soon, and
I'll try to write a Ruby interface for the C library that the author of
the article references. I was looking for a good case to test embedding
C into Ruby.
I'm not promising anything, though :)

Regexps are a great tool for text processing, but serious log analyzers
(like the one my company uses) are written in C and based on lots of
low-level string manipulation, raw binary access to the DB, et cetera;
not to mention that Ruby is much, much slower than C anyway.

Greets
Marcin Raczkowski
Robert Dober (Guest)
on 2007-04-11 09:51
(Received via mailing list)
On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
Robert Dober (Guest)
on 2007-04-11 09:54
(Received via mailing list)
oops wrong button here :(

On 4/11/07, Robert Dober <robert.dober@gmail.com> wrote:
> On 4/10/07, MenTaLguY <mental@rydia.net> wrote:
> > On Wed, 11 Apr 2007 03:56:28 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> >
> > > The bad news is that the power of regexs will just not let you escape
> > > with bad design all the time, so knowing a lot about regexes is a
> > > necessary thing to use them well.
> >
> > No, the bad news is that the bad design of most "regex" implementations will not let
> > you escape with using "regexes" as if they were true regular expressions; knowing a lot
> > about the implementation of the regex engine is a necessary thing to use them well.
That might be; I am not qualified to discuss this issue.
What I actually meant is that, given the need for backtracking
abilities, it will just not be possible to "guess" what the user meant
when he/she uses too general an expression, and I am talking about NFAs
here.
Well, just wanted to clear that point up.
> >
> > > But you are right of course that Ruby's current Regex implementation
> > > is not state of the art, useful, very useful nevertheless.
> ><snip>
> >
> >
> >
>
Cheers
MenTaLguY (Guest)
on 2007-04-11 17:35
(Received via mailing list)
On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober"
<robert.dober@gmail.com> wrote:
> What I actually meant is that given the need for backtracking
> abilities it will just not be possible to "guess" what the user meant
> when he/she uses too general an expression and I am talking about NFA
> here.

I'm a little confused -- the regexp features which require backtracking
correspond to well-defined elements of the syntax.  There isn't any need
to guess whether the user requires backtracking or not, one can simply
see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs),
backtracking and NFA-based evaluation yield equivalent results, except
that the NFA approach is O(n) in the number of characters, rather than
backtracking's worst-case exponential complexity.

-mental
Robert Dober (Guest)
on 2007-04-11 20:28
(Received via mailing list)
On 4/11/07, MenTaLguY <mental@rydia.net> wrote:
> On Wed, 11 Apr 2007 16:53:26 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> > What I actually meant is that given the need for backtracking
> > abilities it will just not be possible to "guess" what the user meant
> > when he/she uses too general an expression and I am talking about NFA
> > here.
>
> I'm a little confused -- the regexp features which require backtracking correspond to
> well-defined elements of the syntax.  There isn't any need to guess whether the user
> requires backtracking or not, one can simply see whether such expressions are present in a
> particular regexp.
>
> For "pure" regular expressions (which can be matched by NFAs), backtracking and
> NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the
> number of characters, rather than backtracking's worst-case exponential complexity.
>
No, it must be me who is confused. I thought that backtracking cannot
be avoided in regexes like e.g. %r{[a]*.*a} when trying for a greedy
match; let me see, the NFA should be something like

 X   start    second
 a   start    term
 .   second   second

how could backtracking be avoided when no 'a' is found in the <second>
state? The DFA of course could never backtrack, but is there a DFA for
this regex?

Hope to learn a little bit more about what I seem to have forgotten ;)


Cheers
Robert
Robert Dober (Guest)
on 2007-04-11 20:30
(Received via mailing list)
<snip>

stupid me, make this simply /a*.*a/, sorry
Ola Bini (Guest)
on 2007-04-11 20:51
(Received via mailing list)
Robert Dober wrote:
> <snip>
>
> stupid me make this simply /a*.*a/ sorry
Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy; just look up the
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the
main problem that is currently solved with backtracking implementations.

What MenTaLguY said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have a Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need
it.
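A toy version of that dispatch might look like the following. The heuristic is entirely mine and deliberately rough (it only looks for \1-style backreferences and ignores named references, lookbehind, and other non-regular features); a real engine would decide on the parse tree, not on the source string:

```ruby
# Scan the pattern source for an unescaped backreference (\1..\9).
# Walking escape pairs first means a literal backslash followed by a
# digit ("\\1") is not mistaken for a backreference.
def needs_backtracking_engine?(pattern_source)
  pattern_source.scan(/\\./).any? { |esc| esc =~ /\A\\[1-9]\z/ }
end

needs_backtracking_engine?('(11+?)\1+')  # => true
needs_backtracking_engine?('a*.*a')      # => false
```

A hybrid engine would route the first pattern to its backtracking matcher and the second to a Thompson-style matcher.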

Cheers

--
 Ola Bini (http://ola-bini.blogspot.com)
 JvYAML, RbYAML, JRuby and Jatha contributor
 System Developer, Karolinska Institutet (http://www.ki.se)
 OLogix Consulting (http://www.ologix.com)

 "Yields falsehood when quined" yields falsehood when quined.
Robert Dober (Guest)
on 2007-04-11 21:48
(Received via mailing list)
On 4/11/07, Ola Bini <ola.bini@gmail.com> wrote:
> Robert Dober wrote:
> > <snip>
> >
> > stupid me make this simply /a*.*a/ sorry
> Actually, to make it non-backtracking, you just need to update all the
> possible states at the same time. It's quite easy. Just look for
> Thompson NFA for a closer look at it.
> The problem is that it gets problematic with backREFERENCES. That's the
> main problem that is currently solved with backtracking implementations.
>
Thx Ola, I'll look it up.
> What Mentalguy said is basically that there should be better ways of
> implementing regex engines with backreferences. Since you don't use
> backreferences so often, you could have Thompson NFA for those cases
> where it's fast, and switch to a backtracking engine only when you need it.
Yeah I was not concerned with the backreferences, just trying to
understand the basic stuff ;)
>
> Cheers
>
> --
Robert
Robert Dober (Guest)
on 2007-04-11 22:02
(Received via mailing list)
On 4/11/07, Robert Dober <robert.dober@gmail.com> wrote:
> On 4/11/07, Ola Bini <ola.bini@gmail.com> wrote:
> >
> Thx Ola, I'll look it up.
Done ;)
just wanted to share the link
http://swtch.com/~rsc/regexp/regexp1.html
Robert
MenTaLguY (Guest)
on 2007-04-11 22:52
(Received via mailing list)
On Thu, 12 Apr 2007 03:27:04 +0900, "Robert Dober"
<robert.dober@gmail.com> wrote:
> No it must be me who is confused, I thought that backtracking cannot
> be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
> match

Using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this
(after collapsing most of the epsilon transitions introduced by
Thompson's algorithm):

 start: ('' ->a_star | '' ->any_star),
 a_star: ('a' ->a_star | any ->any_star),
 any_star: (any ->any_star | 'a' =>final),
 final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which
sets a counter (initialized to -1) to the current input position if that
is larger than the counter's current value.  When we have explored all
possible matches, the value of the counter will be the position of the
last character in the (greedy) match.
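The machine above can be simulated directly in Ruby by carrying a set of live states through the input, one character at a time. This is a toy sketch of the idea, hard-coded to this one NFA (state names follow the post; the method name is mine):

```ruby
# Set-of-states simulation of the NFA for /\Aa*.*a/.  The "labeled"
# 'a' => final transition records the input position, so after one
# linear pass the counter holds the end of the greedy match.
def greedy_match_end(str)
  states = [:a_star, :any_star]  # epsilon closure of the start state
  match_end = nil
  str.length.times do |i|
    c = str[i, 1]
    nxt = []
    states.each do |s|
      case s
      when :a_star
        nxt << :a_star if c == "a"  # 'a' -> a_star
        nxt << :any_star            # any -> any_star
      when :any_star
        nxt << :any_star            # any -> any_star
        if c == "a"                 # 'a' => final (labeled transition)
          nxt << :final
          match_end = i
        end
      when :final
        nxt << :final               # any -> final
      end
    end
    states = nxt.uniq
  end
  match_end  # index of the last matched character, or nil
end

greedy_match_end("xayaz")  # => 3
```

The result agrees with the backtracking engine: "xayaz"[/\Aa*.*a/] is "xaya", whose last character sits at index 3 -- but the simulation never revisits a character.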

> The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA.  Each state in the resulting DFA
corresponds to a superposition of states in the original NFA (again in
pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a,
b, and c):

 {start,a_star,any_star}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star}
 ),

 {any_star}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star}
 ),

 {a_star,any_star,final}: (
   'a' =>{a_star,any_star,final} |
   (any-'a') ->{any_star,final}
 ),

 {any_star,final}: (
   'a' =>{any_star,final} |
   (any-'a') ->{any_star,final}
 )

The resulting FA is deterministic, and {a_star,any_star,final} and
{any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional
counters and labeled transitions with priorities to resolve
greedy/non-greedy ambiguities (as is often done in Ragel to resolve
ambiguities).  Prioritized transitions are also necessary for regexps
which are not anchored at the beginning.

-mental
Robert Klemme (Guest)
on 2007-04-12 09:11
(Received via mailing list)
On 11.04.2007 22:51, MenTaLguY wrote:
>  final: (any ->final)
>
> Here, -> is a regular transition, and => is a labeled transition which sets a counter
> (initialized to -1) to the current input position if that is larger than the counter's
> current value.  When we have explored all possible matches, the value of the counter will
> be the position of the last character in the (greedy) match.
>
>> The DFA of course could never backtrack but is there a DFA for this regex?
>
> Every NFA can be rewritten as a DFA.  Each state in the resulting DFA corresponds to a
> superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes
> the superposition of NFA states a, b, and c):

While this is true in theory, there are a number of modern RX features
that are extremely difficult or impossible to do with a DFA.  So as it
stands now, for practical purposes, most modern RX engines are NFA or
hybrids, sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see "Mastering Regular Expressions" for details).

Kind regards

  robert
Robert Klemme (Guest)
on 2007-04-12 09:12
(Received via mailing list)
On 10.04.2007 21:14, MenTaLguY wrote:
>
> NFAs were state-of-the-art in the 1960s.  Backreferences are nice, but in other respects
> we've actually regressed since then.
>
> I'm not suggesting we abandon backreferences and so forth (which are useful sometimes),
> but rather that NFAs should be used for evaluating the majority of "regexes" which don't
> use non-regular features.

I think you got NFA and DFA mixed up: DFAs are the ones that do not
suffer backtracking issues.

  robert
Robert Dober (Guest)
on 2007-04-12 10:00
(Received via mailing list)
On 4/11/07, MenTaLguY <mental@rydia.net> wrote:
>  final: (any ->final)
>  ),
>
>
>
Thanks for the explanation :). I guess I learnt this algorithm once
upon a time :(.
Anyway, relearning is learning too.
Now there does not seem to be any practical need to do this
transformation, though, as the Thompson NFA algorithm kind of does the
same thing at runtime, not creating states that are sets of states but
just keeping a set of states.
Thanks again for your time and patience, and yes, I have to change my
initial statement:

there is no bad news :)) [ still ignoring backreferences ]

Cheers
Robert
Tim X (Guest)
on 2007-04-12 12:57
(Received via mailing list)
Emmanuel <emmanuel@lijasdoblea.com> writes:

> http://swtch.com/~rsc/regexp/regexp1.html
>
> I became a little worried, since I'm making heavy use of regexes in a
> program of mine that I'll shortly have to run on each of thousands of
> text files.

I don't know about proposed plans for the regexp engine for Ruby, but I
would say not to be overly concerned at this stage.

From reading that article, one thing that I noted was that even the
author acknowledged that the performance of regular expressions is
significantly affected by how the regexps are defined. If you keep in
mind the basic concepts and create your regexps accordingly, you will
likely get performance differences that even outweigh the differences
between the two approaches outlined in that article; or, putting it
another way, poorly specified REs will perform badly regardless of the
algorithm used. What is important is to do things like anchoring, using
as precise a specification as possible, and taking advantage of any
knowledge regarding the data you are processing.

I've not done large (i.e. gigabytes of data) processing with REs under
Ruby, but I have done so with Perl, and the performance was quite
acceptable.

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the Ruby RE engine for your
task may fall well within your performance requirements.
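Measuring that is cheap with the standard library; a sketch, with invented sample data:

```ruby
require 'benchmark'

# Time the actual pattern against a representative sample before
# deciding anything (pattern and text here are made up for the demo).
pattern = /\berror\b/i
sample  = "2007-04-10 12:00:00 something happened\n" * 50_000
hits    = 0

seconds = Benchmark.realtime do
  sample.each_line { |l| hits += 1 if l =~ pattern }
end

# Only worry about the engine if `seconds` is actually a problem
# for your workload.
```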

The other way to look at this is to consider what you would do or use
as an alternative. I've also used a lot of Tcl, and according to that
article, Tcl uses the faster algorithm, yet I've never really noticed
any great performance difference between using Perl or Tcl. So your
choice is to continue and see if there is a problem, and then deal with
it if/when it occurs, or to jump now and start writing your program in
Tcl or awk, or using grep (maybe even calling grep from within Ruby,
but I suspect all performance gains would be lost in passing the data
between Ruby and the grep process).

I've seen many people make a decision regarding the choice of
technology because they have read somewhere that x is faster than y.
Often, I've then seen something created which is flaky, takes 10x as
long to develop or simply doesn't work, when in reality the aspect they
were concerned about wasn't even relevant to the situation they were
working in. Recently, I had an argument with one of our sysadmins who
wanted to use ReiserFS rather than Ext3 as the file system on a new
server. His argument was that ReiserFS had better performance
characteristics and the system would perform better. My argument was
that file system performance was not a significant bottleneck for the
server, and that we would be better off sticking with a file system
that had a better track record, more robust tools and represented a
technology more sysadmins were familiar with. I lost the argument,
initially. The server was configured with ReiserFS, and after a few
months we came in one morning to find massive disk corruption problems.
The server was changed to Ext3 and has not missed a beat since. More to
the point, the performance using Ext3 is still well within acceptable
performance metrics. My point isn't that ReiserFS may not be a good
file system; it probably is, and possibly is even "better" than Ext3.
My point is that speed is not the only issue to consider.

Something which the article doesn't address is complexity and
correctness. The algorithm used by Perl et al. may not be fast compared
to the alternative, but it is relatively simple. As important as
performance is correctness: an extremely fast RE engine is not a better
solution if it is only correct 95% of the time, or is so complex that
it is difficult to maintain without bugs creeping in after version
updates etc.

Tim
Tim X (Guest)
on 2007-04-12 13:11
(Received via mailing list)
"Robert Dober" <robert.dober@gmail.com> writes:

>>
>> For "pure" regular expressions (which can be matched by NFAs), backtracking and
>> NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the
>> number of characters, rather than backtracking's worst-case exponential complexity.
>>

Guys, I think you're possibly mixing up backtracking and
backreferences. It is backreferences (referencing an earlier match,
often to use as part of the definition of a later match) that an NFA is
not able to do. Both approaches use backtracking. The NFA approach is
able to parallelise the choices, reducing the overheads of
backtracking. So, given n different possible paths to follow, instead
of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I think that's
what the article was saying from a very quick read, and from what I can
remember from my computability course and finite automata (oh so many
years ago now!).

Tim
Robert Dober (Guest)
on 2007-04-12 13:21
(Received via mailing list)
On 4/12/07, Tim X <timx@nospam.dev.null> wrote:
> "Robert Dober" <robert.dober@gmail.com> writes:
>
<snip>
>
> Guys, I think you're possibly mixing up backtracking and backreferences. It is
> backreferences (referencing an earlier match, often to use as part of the
> definition of a later match) that an NFA is not able to do.
No, Tim, I was not mixing it up; I said that I was not interested in
the backreferencing history.
You see, I am just forgetting the basic stuff, but Ola and Mental
kindly gave me a refresher.
> both approaches use
> backtracking.
The DFA cannot use backtracking, surely I am missing something *again*!
> The NFA approach is able to parallelise the choices, reducing the
> overheads of backtracking. So, given n different possible paths to follow,
> instead of, in the worst case, trying n-1 paths and having to backtrack after
> each one, you can try all n paths at once. At least, I htink thats what the
> article was saying form a very quick read and from what I can remember from my
> computability course and finite automata (oh so many years ago now!).
For me it is 25 years and for you?
Yeah I understood that in the same way, actually it is quite simple :)
>
> Tim
> --
> tcross (at) rapttech dot com dot au
>
>
Cheers
Robert
Tim X (Guest)
on 2007-04-12 13:45
(Received via mailing list)
Robert Klemme <shortcutter@googlemail.com> writes:

>>   start: ('' ->a_star | '' ->any_star),
> While this is true in theory there is a number of modern RX features that are
> extremely difficult or impossible to do with a DFA.  So as it stands now for
> pratical purposes most modern RX engines are NFA or hybrids, sometimes (Perl)
> with heavy optimizations to prevent certain ugly effects (see "Mastering
> Regular Expressions" for details).
>
> Kind regards
>

To me, this is possibly a much more important point than just being
concerned over raw performance. While an NFA may be *a lot* better
performing in the majority of cases, if we want to add in the extra
power of backreferences and other 'nice' (possibly considered advanced)
features, things begin to get a lot more complicated. Unfortunately,
with complexity we usually find higher levels of bugs in the final
implementation.

I'm not saying that things should be kept super simple just to ensure
correctness at the cost of performance. However, we should consider
complexity and correctness/maintainability of implementation as well as
just pure performance. For me, and possibly for many others, I find
this particularly important, as performance of regular expressions has
never been a critical issue with any of the work I've had to use them
for (to what extent this is just a reflection of the types of problems
I deal with, I don't know). In fact, for everyone I've ever helped with
sorting out performance problems with their regexps, it has turned out
to be badly defined regexps that were the issue, rather than poor
performance of the engine or underlying algorithm.

Maybe the real question we should be asking is: "If Ruby had a regexp
engine which was 10, 100, 1000, 1000000 times faster, would that allow
us to solve problems which we cannot solve now in any practical way?"

In principle, I think having a regexp engine which adopts the DFA/NFA
algorithm when backreferences are not required, and uses a slower
backtracking approach when they are, is a good approach, but only if we
can ensure a stable and correct implementation.

Tim
Tim X (Guest)
on 2007-04-12 14:01
(Received via mailing list)
"Robert Dober" <robert.dober@gmail.com> writes:

> No Tim I was not mixing it up, I said that I was not interested in the
> backreferencing history.
> You see I am just forgetting the basic stuff but Ola and Mental kindly
> gave me a refresher.
>> both approaches use
>> backtracking.
> The DFA cannot use backtracking, surely I am missing something *again*!
Doh!

>>The NFA approach is able to parallelise the choices, reducing the
>> overheads of backtracking. So, given n different possible paths to follow,
>> instead of, in the worst case, trying n-1 paths and having to backtrack after
>> each one, you can try all n paths at once. At least, I htink thats what the
>> article was saying form a very quick read and from what I can remember from my
>> computability course and finite automata (oh so many years ago now!).
> For me it is 25 years and for you?
Nearly the same: the computability course was 82/83, but I did go back
and do some postgrad in the early 90s, so my knowledge cache should
have been fresher! Guess it's that bloody LRU memory cache algorithm
trying to make room again!

> Yeah I understood that in the same way, actually it is quite simple :)
>>

It's funny; back in 80..83, the courses I enjoyed the most were
computability, parallel processing, compilers and assembly. Now, when I
think about anything in those areas, I just have vague recollections
and no real facts. Since those days, work has been about databases,
dull user interface programming, 4GLs, web technology etc. Up until 4
months ago, it was an increasing deluge of project management
methodologies, bullshit bingo management speak, HR, budgets and sales.
Now I'm getting back to bare-bones technology, and it's scary how much
I've forgotten! Still, at least I'm now looking at Ruby (amongst many
other things!).

Tim
Robert Dober (Guest)
on 2007-04-12 14:15
(Received via mailing list)
On 4/12/07, Tim X <timx@nospam.dev.null> wrote:
> and sales. Now I'm getting back to bare boned technology and its scary how much
> I've forgotten! Still, at least I'm now looking at Ruby (amongst many other
> things!).
>
Amen, but I think one shall be optimist ;)
> Tim
>
>
> --
> tcross (at) rapttech dot com dot au
>
>
Robert
MenTaLguY (Guest)
on 2007-04-12 18:12
(Received via mailing list)
On Thu, 12 Apr 2007 16:59:47 +0900, "Robert Dober"
<robert.dober@gmail.com> wrote:
> Now there does not seem to be any practical need to do this
> transformation though as the Thompson NFA algorithm kinda does the
> same thing in runtime, not creating states that are sets of states but
> just keeping a set of states.

The Thompson algorithm essentially just does the NFA -> DFA conversion
lazily.

-mental
MenTaLguY (Guest)
on 2007-04-12 18:32
(Received via mailing list)
On Thu, 12 Apr 2007 16:10:06 +0900, Robert Klemme
<shortcutter@googlemail.com> wrote:
>> Every NFA can be rewritten as a DFA.  Each state in the resulting DFA
>> corresponds to a superposition of states in the original NFA
>
> While this is true in theory there is a number of modern RX features
> that are extremely difficult or impossible to do with a DFA.  So as it
> stands now for pratical purposes most modern RX engines are NFA or
> hybrids,

It's important not to confuse NFAs with backtracking.  For instance,
Thompson's algorithm uses NFAs, but it evaluates by rewriting them
on-the-fly as DFAs, not by performing a backtracking search.  Again,
NFAs and DFAs are equivalent, even if the naive algorithm (backtracking)
for evaluating an NFA is less efficient than the naive algorithm for
evaluating a DFA.

On the other hand, those modern features which really do require
backtracking (surprisingly few) also exceed the capabilities of finite
automata (i.e. those features correspond to non-regular languages), so
at that point the regular expression engines cannot be said to be using
FAs at all, but a more general type of automaton.

Again, though we must obviously keep backtracking in reserve for those
patterns that need it, it is very wasteful not to use a pure FA approach
for those patterns where non-regular features aren't being used.

> sometimes (Perl) with heavy optimizations to prevent certain
> ugly effects (see "Mastering Regular Expressions" for details).

"Mastering Regular Expressions" is very good for learning the pragmatics
of working with contemporary regular expression implementations, but it
is not a good source for theory.

-mental
MenTaLguY (Guest)
on 2007-04-12 18:44
(Received via mailing list)
On Thu, 12 Apr 2007 19:55:09 +0900, Tim X <timx@nospam.dev.null> wrote:
> There is no point worrying about optimisation until you know there is a
> performance issue. For all you know, using the ruby RE engine for your
> task may fall well within your performance requirements.

This is very, very true -- don't optimize until you measure.  For the
sake of the original poster, I want to underscore this point.  Chances
are the Ruby regexp engine is way more than adequate for his purposes.

> The algorithm used by Perl et. al. may not be fast compared to the
> alternative, but it is relatively simple. As important to performance is correctness.

Perl's implementation is not very simple, however...

-mental
David Balmain (Guest)
on 2007-04-12 19:25
(Received via mailing list)
On 4/13/07, MenTaLguY <mental@rydia.net> wrote:
> On Thu, 12 Apr 2007 16:59:47 +0900, "Robert Dober" <robert.dober@gmail.com> wrote:
> > Now there does not seem to be any practical need to do this
> > transformation though as the Thompson NFA algorithm kinda does the
> > same thing in runtime, not creating states that are sets of states but
> > just keeping a set of states.
>
> The Thompson algorithm essentially just does the NFA -> DFA conversion lazily.

This would be true if the Thompson algorithm were caching the states as
it went. If I recall correctly (it's been a while since I read the
article) it doesn't cache the states, and Russ Cox's code in the
article definitely doesn't. So basically, the difference between
Thompson's parallel NFA evaluation and the "naive" backtracking NFA
evaluation is breadth-first search vs depth-first search. The reason
breadth-first search is so much faster on this problem is that you
avoid repeatedly going down the same search path. Given an unlimited
amount of memory and using dynamic programming techniques you could
presumably make the backtracking NFA evaluation run as fast as
Thompson's algorithm. In fact I read somewhere that the Perl regex
engine has started to use dynamic programming, obviously without great
success.
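
The depth-first blowup is easy to provoke.  A minimal sketch (pattern and
sizes made up): a plain depth-first backtracker takes exponential time on
the first pattern, while an engine that caches visited states -- the
dynamic-programming idea just mentioned -- stays fast on both:

```ruby
# Nested quantifiers let a backtracking engine split the "a"s between
# the inner and outer "+" in exponentially many ways, and a depth-first
# search explores all of them before reporting failure.
input = "a" * 18 + "c"

pathological = /\A(a+)+b\z/  # same language as below, many more paths
equivalent   = /\Aa+b\z/     # no nested quantifier

p pathological.match?(input)  # => false
p equivalent.match?(input)    # => false
```

Both answers are identical; only the amount of backtracking differs.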

So there are going to be some situations where a DFA will be a lot
faster than a Thompson NFA. But, having said that, I think it is
important to note (and I don't think anyone else has mentioned this)
that it is not always better to compile a regex into a DFA, even if it
is possible. In some instances it will take longer to compile the
regex into a DFA than it will take to evaluate the simple NFA
representation. I don't have any hard data to show this but I'm pretty
sure there will be some cases where Perl's and Ruby's regex engines
will easily beat the awk and grep engines.

By the way, I highly recommend people check out Ville Laurikari's
regex engine TRE;

    http://laurikari.net/tre/

While the predictable worst-case performance is nice, the really cool
thing that I like about this project is the approximate matching. This
is something I'd really like to have in Ruby.
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-12 19:47
(Received via mailing list)
On Fri, 13 Apr 2007 02:25:15 +0900, "David Balmain"
<dbalmain.ml@gmail.com> wrote:
> This would be true if the Thompson algorithm was caching the states as
> it went. If I recall correctly (it's been a while since I read the
> article) it doesn't cache the states and Russ Cox's code in the
> article definitely doesn't.

There is a section in his article entitled "Caching the NFA to build a
DFA" with code doing exactly that.

> In fact I read somewhere that the Perl regex engine has started to use dynamic
> programming, obviously without great success.

It's mentioned in his article as well.

> In some instances it will take longer to compile the
> regex into a DFA than it will take to evaluate the simple NFA
> representation.

Namely, when most of the states in the NFA will not be visited for a
particular input.  That is why the DFA construction is done lazily.

> By the way, I highly recommend people check out Ville Laurikari's
> regex engine TRE;
>
>     http://laurikari.net/tre/
>
> While the predictable worst-case performance is nice, the really cool
> thing that I like about this project is the approximate matching. This
> is something I'd really like to have in Ruby.

That's really cool.

-mental
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-12 19:53
(Received via mailing list)
On Fri, 13 Apr 2007 02:47:34 +0900, MenTaLguY <mental@rydia.net> wrote:
>> In some instances it will take longer to compile the
>> regex into a DFA than it will take to evaluate the simple NFA
>> representation.
>
> Namely, when most of the states in the NFA will not be visited for a
> particular input.  That is why the DFA construction is done lazily.

Sorry, I meant to say that it's when most of the states in the _DFA_
will not be visited for a particular input.

Generally speaking, you can visit all the states in the NFA without
necessarily visiting all the states in the DFA.

-mental
15f985dc1d6a150767736a0c65e9ef35?d=identicon&s=25 Jeremy Henty (Guest)
on 2007-04-12 19:56
(Received via mailing list)
On 2007-04-12, Tim X <timx@nospam.dev.null> wrote:

> The NFA approach is able to parallelise the choices, reducing the
> overheads of backtracking. So, given n different possible paths to
> follow, instead of, in the worst case, trying n-1 paths and having
> to backtrack after each one, you can try all n paths at once.

I think this (and much of the rest of the discussion) rather misses
the point.  The DFA algorithm performs poorly in some circumstances
because it may backtrack to the *same* state *many* times.  The NFA
algorithm wins in such cases because it generates a list of all
feasible states before attempting to progress any of them, and it is
therefore easy to optimise by eliminating duplicates from the list.

Here's a concrete example.  Suppose we are trying to match a regular
expression containing (a?)(a?)(a?)...(a?).  If n is the number of
repetitions of (a?), then in the worst case where the match fails the
DFA explores 2^n possibilities before failure.  But in the course of
doing that it only enters n(n+1)/2 distinct states.  This is because
there are e.g. 3 ways of matching (a?)(a?)(a?) against "aa" (depending
on which of the three a's you consider to be optional) and the DFA
attempts to complete the match for *each* of those possibilities, even
though the calculation is identical in each case.  The only thing that
matters is how many (a?)s you have matched so far, and how many "a"s
you have matched them against, but the DFA isn't smart enough to see
that it is repeating the same calculation 3 times.

The NFA wins because it does *not* search all 2^n combinations of
choices in the regexp, it searches the n(n+1)/2 possible states of the
regexp engine.  The three *different* ways of matching (a?)(a?)(a?)
against "aa" all map to the *same* state of the regexp engine, so the
NFA checks this possibility only once.
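
This state-collapsing idea can be sketched as a toy Ruby matcher for
(a?){n}a{n} (purely illustrative -- no real engine is written this way):
memoize on the pair (optionals consumed, input position), and only
polynomially many distinct states ever get computed.

```ruby
# Toy matcher for (a?){n} a{n}, memoized on [optionals tried, position].
# Without the memo the search explores up to 2**n paths; with it, at
# most (n+1) * (input.length + 1) states are ever computed.
def match_optional_as(n, input)
  memo = {}
  try = lambda do |opt, pos|
    key = [opt, pos]
    return memo[key] if memo.key?(key)
    memo[key] =
      if opt == n
        input[pos..-1] == "a" * n                             # mandatory a{n} tail
      else
        (input[pos] == "a" && try.call(opt + 1, pos + 1)) ||  # (a?) takes an "a"
          try.call(opt + 1, pos)                              # (a?) matches empty
      end
  end
  [try.call(0, 0), memo.size]
end

p match_optional_as(10, "a" * 10)  # matches, visiting far fewer than 2**10 states
```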

So, the reason for the NFA's success is *not* that it "tries all
possibilities at once" or that it "avoids backtracking".  Rather, it
reformulates the problem so that the search space is much smaller.
Equivalently, it implicitly caches the results of previous attempts so
that it can say "hmm, I've been here before and it didn't work then,
so I won't bother trying again".  If you accept the maxim that madness
is trying the same thing in the hope of different results, the DFA
loses because it is insane!  ;-)

Regards,

Jeremy Henty
B5e329ffa0cc78efbfc7ae2d084c149f?d=identicon&s=25 David Balmain (Guest)
on 2007-04-12 20:40
(Received via mailing list)
On 4/13/07, MenTaLguY <mental@rydia.net> wrote:
>
> It's mentioned in his article as well.

Right you are. Sorry, I read the article a couple of months back when
it was posted and I thought I remembered it well enough to comment
without reading it again. I guess not. :-(

> > In some instances it will take longer to compile the
> > regex into a DFA than it will take to evaluate the simple NFA
> > representation.
>
> Namely, when most of the states in the NFA will not be visited for a
> particular input.  That is why the DFA construction is done lazily.

True, lazy construction of the DFA definitely seems like a very good
solution in most cases. I guess what I was trying to say, though, is
that the seemingly naive backtracking NFA algorithm will actually be
the fastest solution in some cases, as there may be little or no
backtracking, and even simply caching the DFA states could be
unnecessary overhead. Therefore, there is no one solution that beats
them all. When matching a simple string like /supercalifraglistic/ for
instance, I believe the simple backtracking algorithm would be faster
than Thompson's algorithm. In fact, there are sub-linear expected-time
algorithms for matching simple strings like this. I'm currently
reading a good book about this called "Flexible Pattern Matching in
Strings" by Gonzalo Navarro and Mathieu Raffinot. Worth a read if you
can get your hands on it.
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-12 21:21
(Received via mailing list)
On Fri, 13 Apr 2007 02:55:20 +0900, Jeremy Henty
<onepoint@starurchin.org> wrote:
> I think this (and much of the rest of the discussion) rather misses
> the point.  The DFA algorithm performs poorly in some circumstances
> because it may backtrack to the *same* state *many* times.

DFA evaluation does not backtrack.  Assuming epsilon transitions are
collapsed when constructing the DFA, for a given input string there will
be exactly n (or less, in the case of failure) transitions taken, where
n is the number of characters in the input string.

Perhaps you meant that the same NFA state is potentially represented in
exponentially many different DFA states?  That would indeed be a
problem, except...

> This is because there are eg. 3 ways of matching (a?)(a?)(a?) against "aa"
> (depending on which of the three a's you consider to be optional)

In the case of regular expressions, this ambiguity is resolved by the
usual "greedy" versus "non-greedy" distinctions, which can be expressed
by assigning weights to transitions in the NFA (thereby pruning the
number of states in the resulting DFA).  This is pretty much the same
way you have to deal with ambiguities when constructing Ragel grammars.

-mental
8ae46ae3170a36a0f79ea109ef0ee2e7?d=identicon&s=25 Tim X (Guest)
on 2007-04-13 02:30
(Received via mailing list)
MenTaLguY <mental@rydia.net> writes:

> Perl's implementation is not very simple, however...
>

True, but I think that's due to all the additional add-ons. The basic
underlying algorithm is/was fairly straightforward. However, I think
even the Perl developers have realised that Perl's RE support was
getting out of hand. Although I've not looked at it, I believe there
are significant changes in Perl 6 in this area. I seem to remember
reading something from Larry, where he said that Perl's functionality
in this area has extended to the point that, in reality, you probably
couldn't call it regexps anymore, but as that was the terminology
people were used to, that's what it would be known as. Perl has gone
past what was traditionally defined as regexps.

Tim
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-13 03:47
(Received via mailing list)
On Fri, 13 Apr 2007 09:30:07 +0900, Tim X <timx@nospam.dev.null> wrote:
> True, but I think thats due to all the additional add-ons. The basic
> underlying algorithm is/was fairly straight forward.

I was thinking in terms of the optimizations applied to make the
backtracking algorithm faster, though that's a factor too...

> However, I think even the Perl developers have realised that Perl's RE
> support was getting out of hand.
> Although I've not looked at it, I believe there are significant changes in
> Perl 6 in this area. I seem to remember reading something from Larry, where he
> said that Perl's functionality in this area has extended to the point that in
> reality, you probably couldn't call it regexps anymore, but as that was
> the terminology people were used to, that's what it would be known as. Perl
> has gone past what was traditionally defined as regexps.

The Perl 6 solution was to promote grammars and rules into first-class
abstractions in the language, on a level with classes and methods.  So
it's basically a general-purpose parsing framework now.  I'm not sure if
that counts as less out-of-hand or not...

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-13 09:01
(Received via mailing list)
On 12.04.2007 18:31, MenTaLguY wrote:
> On the other hand, those modern features which really do require
> backtracking (surprisingly few) also exceed the capabilities of finite
> automata (i.e. those features correspond to non-regular languages), so
> at that point the regular expression engines cannot be said to be using
> FAs at all, but a more general type of automaton.

Strictly speaking, yes.  But AFAIK the basis is usually an NFA.

Btw, because of the nature of the engine there are some more differences
between DFA and NFA engines; for example, the order of expressions in an
alternative typically matters with an NFA engine.  I often prefer that,
because it lets you control the way the RX engine matches to a certain
extent, i.e. you can prioritize alternatives, which can be pretty handy
at times.

> Again, though we must obviously keep backtracking in reserve for those
> patterns that need it, it is very wasteful not to use a pure FA approach
> for those patterns where non-regular features aren't being used.
>
>> sometimes (Perl) with heavy optimizations to prevent certain
>> ugly effects (see "Mastering Regular Expressions" for details).
>
> "Mastering Regular Expressions" is very good for learning the pragmatics
> of working with contemporary regular expression implementations, but it
> is not a good source for theory.

Since I had the theory at university, and at the moment I'm more
interested in using them than writing them, this is good enough for me.
:-)

Kind regards

  robert
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-13 22:03
(Received via mailing list)
On Fri, 2007-04-13 at 16:00 +0900, Robert Klemme wrote:
> Btw, because of the nature of the engine there are some more differences
> between DFA and NFA, for example typically order of expressions in an
> alternative matters with a NFA engine.  I prefer that often, because it
> allows to control the way the RX engine matches to a certain extent,
> i.e. you can prioritize alternatives which can be pretty handy at times.

Nope, you can effectively prioritize alternatives with DFAs, the same
way you can specify greedy/non-greedy matches.  NFAs and DFAs _really
are_ equivalent in expressiveness.

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-13 23:52
(Received via mailing list)
On 13.04.2007 22:02, MenTaLguY wrote:
> On Fri, 2007-04-13 at 16:00 +0900, Robert Klemme wrote:
>> Btw, because of the nature of the engine there are some more differences
>> between DFA and NFA, for example typically order of expressions in an
>> alternative matters with a NFA engine.  I prefer that often, because it
>> allows to control the way the RX engine matches to a certain extent,
>> i.e. you can prioritize alternatives which can be pretty handy at times.
>
> Nope, you can effectively prioritize alternatives with DFAs, the same
> way you can specify greedy/non-greedy matches.  NFAs and DFAs _really
> are_ equivalent in expressiveness.

How do you do that?

Kind regards

  robert
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-14 06:34
(Received via mailing list)
On Sat, 2007-04-14 at 06:50 +0900, Robert Klemme wrote:
> > Nope, you can effectively prioritize alternatives with DFAs, the same
> > way you can specify greedy/non-greedy matches.  NFAs and DFAs _really
> > are_ equivalent in expressiveness.
>
> How do you do that?

Let's use a real simple example:  /^(?:(ab)|(..))$/

Here, we want capturing subexpression 1 to have priority over capturing
subexpression 2.  This time, I'll use a [] notation to annotate
transitions with capture boundaries and priorities.

First, the NFA:

 start: ('' [start $1]->group_1 | '' [start $2]->group_2),
 group_1: ('a' ->group_1b),
 group_1b: ('b' [end $1,priority 1]->final),
 group_2: (any ->group_2b),
 group_2b: (any [end $2,priority 0]->final)

Now, the equivalent DFA:

 {start}: ('' [start $1,start $2]->{group_1,group_2}),
 {group_1,group_2}: (
   'a' ->{group_1b,group_2b} |
   (any-'a') ->{group_2b}
 ),
 {group_1b,group_2b}: (
   'b' [end $1]->{final} |
   (any-'b') [end $2]->{final}
 )
 {group_2b}: (any [end $2]->{final})

See how that works?

More complex examples can be attacked using the same principles, though
it quickly gets painful to work out by hand.

-mental
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-14 15:23
(Received via mailing list)
To be fair, I oversimplified a bit because I couldn't think of a better
example until just now.  Though you can prune the DFA with priorities in
the NFA, an expression like /^(?:(abc)|(a)).*$/ still requires more
sophisticated handling, explicitly tracking alternative matches.

Here's an implementation of a Thompson NFA which does Perl-style
capture:

 http://swtch.com/~rsc/regexp/nfa-perl.y.txt

The same approach works if you construct the DFA in advance or via
memoization of the Thompson NFA evaluation (which is probably the better
of the two for reasons already discussed).

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-16 12:56
(Received via mailing list)
On 14.04.2007 15:23, MenTaLguY wrote:
> The same approach works if you construct the DFA in advance or via
> memoization of the Thompson NFA evaluation (which is probably the better
> of the two for reasons already discussed).

Hmm...  I was more after how you would notate the RX for a DFA-based
engine to do the prioritization.  I do not know of a DFA-based engine
that actually allows writing an RX pattern to do that.

Of course, when building engines manually you can do a lot - and
probably a lot more than what actual RX engines allow.  But since I'm
interested in *using* them vs. writing RX engines your solution was not
exactly what I expected. :-)

Kind regards

  robert
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-16 20:55
(Received via mailing list)
On Mon, 16 Apr 2007 19:55:05 +0900, Robert Klemme
<shortcutter@googlemail.com> wrote:
>> Here's an implementation of a Thompson NFA which does Perl-style
>> capture:
>>
>>  http://swtch.com/~rsc/regexp/nfa-perl.y.txt
>>
> Hmm...  I was more after how you would notate the RX for a DFA based
> engine to do the prioritization.

We note choice points in our tags, and also tag those subsequent
transitions which would disprove a particular choice (for instance, for
/^(?:(abc)|(a)).*$/, the second transition in "axy" disproves the first
alternative, and the third transition in "abc" disproves the second
alternative).

Referring to those tags as we encounter them during FA evaluation, we
can then accumulate sets of plausible captures and cull those which are
disproven as we go.

> I do not know a DFA based engine that would actually allow to write a RX
> pattern to do that.

A non-backtracking NFA simulation (i.e. Thompson NFA) operates by
constructing and evaluating the corresponding DFA on-the-fly (each
unique set of NFA states corresponds to a DFA state).  If you memoize
the construction of DFA states, then you've got a straight DFA engine
(the DFA is constructed lazily).  Some RX engines have historically
built the entire DFA up front, but as previously discussed, that
approach is wasteful (many of the DFA states so constructed may never
get visited).

The same techniques (using tagged transitions to drive other code as the
FA is evaluated) are applicable in all three cases (non-backtracking NFA
simulation, dynamic DFA construction, and static DFA construction).

> Of course, when building engines manually you can do a lot - and
> probably a lot more than what actual RX engines allow.

Right, so I thought it appropriate to give an example of an actual RX
engine which allowed the same (and more) as my hand-worked examples by
extending the same class of techniques I had illustrated so far.

> But since I'm interested in *using* them vs. writing RX engines your
> solution was not exactly what I expected. :-)

I'll admit to being a little confused -- you're interested in using a RX
engine rather than implementing one, but you'd rather see a description
and/or hand-worked example of an implementation technique rather than a
link to a finished implementation using the same techniques under
discussion?

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-17 16:10
(Received via mailing list)
On 16.04.2007 20:54, MenTaLguY wrote:
> On Mon, 16 Apr 2007 19:55:05 +0900, Robert Klemme <shortcutter@googlemail.com> wrote:
>> But since I'm interested in *using* them vs. writing RX engines your
>> solution was not exactly what I expected. :-)
>
> I'll admit to being a little confused -- you're interested in using a RX
> engine rather than implementing one, but you'd rather see a description
> and/or hand-worked example of an implementation technique rather than a
> link to a finished implementation using the same techniques under
> discussion?

LOL - now I'm confused, too. :-)  I was looking for something like this
made-up fake: "To prioritize branches of an alternative in sed you do
sed -e 's/foo(!2bar|!1baz)/yummy\1/g'."  Of course this does not work
with sed (at least no implementation of sed that I know of).  Are there
other tools that allow embedding something in an RX that will do this
prioritization?

Kind regards

  robert
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-17 17:44
(Received via mailing list)
On Tue, 17 Apr 2007 23:10:11 +0900, Robert Klemme
<shortcutter@googlemail.com> wrote:
> LOL - now I'm confused, too. :-)  I was looking for something like (made
> up fake): "To prioritize branches of an alternative in sed you do sed -e
> 's/foo(!2bar|!1baz)/yummy\1/g'."  Of course this does not work with sed
> (at least no implementation of sed that I know).  Are there other tools
> that allow to embed something into a RX that will do this prioritization?

OHHHHHHHH.

| is a deterministic choice operator, so the branches are prioritized by
default -- earlier alternatives take precedence over later ones.  Just
write your RX alternatives from highest to lowest priority.
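
Ruby's (backtracking) engine shows this ordered choice directly:

```ruby
# Alternatives are tried left to right; the first one that can match
# at a given position wins, so ordering acts as a priority.
p "abc"[/ab|a/]  # => "ab"
p "abc"[/a|ab/]  # => "a"
```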

Does that answer your question?

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-17 18:10
(Received via mailing list)
On 17.04.2007 17:44, MenTaLguY wrote:
> On Tue, 17 Apr 2007 23:10:11 +0900, Robert Klemme <shortcutter@googlemail.com> wrote:
>> LOL - now I'm confused, too. :-)  I was looking for something like (made
>> up fake): "To prioritize branches of an alternative in sed you do sed -e
>> 's/foo(!2bar|!1baz)/yummy\1/g'."  Of course this does not work with sed
>> (at least no implementation of sed that I know).  Are there other tools
>> that allow to embed something into a RX that will do this prioritization?
>
> OHHHHHHHH.

I hope it doesn't hurt too much. ;-)

> | is a deterministic choice operator, so the branches are prioritized by
> default -- earlier alternatives take precedence over later ones.  Just
> write your RX alternatives from highest to lowest priority.
>
> Does that answer your question?

Maybe. :-)

I was under the impression that this is precisely a point of difference
between NFA- and DFA-based engines: sed is DFA-based, and as far as I
know order in an alternative does not matter there, while it does matter
for the typical NFA engine.  Did I get this wrong?

Kind regards

  robert
912c61d9da47754de7039f4271334a9f?d=identicon&s=25 MenTaLguY (Guest)
on 2007-04-17 20:19
(Received via mailing list)
On Wed, 18 Apr 2007 01:10:07 +0900, Robert Klemme
<shortcutter@googlemail.com> wrote:
> I was under the impression that this precisely is a point of difference
> between NFA and DFA based engines: sed is DFA based and for all I know
> order in an alternative does not matter while it does matter for the
> typical NFA engine.  Did I get this wrong?

Well, for historical reasons, sed implements only "basic" (versus
"extended") regular expressions, so it does not have explicit
alternation via the | operator.  However, we can still look at it for an
example of how a DFA-based engine can handle choosing between
alternatives.

Consider:

  echo "aaaaa" | sed -e 's/^\(a*\)\(a*\)$/\1,\2/'

There are six possible ways to match this regular expression:

  aaaaa,
  aaaa,a
  aaa,aa
  aa,aaa
  a,aaaa
  ,aaaaa

sed always chooses the first alternative.  Choice between alternatives
for explicit alternation would be handled the same way.
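
Ruby makes the same choice for this pattern -- the leftmost a* is greedy
and takes everything:

```ruby
# Equivalent of the sed example above: the first group consumes all the
# "a"s, leaving the second group empty.
if "aaaaa" =~ /\A(a*)(a*)\z/
  p [$1, $2]  # => ["aaaaa", ""]
end
```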

-mental
E0d864d9677f3c1482a20152b7cac0e2?d=identicon&s=25 Robert Klemme (Guest)
on 2007-04-18 09:56
(Received via mailing list)
On 17.04.2007 20:17, MenTaLguY wrote:
> On Wed, 18 Apr 2007 01:10:07 +0900, Robert Klemme <shortcutter@googlemail.com> wrote:
>> I was under the impression that this precisely is a point of difference
>> between NFA and DFA based engines: sed is DFA based and for all I know
>> order in an alternative does not matter while it does matter for the
>> typical NFA engine.  Did I get this wrong?
>
> Well, for historical reasons, sed implements only "basic" (versus
> "extended") regular expressions, so it does not have explicit alternation
> via the | operator.  However, we can still look at it for an example of
> how a DFA-based engine can handle choosing between alternatives.

Ah!  I reckon this is a specialty of GNU sed:

09:45:46 [~]: sed -ne 's#a\|b#A#gp' <<X
 > aaaa
 > bbbb
 > X
AAAA
AAAA
09:46:02 [~]:

On an old Solaris box I get

bash-2.03# sed -ne 's#a|b#A#gp' <<X
 > aaaaa
 > bbbbb
 > X
bash-2.03#

Of course GNU sed is more capable than "standard" sed...

>   a,aaaa
>   ,aaaaa
>
> sed always chooses the first alternative.  Choice between alternatives
> for explicit alternation would be handled the same way.

Kind regards

  robert
This topic is locked and can not be replied to.