On Thu, 12 Apr 2007 03:27:04 +0900, “Robert D.”
[email protected] wrote:
No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match
using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this
(after collapsing most of the epsilon transitions introduced by
Thompson’s Algorithm):
start: (‘’ ->a_star | ‘’ ->any_star),
a_star: (‘a’ ->a_star | any ->any_star),
any_star: (any ->any_star | ‘a’ =>final),
final: (any ->final)
Here, → is a regular transition, and => is a labeled transition which
sets a counter (initialized to -1) to the current input position if that
is larger than the counter’s current value. When we have explored all
possible matches, the value of the counter will be the position of the
last character in the (greedy) match.
The DFA of course could never backtrack but is there a DFA for this regex?
Every NFA can be rewritten as a DFA. Each state in the resulting DFA
corresponds to a superposition of states in the original NFA (again in
pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a,
b, and c):
{start,a_star,any_star}: (
‘a’ =>{a_star,any_star,final} |
(any-‘a’) ->{any_star}
),
{any_star}: (
‘a’ =>{any_star,final} |
(any-‘a’) ->{any_star}
),
{a_star,any_star,final}: (
‘a’ =>{a_star,any_star,final} |
(any-‘a’) ->{any_star,final}
),
{any_star,final}: (
‘a’ =>{any_star,final} |
(any-‘a’) ->{any_star,final}
)
The resulting FA is deterministic, and {a_star,any_star,final} and
{any_star,final} are both accepting states.
If we need to capture subexpressions, we can introduce additional
counters and labeled transitions with priorities to resolve
greedy/non-greedy ambiguities (as is often done in Ragel to resolve
ambiguities). Prioritized transitions are also necessary for regexps
which are not anchored at the beginning.
-mental