On Thu, 12 Apr 2007 03:27:04 +0900, “Robert D.”

removed_email_address@domain.invalid wrote:

No it must be me who is confused, I thought that backtracking cannot

be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy

match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this

(after collapsing most of the epsilon transitions introduced by

Thompson’s Algorithm):

start: (’’ ->a_star | ‘’ ->any_star),

a_star: (‘a’ ->a_star | any ->any_star),

any_star: (any ->any_star | ‘a’ =>final),

final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which

sets a counter (initialized to -1) to the current input position if that

is larger than the counter’s current value. When we have explored all

possible matches, the value of the counter will be the position of the

last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA

corresponds to a superposition of states in the original NFA (again in

pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a,

b, and c):

{start,a_star,any_star}: (

‘a’ =>{a_star,any_star,final} |

(any-‘a’) ->{any_star}

),

{any_star}: (

‘a’ =>{any_star,final} |

(any-‘a’) ->{any_star}

),

{a_star,any_star,final}: (

‘a’ =>{a_star,any_star,final} |

(any-‘a’) ->{any_star,final}

),

{any_star,final}: (

‘a’ =>{any_star,final} |

(any-‘a’) ->{any_star,final}

)

The resulting FA is deterministic, and {a_star,any_star,final} and

{any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional

counters and labeled transitions with priorities to resolve

greedy/non-greedy ambiguities (as is often done in Ragel to resolve

ambiguities). Prioritized transitions are also necessary for regexps

which are not anchored at the beginning.

-mental