Building the finite state machine of a Ruby regexp

Gilles_L · August 2, 2006, 12:02am

Does anyone know if there is a nice way to build the Finite State
Machine
(FSM) of a Ruby regexp ?

Assuming the native ruby regexp parser used a (more or less explicit)
FSM
I first thought that there was some builtin way to access or re-build it
but, after reading some reference docs (Regexp class) and posts I can’t
see any useful methods to do this. But I would be happy to be wrong
about
this…

As an alternative solution, I wonder if someone knows about a module
that
could parse Ruby regexps (accepting exactly the same regexp grammar as
Ruby) and let me read the associated FSM (or whatever graph-like
structure
representing the regexp).

Any help will be greatly appreciated. Thanks sincerely.

Gilles_L · August 3, 2006, 2:23pm

On 8/1/06, Gilles L. [email protected] wrote:

Does anyone know if there is a nice way to build the Finite State Machine
(FSM) of a Ruby regexp ?

I don’t know of anything specifically oriented to Ruby Regexps that
does this, but surely such tools exist for (say) Perl…?

As an alternative solution, I wonder if someone knows about a module that
could parse Ruby regexps (accepting exactly the same regexp grammar as
Ruby) and let me read the associated FSM (or whatever graph-like structure
representing the regexp).

Given how notoriously opaque they are, I’ve actually found regular
expressions to be pretty easy to parse. I can send you a partial
Regexp parser from one of my projects that just extracted the
information I needed to know at the time… but probably what you
really want is Simon S.'s regexp-engine. You can find it
here:

http://rubyforge.org/projects/aeditor/

Gilles_L · August 3, 2006, 2:23pm

On Aug 2, 2006, at 22:44, Caleb C. wrote:

On 8/1/06, Gilles L. [email protected] wrote:

Does anyone know if there is a nice way to build the Finite State
Machine
(FSM) of a Ruby regexp ?

I don’t know of anything specifically oriented to Ruby Regexps that
does this, but surely such tools exist for (say) Perl…?

Strictly speaking, there isn’t anything that does this, since modern
‘extended’ regexes (a set which includes Ruby’s regexes) aren’t
always representable as finite state automata (i.e., they can
recognise more than the class of regular languages).

As an alternative solution, I wonder if someone knows about a
module that
could parse Ruby regexps (accepting exactly the same regexp
grammar as
Ruby) and let me read the associated FSM (or whatever graph-like
structure
representing the regexp).

Sorry, can’t help you there either - I don’t know and couldn’t dig up
anything that exposes Ruby’s regex internals (or Oniguruma’s) in the
way you’re asking for, which strikes me as the only way to use
exactly the same grammar as Ruby does. Like Caleb C.
suggested, though, there are probably partial solutions out there,
depending on how flexible your criteria are and exactly what
information you need - are you looking for an aid in building
regexes? just visualising them? some other sort of analysis?

matthew smillie.

Gilles_L · August 4, 2006, 12:22am

On Thu, 03 Aug 2006 09:22:14 +0900, Matthew S. wrote :

(…) Like Caleb C.
suggested, though, there are probably partial solutions out there,
depending on how flexible your criteria are and exactly what
information you need - are you looking for an aid in building
regexes? just visualising them? some other sort of analysis?

What I’d like to do is, given any regexp :

To test quickly if a particular string matches this regexp : that’s
why I’d prefer to use native Ruby regexes. They are the natural choice
when programming in Ruby and offer very good performance (because the
parser is compiled into the Ruby interpreter from a well-established C
code base (GNU regexps) that should be reasonably well optimized after
decades of public exposure).
To generate “random” strings that match the same regexp : that’s why
I
was thinking about using a Finite State Machine (or kind of it since, as
you pointed out ‘modern’ regexp are not alway representable as FSM) :
starting from the initial state and choosing randomly an outbound edge,
I can add a (first) character to the string. Repeating the process from
the node I arrived on, I can add a second character, and so on… (then
I will also have to make sure that I finally reach the final state in a
reasonable time but that’s another problem).

Gilles L…