Parser as an alternative to RegExen

robertjames · February 22, 2007, 3:15am

I’m parsing a large file, currently using compound regexen:

PREAMBLE = 'AA'
USERID = '\d{8}'
USER_HELLO = "#{PREAMBLE}(#{USERID})"

Is there a simple way to do this using a parser such as ANTLR? I’ve
never used one before, so if it requires a learning curve, I’ll stick
to my regexen.

But if there is a cleaner way to do this, I’d certainly like to.

robertjames · February 22, 2007, 4:26am

On Feb 21, 2007, at 8:15 PM, S. Robert J. wrote:

I’m parsing a large file, currently using compound regexen:

PREAMBLE = 'AA'
USERID = '\d{8}'
USER_HELLO = "#{PREAMBLE}(#{USERID})"

Is there a simple way to do this using a parser such as ANTLR? I’ve
never used one before, so if it requires a learning curve, I’ll stick
to my regexen.

I really don’t think there’s any value in going all the way to a
parser generator here. This job looks to be squarely in the Regexp
domain, so there’s no reason to feel bad about using them.
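
For what it's worth, here is a minimal sketch of the regex-only approach applied to a whole input. Only the AA preamble and the eight-digit user ID come from the original post; the sample string and the user_ids helper are made up for illustration.

```ruby
# Fragments kept as single-quoted strings, as in the original post,
# so the backslash in \d reaches the regex engine untouched.
PREAMBLE   = 'AA'
USERID     = '\d{8}'
USER_HELLO = Regexp.new("#{PREAMBLE}(#{USERID})")

# Hypothetical helper: collect every user ID that follows a preamble.
def user_ids(text)
  # With one capture group, String#scan returns an array of
  # one-element arrays, so flatten it into a plain list of IDs.
  text.scan(USER_HELLO).flatten
end

sample = "AA12345678 junk AA87654321 AB11111111"
p user_ids(sample)  # => ["12345678", "87654321"]
```

For a genuinely large file, it would probably make sense to feed the helper one line at a time (for example with File.foreach) rather than reading everything into memory first.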

James Edward G. II

robertjames · February 22, 2007, 4:50am

On Thu, Feb 22, 2007 at 12:26:12PM +0900, James Edward G. II wrote:

I really don’t think there’s any value in going all the way to a
parser generator here. This job looks to be squarely in the Regexp
domain, so there’s no reason to feel bad about using them.

Agreed.

OTOH, parsers are sure fun to write (especially recursive-descent ones
for simple grammars)!

If you do decide to go with a parser generator, check out Dhaka,
http://dhaka.rubyforge.org/

robertjames · February 22, 2007, 8:45am

On 22.02.2007 04:26, James Edward G. II wrote:

I really don’t think there’s any value in going all the way to a parser
generator here. This job looks to be squarely in the Regexp domain, so
there’s no reason to feel bad about using them.

Agreed. Also, in Ruby, Regexp objects can be used to build larger
expressions, because Regexp#to_s is implemented to retain each
object's option settings:

irb(main):001:0> PREAMBLE = /AA/
=> /AA/
irb(main):002:0> USERID = /\d{8}/
=> /\d{8}/
irb(main):003:0> USER_HELLO = /#{PREAMBLE}(#{USERID})/
=> /(?-mix:AA)((?-mix:\d{8}))/

That way you can make sure that all subexpressions are valid, and you
can mix options if you need to (for example, a case-insensitive
preamble).
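
As a small sketch of that option mixing: the constant names below and the trailing "end" tag are invented for this example, and only the AA preamble and eight-digit ID come from the thread. Each interpolated fragment keeps its own flags, so the preamble can ignore case while the rest of the pattern stays case-sensitive.

```ruby
# Case-insensitive preamble; the other fragments keep the default options.
PREAMBLE_I = /aa/i
USERID_RE  = /\d{8}/
TAG        = /end/     # hypothetical trailing marker, not from the thread

# Interpolation calls Regexp#to_s on each fragment, which wraps it in a
# group carrying that fragment's own option flags.
HELLO_RE = /#{PREAMBLE_I}(#{USERID_RE})#{TAG}/

p HELLO_RE.source                    # shows the per-fragment option groups
p HELLO_RE.match?("Aa12345678end")   # => true  (preamble ignores case)
p HELLO_RE.match?("AA12345678END")   # => false (the tag is still case-sensitive)
p HELLO_RE.match("aa12345678end")[1] # => "12345678"
```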

Kind regards

robert

robertjames · February 22, 2007, 9:13am

S. Robert J. wrote:

But if there is a cleaner way to do this, I’d certainly like to.

As other people have mentioned, there is nothing wrong with using
Regexps for this. But another approach, which I think is really nice,
is to use Ragel. Ragel is a generator for finite state machines that
recently got a backend for Ruby (so far it's only in version control).

The regexps would look almost the same, but the speed would increase
greatly.

–
Ola B. (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

“Yields falsehood when quined” yields falsehood when quined.

robertjames · February 22, 2007, 11:34am

On Thu, 22 Feb 2007 03:15:09 +0100, S. Robert J. wrote:

But if there is a cleaner way to do this, I’d certainly like to.

One instance where I’d be thinking of picking up parser-fu would be if
the data contains recursively nested structures of some sort. Either
the regexes or the ancillary code juggling them gets hairy anyway,
losing you the simplicity, and you still have to work your way through
the nesting levels manually, which an AST parser would do for you.
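
To make the nesting point concrete, here is a rough sketch of a hand-written recursive-descent parser. It is not a parser generator such as ANTLR or Dhaka, and the grammar (parenthesised groups of eight-digit IDs, nested to any depth) and the GroupParser class are invented purely for illustration. The recursion in group does the work of tracking nesting levels that a single regex cannot express cleanly.

```ruby
require 'strscan'

# Hypothetical grammar, made up for this example:
#   group  := '(' item* ')'
#   item   := group | userid
#   userid := eight digits
class GroupParser
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # Parse one complete group; only whitespace may follow it.
  def parse
    tree = group
    @s.skip(/\s*/)
    raise ArgumentError, "trailing input at #{@s.pos}" unless @s.eos?
    tree
  end

  private

  def group
    @s.skip(/\s*/)
    @s.scan(/\(/) or raise ArgumentError, "expected '(' at #{@s.pos}"
    items = []
    loop do
      @s.skip(/\s*/)
      if @s.check(/\(/)
        items << group              # recursion handles any nesting depth
      elsif (id = @s.scan(/\d{8}/))
        items << id
      else
        break                       # neither a group nor an ID: group must end
      end
    end
    @s.scan(/\)/) or raise ArgumentError, "expected ')' at #{@s.pos}"
    items
  end
end

p GroupParser.new("(12345678 (87654321 (11112222)) 33334444)").parse
# => ["12345678", ["87654321", ["11112222"]], "33334444"]
```

The result is a nested array that mirrors the input structure, which is roughly what a generated parser would hand back as a tree.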

David V.

robertjames · February 22, 2007, 5:33am

If you’re just looking to get the job done, you should stick with
regexes. Your example doesn’t look like it has the kind of expression
constructs that would justify applying a full-fledged parser.

On the other hand, if you have some time to kill or think your language
could get a little more elaborate, then check out Dhaka by all means.
It is still somewhat cumbersome since tokenizers have to be
hand-written, but this is about to change.

Mushfeq.