Alternative Ruby grammar

I have spent the last months writing an alternative Ruby grammar, now
registered at rubyforge.org under the name “Ruby top down grammar”.

The grammar is host-language neutral. It must be interpreted or
translated to be run, i.e. to parse something. Currently there are two
translators, one to Emacs Lisp, the other to C. Both produce recursive
descent parsers.

The whole grammar consists of 270 rules taking 1500 lines. It
unifies lexical and syntactic analysis.

Another popular name for such a grammar is Parsing Expression
Grammar (PEG). I extended it with some forms I found absolutely
necessary. It’s a neutral and minimal but still practical grammar
definition language. The variation used here contains 31 different
forms. The svn repo contains a description of it.
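
For readers unfamiliar with the formalism, PEG rules look roughly like
this (generic PEG notation for illustration only; this is not Markus’s
actual definition language, whose 31 forms are described in the repo):

```
# "/" is ordered choice: alternatives are tried left to right and the
# first match wins. "*" is repetition; "&" and "!" are lookahead predicates.
expr   <- term (("+" / "-") term)*
term   <- number / "(" expr ")"
number <- [0-9]+
```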

Without being sure, I’d like to claim the grammar comes close to
covering 100% of the Ruby language. It does, for example, parse the
Ruby stdlib completely.

If you get to play around with this grammar and discover valid Ruby
code that parses incorrectly or not at all, please contact me.
Warnings are not yet implemented.

In direct comparison to the MRI parser there are some pros and cons.

The advantages are less complexity and more flexibility. I got default
values for block formals working (believed impossible with the MRI
parser; but let me emphasize this grammar doesn’t primarily aim to
extend Ruby. I was just exploring what’s possible).

Also, this grammar is host-language neutral, meaning one could write
translators to arbitrary languages and then parse Ruby code from those
languages. For example, when translated to (or interpreted by)
JavaScript, you could colorize the Ruby snippets on your blog on the
client side. (This grammar project does not (yet) support JavaScript,
and people have found another working solution to color Ruby code.)

On the bad side, parsers using this grammar run slower. Even the
faster of the two implementations is many times slower than the MRI
parser.

Also, the abstract syntax trees produced by this grammar have a
different structure than the ones produced by the MRI parser.

Currently there are two implementations: a translator to Emacs Lisp
and a translator to C. Both handle not only parsing but also the
construction of abstract syntax trees and pretty printing.

Both memoize temporary parsing results. This is more complicated than
in ordinary packrat parsers, since I’ve introduced something I call
“modes” in the grammar definition language, which memoization needs to
account for. There is also a feature to “mark” arbitrary parsed words
and to check whether a word is marked. (If you have read this far you
might have guessed it: it’s used for local variables.) That’s also
something memoization has to take into account.
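
A hypothetical Ruby sketch of that wrinkle (the real implementations
are Emacs Lisp and C; all names here are invented): in an ordinary
packrat parser the memo key is (rule, position), but with modes and
marked words both must join the key, or a cached result from one
context would be wrongly reused in another.

```ruby
# Toy packrat parser whose memo key includes the current mode and the
# set of marked words (standing in for known local variables).
class MemoParser
  def initialize(input)
    @input  = input
    @memo   = {}
    @mode   = :expr   # invented mode name, for illustration
    @marked = {}      # words "marked" so far
  end

  def mark(word)
    @marked[word] = true
  end

  def marked?(word)
    @marked.key?(word)
  end

  # Memoized rule application. Snapshotting the marked-word set is the
  # naive approach; a real parser would do something cheaper.
  def apply(rule, pos)
    key = [rule, pos, @mode, @marked.keys.sort]
    @memo.fetch(key) { @memo[key] = send(rule, pos) }
  end

  # Toy rule: a word parses as a local-variable reference only if marked.
  def variable(pos)
    word = @input[pos..-1].to_s.slice(/\A\w+/)
    return nil unless word
    marked?(word) ? [:lvar, word, pos + word.length] : nil
  end
end

parser = MemoParser.new("foo")
parser.apply(:variable, 0)   # => nil ("foo" not yet marked)
parser.mark("foo")
parser.apply(:variable, 0)   # => [:lvar, "foo", 3] (new key, no stale hit)
```

Had the key been (rule, position) alone, the second call would have
returned the stale nil cached by the first.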

If you are interested, take a look at

http://rubyforge.org/projects/ruby-tp-dw-gram/

You can get the Subversion repository with

svn checkout svn://rubyforge.org/var/svn/ruby-tp-dw-gram

To get you started fast: the command “make gram” in the checkout
directory should compile the C parser. Then you can pipe in Ruby code
like this:

echo 'foo(bar 6,7,8)' | ./gram

Markus L.

Hi,

At Fri, 16 Nov 2007 16:20:02 +0900,
Markus L. wrote in [ruby-talk:279241]:

The advantages are less complexity and more flexibility. I got default
values for block formals working (believed impossible with the MRI
parser; but let me emphasize this grammar doesn’t primarily aim to
extend Ruby. I was just exploring what’s possible).

I don’t think it impossible, and I’ve posted a patch once.

Markus L. wrote:

I have spent the last months writing an alternative Ruby grammar, now
registered at rubyforge.org under the name “Ruby top down grammar”.

The grammar is host-language neutral. It must be interpreted or
translated to be run, i.e. to parse something. Currently there are two
translators, one to Emacs Lisp, the other to C. Both produce recursive
descent parsers.

I would be interested in hearing more about the translation process, and
the possibility of producing RDPs in Ruby and Java.

Without being sure, I’d like to claim the grammar comes close to
covering 100% of the Ruby language. It does, for example, parse the
Ruby stdlib completely.

I’m not sure how much of a measure that is; a parser could parse every
token as a literal “1” and it would parse everything, but that wouldn’t
mean it’s correct. Perhaps it’s possible to roundtrip from the parsed
result back to Ruby code and see whether the result is roughly the same
as the original?
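
For what it’s worth, that roundtrip check is easy to phrase in Ruby
today. A sketch (leaning on the stdlib’s Ripper as the reference
parse, and stripping its position annotations so pure layout
differences don’t count as mismatches; `roundtrip_ok?` and `strip_pos`
are invented names):

```ruby
require "ripper"

# Remove Ripper's [line, column] annotations from scanner events so that
# two layouts of the same code compare equal.
def strip_pos(node)
  return node unless node.is_a?(Array)
  if node.first.is_a?(Symbol) && node.first.to_s.start_with?("@") &&
     node.last.is_a?(Array) && node.last.length == 2 &&
     node.last.all? { |x| x.is_a?(Integer) }
    node = node[0..-2]
  end
  node.map { |child| strip_pos(child) }
end

# The parse of the unparser's output should equal the parse of the input.
def roundtrip_ok?(original, unparsed)
  strip_pos(Ripper.sexp(original)) == strip_pos(Ripper.sexp(unparsed))
end

roundtrip_ok?("foo(bar, 6, 7)", "foo( bar , 6 , 7 )")  # layout-only: true
roundtrip_ok?("a + b * c", "(a + b) * c")              # trees differ: false
```

As Markus notes below, this won’t catch a priority bug that the
unparser faithfully reproduces; comparing against an independent
parser (as here) avoids that blind spot.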

On the bad side, parsers using this grammar run slower. Even the
faster of the two implementations is many times slower than the MRI
parser.

Do you expect this can be improved? If there’s a performance hit for
using this parser it will substantially limit adoption.

Also, the abstract syntax trees produced by this grammar have a
different structure than the ones produced by the MRI parser.

No big deal for JRuby at least; I don’t expect it would take more than a
few days to write a new interpreter based on the new AST. The compiler
might take a few more days beyond that.

What’s your goal with this? At the moment, I don’t like that there’s
only two “mostly correct” parsers in existence: Ruby’s Bison-based
parser and JRuby’s Jay-based parser. They’re both pretty painful to work
with and evolve.

  • Charlie

On Nov 15, 2007, at 23:20 , Markus L. wrote:

Another popular naming for such a grammar is Parsing Expression
Grammar. I extended some forms I found absolutely necessary. It’s a
neutral and minimal but still practical grammar definition
language. The variation used here contains 31 different forms. The svn
repo contains a description of it.

Dude… Congratulations! That is damn impressive!

I’ve tried, TWICE, to do an LR to LL flip on the ruby grammar
(sans-PEG) and failed both times… My brain is just not that big. Using
PEG definitely seems to be the way to go. We hung out and talked to
the primary implementor of OMeta at OOPSLA and are probably pushing in
that direction over the medium to long term.

Only now that I’ve got Parsetree’s Uber Test Zuite(*) would I attempt
such a task again, and hesitantly at that.

I noticed a lot of elisp in your source tree… please tell me that
you’ve hooked this up to ruby-mode! :stuck_out_tongue:

(*) PUTZ is not the most creative acronym… it could use some help.

 Parse Tree Easy Werification Yay! = PTEWY!

 I dunno...

Ryan D. wrote:

you’ve hooked this up to ruby-mode! :stuck_out_tongue:

(*) PUTZ is not the most creative acronym… it could use some help.

Parse Tree Easy Werification Yay! = PTEWY!

But we need two names, one for the grammar and one for the test suite.
The grammar is easy: “ARRRG” – “Another Really Robust Ruby Grammar”.

But the test suite … how about “STAPLES” – Suite to Test All of
ParseTree’s Logic, Execution and Stuff.

Charles Oliver N. wrote:

What’s your goal with this? At the moment, I don’t like that there’s
only two “mostly correct” parsers in existence: Ruby’s Bison-based
parser and JRuby’s Jay-based parser. They’re both pretty painful to work
with and evolve.

Isn’t there also an ANTLR parser? What about Nathan Sobo’s “Treetop”?

On Nov 16, 2007, at 07:27 , M. Edward (Ed) Borasky wrote:

Charles Oliver N. wrote:

What’s your goal with this? At the moment, I don’t like that
there’s only two “mostly correct” parsers in existence: Ruby’s
Bison-based parser and JRuby’s Jay-based parser. They’re both
pretty painful to work with and evolve.

Isn’t there also an ANTLR parser? What about Nathan Sobo’s “Treetop”?

there is no complete ruby parser based on antlr yet that I know of.

treetop is not a ruby parser, it is a ruby parser generator (like
antlr).

On Nov 16, 2007 1:20 AM, Markus L. [email protected] wrote:

I have spent the last months to write an alternative Ruby grammar now
registered at rubyforge.org under the name “Ruby top down grammar”.

The grammar is host-language neutral. It must be interpreted or
translated to be run, i.e. to parse something. Currently there are two
translators, one to Emacs Lisp, the other to C. Both produce recursive
descent parsers.

Great work Mark!

How does this grammar compare with the ANTLR one done by the
“grammarians”? Did you start from there or from scratch?

It looks like you have a unified lexer/parser. With all of the ruby
parser dependent lexer states, I can see why this might be easier. But
this is probably also why you might be losing performance (and why you
need memoization), right? I’m guessing there might be some
backtracking needed?

Do you have (E)BNF of this LL grammar, or just gram.lisp?

I’m thinking one of my next steps for my ruby-forge Grammar package
(LL grammar specified in BNF-like ruby) would be to write an (E)BNF
parser to be able to translate BNF from antlr/yacc/racc/etc. Even
though mine is LL, I handle some amount of left recursion. Of course,
I’d like to get a ruby (E)BNF working.

On Nov 16, 2007, at 07:54 , Eric M. wrote:

It looks like you have a unified lexer/parser. With all of the ruby
parser dependent lexer states, I can see why this might be easier.
But this is probably also why you might be losing performance (and
why you need memoization), right? I’m guessing there might be some
backtracking needed?

PEGs take care of that rather efficiently.

On 11/16/07, Ryan D. [email protected] wrote:

there is no complete ruby parser based on antlr yet that I know of.

Depending on what you view as “complete”, there is the XRuby ANTLR-based
parser:

http://xruby.googlecode.com/svn/trunk/src/com/xruby/compiler/parser/ruby.g

However, the parser actions are heavily intertwined with XRuby
internals, so it would appear to be difficult to extract and re-use
with a different implementation (even for JRuby, we don’t see much
upside in trying to port their parser versus the Jay-generated parser
that JRuby already uses).

/Nick

On Nov 16, 2007, at 10:13 , Nick S. wrote:

Depending on what you view as “complete”, there is the XRuby
ANTLR-based parser:

Maybe they’ve improved it since last time I looked at it… It
couldn’t parse much of anything the first (couple?) time it was
announced.

On Nov 16, 2007 8:26 AM, Charles Oliver N. [email protected]
wrote:

At the moment, I don’t like that there’s
only two “mostly correct” parsers in existence: Ruby’s Bison-based
parser and JRuby’s Jay-based parser.

Hi,

I’m curious, what do you mean by “mostly correct” in the context of MRI?

Regards,
Sean

On Nov 16, 9:26 am, Charles Oliver N. [email protected]
wrote:

Markus L. wrote:

The grammar is host-language neutral. It must be interpreted or
translated to be run, i.e. to parse something. Currently there are two
translators, one to Emacs Lisp, the other to C. Both produce recursive
descent parsers.

I would be interested in hearing more about the translation process, and
the possibility of producing RDPs in Ruby and Java.

You may have a look at the file cp-gen.el. That’s the elisp code that
generates the C parser. Each of the 31 forms I mentioned before
expands nearly directly to some C block, with very little analysis
done beforehand. Lots of temporary variables are generated for various
purposes, and it’s left to the C compiler to sort out the mess (which
gcc does pretty well).

If one doesn’t want to use code generation, one might create 31
classes, each of them doing the work of one form, following the
Interpreter design pattern. The grammar reader would create object
configurations mirroring the rules in the grammar. Then you are not
able to use local variables for the capture variables and a few
others, which would cost speed again.
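
In Ruby, that interpreter-style alternative might look roughly like
this (a sketch with invented names, showing three forms rather than
all 31: a literal, a sequence, and PEG ordered choice):

```ruby
# Each grammar form becomes a small class with a #parse method; rules
# become object configurations composed from these classes.
Result = Struct.new(:value, :pos)

class Lit                     # match a fixed string
  def initialize(s); @s = s; end
  def parse(input, pos)
    Result.new(@s, pos + @s.length) if input[pos, @s.length] == @s
  end
end

class Seq                     # match all sub-forms in order
  def initialize(*forms); @forms = forms; end
  def parse(input, pos)
    values = @forms.map do |f|
      r = f.parse(input, pos) or return nil
      pos = r.pos
      r.value
    end
    Result.new(values, pos)
  end
end

class Choice                  # PEG ordered choice: first success wins
  def initialize(*forms); @forms = forms; end
  def parse(input, pos)
    @forms.each { |f| r = f.parse(input, pos) and return r }
    nil
  end
end

# Object configuration mirroring a rule like: greeting <- ("hi" / "ho") "!"
greeting = Seq.new(Choice.new(Lit.new("hi"), Lit.new("ho")), Lit.new("!"))
greeting.parse("ho!", 0)   # succeeds with value ["ho", "!"] at pos 3
```

The per-object dispatch, and the loss of compiler-managed local
variables for captures, is exactly where this style gives back speed
relative to generated code.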

Without being sure, I’d like to claim the grammar comes close to
covering 100% of the Ruby language. It does, for example, parse the
Ruby stdlib completely.

I’m not sure how much of a measure that is; a parser could parse every
token as a literal “1” and it would parse everything, but that wouldn’t
mean it’s correct. Perhaps it’s possible to roundtrip from the parsed
result back to Ruby code and see whether the result is roughly the same
as the original?

Hmm? There are many correct parses, as seen in the file tests.el.
There may be many bugs too.

The unparser won’t catch “interesting” bugs like wrong priority.

On the bad side, parsers using this grammar run slower. Even the
faster of the two implementations is many times slower than the MRI
parser.

Do you expect this can be improved? If there’s a performance hit for
using this parser it will substantially limit adoption.

Well, I never benchmarked MRI’s parser before writing this grammar.
The problem for any other parser is that MRI’s is blindingly fast.
Still, I don’t believe speed is that important. To give you at least
one number: on my specific hardware, the C parser groks around 3
MBytes of Ruby code per second. Does this seem fast or slow to you?

With careful work one may make the C parser maybe two times faster,
maybe more. I don’t think I will spend my time on that.

What’s your goal with this?

I will give another try at building a Ruby VM. I didn’t want to use
one of those parse.y adapters. I think it’s really nice and useful to
have a portable parser.

On Nov 16, 10:14 am, Ryan D. [email protected] wrote:

Dude… Congratulations! That is damn impressive!

thanks.

I’ve tried, TWICE, to do an LR to LL flip on the ruby grammar
(sans-PEG) and failed both times… My brain is just not that big. Using
PEG definitely seems to be the way to go. We hung out and talked to
the primary implementor of OMeta at OOPSLA and are probably pushing in
that direction over the medium to long term.

Only now that I’ve got Parsetree’s Uber Test Zuite(*) would I attempt
such a task again, and hesitantly at that.

You’ll write another parser? Explain that.

I noticed a lot of elisp in your source tree… please tell me that
you’ve hooked this up to ruby-mode! :stuck_out_tongue:

Hmm. You are asking for too much. And there is also ruby-cedet.

For regular use within ruby-mode, one should be able to make this
parser incremental. Instead of calling rules you would manage
invocation and return yourself, keeping this stack in an array or
list. At every point where you might need to restart (every five
lines, or whatever) you clone the array or simply keep the list. There
are probably some issues with memoization: do you need to keep the
full memoization table? Which entries can be deleted? But in general I
believe it is that simple.

On Nov 16, 4:54 pm, Eric M. [email protected] wrote:


Great work Mark!

thanks.

How does this grammar compare with the ANTLR one done by the
“grammarians”? Did you start from there or from scratch?

I did start from scratch, trying to keep the underlying generator
definition language as simple as possible.

It looks like you have a unified lexer/parser. With all of the ruby parser
dependent lexer states, I can see why this might be easier. But this is
probably also why you might be losing performance (and why you need
memoization), right? I’m guessing there might be some backtracking needed?

Thanks to memoization this grammar runs in linear time (mostly, I
believe). Over-linear time (without memoization) is caused by only
a few rules that directly or indirectly apply other rules multiple
times at the same input position.

And I’m using selective memoization. Only rules that profit from it
get memoized, since lookup in the memoization table has a cost too.

That’s one of the things not yet fully optimized in the C parser. I
spent some time determining the set of memoized rules when writing the
elisp backend, but didn’t do the same again for the C parser, though
the relative costs of various operations may differ greatly.

But I don’t understand your proposition about backtracking. PEG-based
parsers naturally backtrack in certain situations. Definite Clause
Grammars from Prolog do “more” backtracking; they try to find every
possible parse, which PEGs do not.

If you write (or aa bb cc), a PEG parser will first try to parse
aa. If this succeeds, it will never try bb or cc at the given
position. That’s why they are called non-exhaustively backtracking.
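
That non-exhaustive behaviour is easy to demonstrate with a few toy
combinators in Ruby (invented for illustration; this is not Markus’s
implementation). With the choice ("a" / "ab") followed by "c", the
input "abc" fails to parse, because the choice commits to "a" and
never reconsiders "ab"; an exhaustive, DCG-style search would
backtrack into the choice and succeed:

```ruby
# Parsers are lambdas from (input, pos) to the new position, or nil.
lit    = ->(s)      { ->(input, pos) { pos + s.length if input[pos, s.length] == s } }
choice = ->(*alts)  { ->(input, pos) { alts.each { |a| p2 = a.(input, pos) and return p2 }; nil } }
seq    = ->(*parts) { ->(input, pos) { parts.each { |part| pos = part.(input, pos) or return nil }; pos } }

# s <- ("a" / "ab") "c"
s = seq.(choice.(lit.("a"), lit.("ab")), lit.("c"))
s.("ac", 0)    # => 2   (the choice's "a" branch is the right one here)
s.("abc", 0)   # => nil (committed to "a", then "c" failed against "b")
```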

Do you have (E)BNF of this LL grammar, or just gram.lisp?

No, there is no other format of this grammar. gram.lisp is the source.

I’m thinking one of my next steps for my ruby-forge Grammar package (LL
grammar specified in BNF-like ruby) would be to write an (E)BNF parser to be
able to translate BNF from antlr/yacc/racc/etc. Even though mine is LL, I
handle some amount of left recursion. Of course, I’d like to get a ruby
(E)BNF working.

You might have a look at the file deflang.txt. All the forms I used
are described there. If you want to adapt this grammar you need
solutions for such arcane forms as “postpone-rest-of-line”, which is
needed for parsing here-docs, and “again”, which parses a previously
seen string again. Just two examples.

Ryan D. wrote:

On Nov 16, 2007, at 10:13 , Nick S. wrote:

Depending on what you view as “complete”, there is the XRuby
ANTLR-based parser:

Maybe they’ve improved it since last time I looked at it… It couldn’t
parse much of anything the first (couple?) time it was announced.

As I’ve heard it they’re able to compile the full stdlib, so I presume
that means they can parse it.

However, when I occasionally play with XRuby, I frequently find
something not quite right. Last time, they were missing flip-flop.

  • Charlie

Sean O’Halpin wrote:

On Nov 16, 2007 8:26 AM, Charles Oliver N. [email protected] wrote:

At the moment, I don’t like that there’s
only two “mostly correct” parsers in existence: Ruby’s Bison-based
parser and JRuby’s Jay-based parser.

Hi,

I’m curious, what do you mean by “mostly correct” in the context of MRI?

If MRI is considered the gold standard, then obviously it’s 100%
correct. Of course, it can’t be considered the gold standard, since
minor things change from release to release; so I’d say all parsers
would be some percentage of “complete” given some
implementation-independent definition of “Ruby”.

But perhaps that’s just my opinion :slight_smile: It’s hard to say anyone is
“correct” when there’s no language specification.

  • Charlie

Charles Oliver N. wrote:

If MRI is considered the gold standard, then obviously it’s 100%

Which is why all implementations are “extended subsets” of a
non-existent language standard. :slight_smile: Every language works that way, at
least until it has been around long enough for someone to deem it
necessary to establish a standards committee and thrash out a formal
definition. Some languages get to this point quickly – FORTRAN, for
example. Other languages take a long time – Forth, for example. I doubt
seriously if Perl will ever get there. :slight_smile:

Markus L. wrote:

What’s your goal with this?

I will give another try at building a Ruby VM. I didn’t want to use
one of those parse.y adapters. I think it’s really nice and useful to
have a portable parser.

I thought your name looked familiar – you were the lead on the
“Carbone” project to build a gforth/vmgen based Ruby virtual machine,
right? Have you picked that up again, and is this part of it?

On Nov 17, 9:47 am, Markus L. [email protected] wrote:

Well, I never benchmarked MRI’s parser before writing this grammar.
The problem for any other parser is that MRI’s is blindingly fast.
Still, I don’t believe speed is that important. To give you at least
one number: on my specific hardware, the C parser groks around 3
MBytes of Ruby code per second. Does this seem fast or slow to you?

With careful work one may make the C parser maybe two times faster,
maybe more. I don’t think I will spend my time on that.

I’ve checked in a file pending-optimizations.txt. I won’t work
on that for now, but at least it’s written up somewhere.