Ruby Forum Ruby > Alternative Ruby grammar

Posted by Markus Liedl (Guest)
on 16.11.2007 08:20
(Received via mailing list)
I have spent the last months to write an alternative Ruby grammar now
registered at rubyforge.org under the name "Ruby top down grammar".

The grammar is hosting language neutral. It must be interpreted or
translated to be run, i.e. to parse something. Currently there are two
translators, one to Emacs Lisp, the other to C. Both produce recursive
descent parsers.

The whole grammar is written in 270 rules taking 1500 lines. It
unifies lexical and syntactic analyses.

Another popular naming for such a grammar is Parsing Expression
Grammar. I extended some forms I found absolutely necessary. It's a
neutral and minimal but still practical grammar definition
language. The variation used here contains 31 different forms. The svn
repo contains a description of it.


Without being sure, I'd like to claim the grammar is close to cover
100% of the Ruby language. It does, for example, parse Ruby stdlib
completely.

If you get to play around with this grammar and discover valid Ruby
code that doesn't parse correctly or not at all then please contact
me. Warnings are not yet implemented.


In direct comparison to the MRI parser there are some pros and cons.

The advantages are less complexity and more flexibility. I got default
values for block formals working (believed impossible with MRI parser;
but let me emphasize this grammar doesn't primarily want to extend
Ruby. I was just seeking what's possible).

Also this grammar is hosting language neutral, meaning, one could
write translators to arbitrary languages and then parse Ruby code from
that language. For example, when translated to (or interpreted by)
JavaScript you could colorify the Ruby snippets on your blog at the
client side. (This grammar project does not (yet) support JavaScript
and people have found another working solution to color Ruby code)


On the bad side parsers using this grammar work slower. Even the
faster of both implementations is many times slower than the MRI
parser.

Also, the abstract syntax trees produced by this grammar have a
different structure than the ones produced by MRI parser.


Currently there are two implementations. The first is a translator to
Emacs Lisp, the other a translator to C. Both care not only for
parsing but also for the construction of abstract syntax trees and
pretty printing.

They do both memoize temporary parsing results. This is more
complicated than in ordinary Packrat parsers, since I've introduced
something I call "modes" in the grammar definition language, which
needs to be cared for in memoizing. Also, there is a feature in there
to "mark" arbitrary parsed words and to check whether a word is
marked. (If you read till here you might have guessed it. It's used
for local variables). That's also something memoization has to take
account of.



If you are interested, take a look at

   http://rubyforge.org/projects/ruby-tp-dw-gram/

You can get the Subversion repository with

   svn checkout svn://rubyforge.org/var/svn/ruby-tp-dw-gram


To get you started fast: The command "make gram" in the checkout
directory should compile the C parser. Then you can pipe-in Ruby code
like with:
   echo 'foo(bar 6,7,8)' | ./gram


Markus Liedl
Posted by Nobuyoshi Nakada (nobu)
on 16.11.2007 09:18
(Received via mailing list)
Hi,
n
At Fri, 16 Nov 2007 16:20:02 +0900,
Markus Liedl wrote in [ruby-talk:279241]:
> The advantages are less complexity and more flexibility. I got default
> values for block formals working (believed impossible with MRI parser;
> but let me emphasize this grammar doesn't primarily want to extend
> Ruby. I was just seeking what's possible).

I don't think it impossible, and I've posted a patch once.
Posted by Charles Oliver Nutter (Guest)
on 16.11.2007 09:28
(Received via mailing list)
Markus Liedl wrote:
> I have spent the last months to write an alternative Ruby grammar now
> registered at rubyforge.org under the name "Ruby top down grammar".
> 
> The grammar is hosting language neutral. It must be interpreted or
> translated to be run, i.e. to parse something. Currently there are two
> translators, one to Emacs Lisp, the other to C. Both produce recursive
> descent parsers.

I would be interested in hearing more about the translation process, and
the possibility of producing RDPs in Ruby and Java.

> Without being sure, I'd like to claim the grammar is close to cover
> 100% of the Ruby language. It does, for example, parse Ruby stdlib
> completely.

I'm not sure how much a measure that is; a parser could parse every
token as a literal "1" and it would parse everything, but it wouldn't
mean it's correct. Perhaps it's possible to roundtrip from the parsed
result back to Ruby code and see whether the result is roughly the same
as the original?

> On the bad side parsers using this grammar work slower. Even the
> faster of both implementations is many times slower than the MRI
> parser.

Do you expect this can be improved? If there's a performance hit for
using this parser it will substantially limit adoption.

> Also, the abstract syntax trees produced by this grammar have a
> different structure than the ones produced by MRI parser.

No big deal for JRuby at least; I don't expect it would take more than a
few days to write a new interpreter based on the new AST. The compiler
might take a few more days beyond that.

What's your goal with this? At the moment, I don't like that there's
only two "mostly correct" parsers in existence: Ruby's Bison-based
parser and JRuby's Jay-based parser. They're both pretty painful to work
with and evolve.

- Charlie
Posted by Ryan Davis (Guest)
on 16.11.2007 10:15
(Received via mailing list)
On Nov 15, 2007, at 23:20 , Markus Liedl wrote:

>
> Another popular naming for such a grammar is Parsing Expression
> Grammar. I extended some forms I found absolutely necessary. It's a
> neutral and minimal but still practical grammar definition
> language. The variation used here contains 31 different forms. The svn
> repo contains a description of it.

Dude... Congratulations! That is damn impressive!

I've tried, TWICE, to do an LR to LL flip on the ruby grammar (sans-
PEG) and failed both times... My brain is just not that big. Using PEG
definitely seems to be the way to go. We hung out and talked to the
primary implementor of ometa at OOPSLA and are probably pushing in
that direction over the medium to long term.

Only now that I've got Parsetree's Uber Test Zuite(*) would I attempt
such a task again, and hesitantly at that.

I noticed a lot of elisp in your source tree... please tell me that
you've hooked this up to ruby-mode! :P

(*) PUTZ is not the most creative acronym... it could use some help.

     Parse Tree Easy Werification Yay! = PTEWY!

     I dunno...
Posted by M. Edward (Ed) Borasky (Guest)
on 16.11.2007 16:09
(Received via mailing list)
Ryan Davis wrote:
>>
> 
> you've hooked this up to ruby-mode! :P
> 
> (*) PUTZ is not the most creative acronym... it could use some help.
> 
>     Parse Tree Easy Werification Yay! = PTEWY!

But we need *two* names, one for the grammar and one for the test suite.
The grammar is easy: "ARRRG" - Another Really Robust Ruby Grammar".

But the test suite ... how about "STAPLES" -- Suite to Test All of
ParseTree's Logic, Execution and Stuff.
Posted by M. Edward (Ed) Borasky (Guest)
on 16.11.2007 16:28
(Received via mailing list)
Charles Oliver Nutter wrote:
> What's your goal with this? At the moment, I don't like that there's 
> only two "mostly correct" parsers in existence: Ruby's Bison-based 
> parser and JRuby's Jay-based parser. They're both pretty painful to work 
> with and evolve.

Isn't there also an ANTLR parser? What about Nathan Sobo's "Treetop"?
Posted by Eric Mahurin (Guest)
on 16.11.2007 16:55
(Received via mailing list)
On Nov 16, 2007 1:20 AM, Markus Liedl <m.liedl@gmx.de> wrote:

> I have spent the last months to write an alternative Ruby grammar now
> registered at rubyforge.org under the name "Ruby top down grammar".
>
> The grammar is hosting language neutral. It must be interpreted or
> translated to be run, i.e. to parse something. Currently there are two
> translators, one to Emacs Lisp, the other to C. Both produce recursive
> descent parsers.
>

Great work Mark!

How does this grammar compare with the one the antlr one done by the
"grammarians"?  Did you start from there or from scratch?

I looks like you have a unified lexer/parser.  With all of the ruby 
parser
dependent lexer states, I can see why this might be easier.  But, this 
is
probably also why you might be losing performance (and why you need
memoization), right?  I'm guessing there might be some backtracking 
needed?

Do you have (E)BNF of this LL grammar, or just gram.lisp?

I'm thinking one of my next steps for my ruby-forge Grammar package (LL
grammar specified in BNF-like ruby) would be to write a (E)BNF parser to 
be
able to translate BNF from antlr/yacc/racc/etc.  Even though mine is LL, 
I
handle some amount of left recursion.  Of course, I'd
like to target a ruby (E)BNF working.
Posted by Ryan Davis (Guest)
on 16.11.2007 18:58
(Received via mailing list)
On Nov 16, 2007, at 07:27 , M. Edward (Ed) Borasky wrote:

> Charles Oliver Nutter wrote:
>> What's your goal with this? At the moment, I don't like that  
>> there's only two "mostly correct" parsers in existence: Ruby's  
>> Bison-based parser and JRuby's Jay-based parser. They're both  
>> pretty painful to work with and evolve.
>
> Isn't there also an ANTLR parser? What about Nathan Sobo's "Treetop"?

there is no complete ruby parser based on antlr yet that I know of.

treetop is not a ruby parser, it is a ruby parser generator (like
antlr).
Posted by Ryan Davis (Guest)
on 16.11.2007 18:59
(Received via mailing list)
On Nov 16, 2007, at 07:54 , Eric Mahurin wrote:

> I looks like you have a unified lexer/parser.  With all of the ruby  
> parser
> dependent lexer states, I can see why this might be easier.  But,  
> this is
> probably also why you might be losing performance (and why you need
> memoization), right?  I'm guessing there might be some backtracking  
> needed?

PEGs take care of that rather efficiently.
Posted by Nick Sieger (Guest)
on 16.11.2007 19:14
(Received via mailing list)
On 11/16/07, Ryan Davis <ryand-ruby@zenspider.com> wrote:
>
> there is no complete ruby parser based on antlr yet that I know of.

Depending on what you view as "complete", there is the XRuby ANTLR-based 
parser:

http://xruby.googlecode.com/svn/trunk/src/com/xruby/compiler/parser/ruby.g

However, the parser actions are heavily intertwined with XRuby
internals, so it would appear to be difficult to extract and re-use
with a different implementation (even for JRuby, we don't see much
upside in trying to port their parser versus the Jay-generated parser
that JRuby already uses).

/Nick
Posted by Ryan Davis (Guest)
on 17.11.2007 02:10
(Received via mailing list)
On Nov 16, 2007, at 10:13 , Nick Sieger wrote:

> Depending on what you view as "complete", there is the XRuby ANTLR- 
> based parser:

Maybe they've improved it since last time I looked at it... It
couldn't parse much of anything the first (couple?) time it was
announced.
Posted by Sean O'halpin (sean)
on 17.11.2007 03:33
(Received via mailing list)
On Nov 16, 2007 8:26 AM, Charles Oliver Nutter <charles.nutter@sun.com> 
wrote:
> At the moment, I don't like that there's
> only two "mostly correct" parsers in existence: Ruby's Bison-based
> parser and JRuby's Jay-based parser.
Hi,

I'm curious, what do you mean by "mostly correct" in the context of MRI?

Regards,
Sean
Posted by Markus Liedl (Guest)
on 17.11.2007 09:50
(Received via mailing list)
On Nov 16, 9:26 am, Charles Oliver Nutter <charles.nut...@sun.com>
wrote:
> Markus Liedl wrote:
> > The grammar is hosting language neutral. It must be interpreted or
> > translated to be run, i.e. to parse something. Currently there are two
> > translators, one to Emacs Lisp, the other to C. Both produce recursive
> > descent parsers.
>
> I would be interested in hearing more about the translation process, and
> the possibility of producing RDPs in Ruby and Java.

You may have a look in the file cp-gen.el. That's the elisp code to
generate the C parser. Every one of the 31 forms I mentioned before
expands nearly directly to some C block with very little analysis done
before. There are lots of temporary variables generated for various
purposes and its left to the C compiler to sort out the mess (which
gcc does pretty well).

If one doesn't want to use code generation, one might create 31
classes, each of them doing the work of one form.  Following the
Interpreter design pattern. The grammar reader would create object
configurations mirroring the rules in the grammar. Then you are not
able to use local variables for the capture variables and a few
others. That would cost speed again.


> > Without being sure, I'd like to claim the grammar is close to cover
> > 100% of the Ruby language. It does, for example, parse Ruby stdlib
> > completely.
>
> I'm not sure how much a measure that is; a parser could parse every
> token as a literal "1" and it would parse everything, but it wouldn't
> mean it's correct. Perhaps it's possible to roundtrip from the parsed
> result back to Ruby code and see whether the result is roughly the same
> as the original?

hmm? There are many correct parses, as seen in the file tests.el.
There may be many bugs too.

The unparser won't catch "interesting" bugs like wrong priority.

> > On the bad side parsers using this grammar work slower. Even the
> > faster of both implementations is many times slower than the MRI
> > parser.
>
> Do you expect this can be improved? If there's a performance hit for
> using this parser it will substantially limit adoption.

Well I didn't ever benchmark MRI's parser before writing this grammar.
The problem for any other parser is that it is blindingly fast.  Still
I don't believe speed is that important. To give you at least one
number: One my specific hardware, the C parser groks around 3 MBytes
of Ruby code per second. Does this seem fast or slow to you?

With careful work one may make the C parser maybe two times faster,
maybe more.  I don't think I will spend my time on that.


> What's your goal with this?

I will give another try at building a Ruby VM. I didn't want to use
one of those parse.y adapters. I think its really nice and useful to
have a portable parser.
Posted by Markus Liedl (Guest)
on 17.11.2007 10:15
(Received via mailing list)
On Nov 16, 10:14 am, Ryan Davis <ryand-r...@zenspider.com> wrote:
> > descent parsers.
> Dude... Congratulations! That is damn impressive!
thanks.

> I've tried, TWICE, to do an LR to LL flip on the ruby grammar (sans-
> PEG) and failed both times... My brain is just not that big. Using PEG
> definitely seems to be the way to go. We hung out and talked to the
> primary implementor of ometa at OOPSLA and are probably pushing in
> that direction over the medium to long term.

> Only now that I've got Parsetree's Uber Test Zuite(*) would I attempt
> such a task again, and hesitantly at that.

You'll write another parser? Explain that.

> I noticed a lot of elisp in your source tree... please tell me that
> you've hooked this up to ruby-mode! :P

Hmm. You are asking for too much. And there is also ruby-cedet.

For regular use within ruby-mode, one should be able to make this
parser incremental. Instead of calling rules you would manage
invocation and return yourself. Keep this stack in an array or
list. Every point where you might need to restart (every five lines,
or whatever) you clone the array or simply keep the list. There are
probably some issues with memoization; Do you need to keep the full
memoization table? Which entries can be deleted?
But in general I believe it is that simple.
Posted by Markus Liedl (Guest)
on 17.11.2007 10:36
(Received via mailing list)
On Nov 16, 4:54 pm, Eric Mahurin <eric.mahu...@gmail.com> wrote:
> ...
> Great work Mark!

thanks.

> How does this grammar compare with the one the antlr one done by the
> "grammarians"?  Did you start from there or from scratch?

I did start from scratch, trying to keep the underlying generator
definition language as simple as possible.

> I looks like you have a unified lexer/parser.  With all of the ruby parser
> dependent lexer states, I can see why this might be easier.  But, this is
> probably also why you might be losing performance (and why you need
> memoization), right?  I'm guessing there might be some backtracking needed?

Thanks to memoization this grammar runs in linear time (mostly, I
believe). Over-linear time (without memoization) is caused by only
a few rules that directly or indirectly apply other rules multiple
times at the same input position.

And I'm using selective memoization. Only rules that profit from it
get memoized, since lookup in the memoization table costs too.

That's one of the things not yet fully optimized in the C parser. I
spent some time determining the set of memoized rules when writing the
elisp backend, but didn't do the same again for the C parser, though
the relative costs of various operations may differ greatly.


But I don't understand your proposition about backtracking.  PEG based
parser do naturally backtrack at certain situations.  Definit Clause
Grammar from Prolog are "more" backtracking, they care to find every
possible parse, which PEG do not.

If you'd write (or aa bb cc) a PEG parser will try to parse first
aa. If this succeeds, it will never try bb or cc at the given
position. Thats why they are called non-exhaustively backtracking.


> Do you have (E)BNF of this LL grammar, or just gram.lisp?

No, there is no other format of this grammar. gram.lisp is the source.


> I'm thinking one of my next steps for my ruby-forge Grammar package (LL
> grammar specified in BNF-like ruby) would be to write a (E)BNF parser to be
> able to translate BNF from antlr/yacc/racc/etc.  Even though mine is LL, I
> handle some amount of left recursion.  Of course, I'd
> like to target a ruby (E)BNF working.


You might have a look at the file deflang.txt. All the forms I used
are described there. If you wanted to adapt this grammar you need
solutions for such arcane forms as "postpone-rest-of-line" which is
needed for parsing here-docs and "again" which parses a previously
seen string again. Just two examples.
Posted by M. Edward (Ed) Borasky (Guest)
on 17.11.2007 21:02
(Received via mailing list)
Markus Liedl wrote:
>> What's your goal with this?
> 
> I will give another try at building a Ruby VM. I didn't want to use
> one of those parse.y adapters. I think its really nice and useful to
> have a portable parser.


I thought your name looked familiar -- you were the lead on the
"Carbone" project to build a gforth/vmgen based Ruby virtual machine,
right? Have you picked that up again, and is this part of it?
Posted by Charles Oliver Nutter (Guest)
on 18.11.2007 02:18
(Received via mailing list)
Ryan Davis wrote:
> 
> On Nov 16, 2007, at 10:13 , Nick Sieger wrote:
> 
>> Depending on what you view as "complete", there is the XRuby 
>> ANTLR-based parser:
> 
> Maybe they've improved it since last time I looked at it... It couldn't 
> parse much of anything the first (couple?) time it was announced.

As I've heard it they're able to compile the full stdlib, so I presume
that means they can parse it.

However, when I occasionally play with XRuby, I frequently find
something not quite right. Last time, they were missing flip-flop.

- Charlie
Posted by Charles Oliver Nutter (Guest)
on 18.11.2007 02:21
(Received via mailing list)
Sean O'Halpin wrote:
> On Nov 16, 2007 8:26 AM, Charles Oliver Nutter <charles.nutter@sun.com> wrote:
>> At the moment, I don't like that there's
>> only two "mostly correct" parsers in existence: Ruby's Bison-based
>> parser and JRuby's Jay-based parser.
> Hi,
> 
> I'm curious, what do you mean by "mostly correct" in the context of MRI?

If MRI is considered the gold standard, then obviously it's 100%
correct. Of course, it can't be considered the gold standard, since
minor things change from release to release; so I'd say all parsers
would be some percentage of "complete" given some
implementation-independent definition of "Ruby".

But perhaps that's just my opinion :) It's hard to say anyone is
"correct" when there's no language specification.

- Charlie
Posted by M. Edward (Ed) Borasky (Guest)
on 18.11.2007 02:30
(Received via mailing list)
Charles Oliver Nutter wrote:
> If MRI is considered the gold standard, then obviously it's 100% 
> 
Which is why all implementations are "extended subsets" of a
non-existing language standard. :) *Every* language works that way, at
least until it has been around long enough for someone to deem it
necessary to establish a standards committee and thrash out a formal
definition. Some languages get to this point quickly -- FORTRAN, for
example. Other languages take a long time -- Forth, for example. I doubt
seriously if Perl will *ever* get there. :)
Posted by Ryan Davis (Guest)
on 18.11.2007 07:39
(Received via mailing list)
On Nov 17, 2007, at 17:17 , Charles Oliver Nutter wrote:

> However, when I occasionally play with XRuby, I frequently find  
> something not quite right. Last time, they were missing flip-flop.

To be fair... flip-flop isn't quit right.
Posted by Markus Liedl (Guest)
on 18.11.2007 14:10
(Received via mailing list)
On Nov 17, 9:01 pm, "M. Edward (Ed) Borasky" <zn...@cesmail.net>
wrote:
> Markus Liedl wrote:
> >> What's your goal with this?
>
> > I will give another try at building a Ruby VM. I didn't want to use
> > one of those parse.y adapters. I think its really nice and useful to
> > have a portable parser.
>
> I thought your name looked familiar -- you were the lead on the
> "Carbone" project to build a gforth/vmgen based Ruby virtual machine,
> right? Have you picked that up again, and is this part of it?

No, it's not part of it. I'm starting from scratch again.
Along the lines of David Ungars SOAR and Ian Piumartas thesis; from
parse
tree directly to machine code.
Posted by Markus Liedl (Guest)
on 18.11.2007 16:36
(Received via mailing list)
On Nov 17, 9:47 am, Markus Liedl <m.li...@gmx.de> wrote:
> Well I didn't ever benchmark MRI's parser before writing this grammar.
> The problem for any other parser is that it is blindingly fast.  Still
> I don't believe speed is that important. To give you at least one
> number: One my specific hardware, the C parser groks around 3 MBytes
> of Ruby code per second. Does this seem fast or slow to you?
>
> With careful work one may make the C parser maybe two times faster,
> maybe more.  I don't think I will spend my time on that.

I've checked in a file pending-optimizations.txt. I won't work
on that for now but at least its written up somewhere.
Posted by Sean O'halpin (sean)
on 18.11.2007 20:14
(Received via mailing list)
On Nov 18, 2007 1:20 AM, Charles Oliver Nutter <charles.nutter@sun.com> 
wrote:
> If MRI is considered the gold standard, then obviously it's 100%
>
Thanks for the clarification. I guess that's the problem with a de
facto vs a de jure standard - it can be a moving target.

By the way, the progress you're making on JRuby really is a delight - 
great job.

Regards,
Sean