Ruby language parser in ruby

stickstone · December 4, 2009, 2:55pm

On 11/30/09, Ryan D. [email protected] wrote:

On Nov 30, 2009, at 10:17 , Roger P. wrote:

redparse

Last time I looked at redparse I couldn’t even get the tests to run. Only

I wish you had/would let me know about whatever problem(s) it was that
you had, so I could/can fix it. I’ve definitely fixed a great many
problems since the first release, but I can’t fix bugs that I don’t
know about.

through lots of hacking did I get it executing and that was just to see how
fast it was.

I’ll be the last to claim that redparse is (right now) a fast parser.
It’s probably (still) the slowest ruby parser in the universe, tho it
should be somewhat faster than it was originally.

Caleb found a horrible ruby package that both redparse and
ruby_parser choke on (think: it’ll finish right around the time of heat
death of the universe… and will probably contribute greatly to it).

Ah, yes, japanese/zipcodes.rb. A 65MB ruby file… it’s an
interesting test case, since any reasonable parser ought to be able to
parse it all before the cows come home. It wasn’t until I talked to
you about it that I realized that my parser (and yours too? still?)
slows down as N2 (or is it eN), where N is the number of tokens in
input. Fundamentally, both parsing algorithms are O(K*N) (for a fairly
large K), but since both build up a parse tree as they go, and
sometimes the garbage collector has to run in the background, which
has to visit everything in that partial parse tree as it goes, the
time for garbage collection runs in the background increases as the
length of the input. Overall this makes for unreasonably super-linear
slowdowns of the algorithm. I had never thought that just adding
garbage collection to an algorithm could take it from O(N) to O(N**2).

Usually, this isn’t an issue since the parser finishes for any
non-ridiculously sized input (eg <1MB) before the garbage collector
needs to run. I suppose that using a generational garbage collector
would return things to O(N) land. Reducing the garbage production rate
of the parser might help a lot too.

On the other hand, why does this particular data set need to be
expressed as executable ruby code, when something like CSV seems so
much more reasonable…

Otherwise, it didn’t seem to be faster or better in any area (esp given how
much work it was to get it to work at all).

I will readily admit that your parser has (for now) the advantage in
terms of speed, ability to run in MRI 1.9, and ability to parse 1.9
expressions. However, it also presents a fairly ugly interface to the
user. parse_tree/ruby_parser trees are these yucky lisp-like things
which behave in various unexpected ways. Whereas redparse trees are a
great deal nicer, IMNSHO. I specifically designed redparse to have
pleasant parsetrees as its output; as I see it, redparse’s tree format
has a number of advantages over the lisp-like output
parse_tree/ruby_parser or the smalltalk-like output of RubyNode:

it’s object oriented
it closely mirrors structure of original source code
it has better positioning information (line #/byte offset)
it behaves in predicable, expectable ways in most cases
things like begin…end and def…end have action unified in one
node, instead of spread out over multiple nodes nested in strange ways
operators are actually a separate type of node, instead of being a
weirdly named method call

The first of those points is really the key one. Having an
object-oriented interface makes the parse trees much nicer to deal
with programatically than anything else. However, the user need not be
limited to that; he can use redparse’s parse trees in list-oriented or
hash-oriented if for some reason those prove to be better. So really,
redparse presents the best of all 3 worlds; its interface is
object-oriented, list-oriented, or hash-oriented depending on what
best suits the user.

Now, all of these things have been true about redparse basically from
the beginning, altho the interface has changed slightly (I think for
the better) over time. Perhaps you prefer the parse_tree style
interface; in fact I’m certain that you do. But when I’m doing the
kind of deep metaprogramming which requires access to a parse tree, I
want a tree format that’s going to be as simple, regular, and
predictable as possible. Ruby’s syntax is fairly complex, and any type
of parse tree representing that syntax is going to inevitably reflect
a fair amount of that complexity as a result. But there is a virtue to
be had in making the resulting trees as simple as it’s possible for
them to be, rather than squirrelly and convoluted in unnecessary extra
ways.

stickstone · December 4, 2009, 5:31pm

On 12/4/09, Brian C. [email protected] wrote:

A minor problem with the redparse gem is that it gives most of the files
root-only permissions. I just did ‘sudo gem install redparse’

Yes, that is indeed most unfortunate and due to be fixed in the next
release, which is coming Real Soon Now. I’m not sure how those
permissions got all weird (again); it’s not something I usually give a
lot of thought or attention to, I’m afraid. (Frankly, I wish rubygems
would have warned me when I created the gem that some of the files
were not world-readable.)

Looking at this code - I don’t think I would dare hack it. I think
ruby_parser is more what I was looking for.

Would you care to elaborate on that? What didn’t you like and/or find
hard to understand? What could I do better? Actually, redparse is a
fairly normal LALR-based parser, the code divides internally into
definitions, rules, and actions. I did invent my own LR language, tho,
being as I’m so unhappy with yacc and friends.

I’m interested to hear, to tell the truth, what sorts of things you
want to change in the existing ruby grammar; I’m very much like
playing around with extending the language in various ways. I have a
variety of ideas I want to pursue myself, but I’m always interested to
hear what new features other people might want.

stickstone · December 4, 2009, 5:42pm

Caleb C. wrote:

On 12/4/09, Brian C. [email protected] wrote:

A minor problem with the redparse gem is that it gives most of the files
root-only permissions. I just did ‘sudo gem install redparse’

Yes, that is indeed most unfortunate and due to be fixed in the next
release, which is coming Real Soon Now. I’m not sure how those
permissions got all weird (again); it’s not something I usually give a
lot of thought or attention to, I’m afraid. (Frankly, I wish rubygems
would have warned me when I created the gem that some of the files
were not world-readable.)

Looking at this code - I don’t think I would dare hack it. I think
ruby_parser is more what I was looking for.

Would you care to elaborate on that? What didn’t you like and/or find
hard to understand?

Well, all of it really.

I presume that lib/redparse.rb is the “real” parser (I also found
lib/redparse/babyparser.rb and babynodes.rb)

It’s a monolithic file, and it was hard even to see where the grammar
began. I believe it’s here:

[
-[UNOP, Expr, lower_op]>>UnOpNode,
-[DEFOP, ParenedNode]>>UnOpNode,
-[Op(/^(?:unary|lhs|rhs)\*$/), ValueNode, lower_op]>>UnaryStarNode,
... etc

and an example of a larger rule is like this:

-[NumberToken&-{:negative=>true}, Op(‘**’).la]>>
stack_monkey(“fix_neg_exp”,2,Op(“-@”,true)){|stack|
#neg_op.unary=true
num=stack[-2]
op=OperatorToken.new(“-@”,num.offset)

op.startline=num.startline

    stack[-2,0]=op
    num.ident.sub!(/\A-/,'')
    num.offset+=1
  },

I have no idea how to (a) understand, or (b) modify that. I can see
there is quite a lot of Ruby operator abuse going on, but without
defined semantics.

At least with racc, it’s extremely well documented, in the sense that I
have a printout of the yacc manual to refer to.

So no doubt RedParse is a fine ruby parser, and generates a fine object
tree as its output. But it’s not so good for me as a starting point for
building languages which inherit some of ruby flavour, but are
significantly different.

stickstone · December 4, 2009, 10:41pm

Caleb C. wrote:

This particular
example fixes up the precedence of expressions like -2**10. ‘-2’ is
normally lexed as one single numeric token, as is normal in most
languages. In this one special case, however, the -@ must actually be
made lower precedence than **

Ugh. No doubt that sort of thing will bite me.

I’d still like to hear more about this language(s) you’re trying to
make, if you want to tell.

I’m going to try to implement a ‘ruby flavoured erlang’, a front-end
which spits out regular erlang which is compiled in the normal way, to
see what such a language might look like.

stickstone · December 4, 2009, 9:28pm

On 12/4/09, Brian C. [email protected] wrote:

Caleb C. wrote:

Would you care to elaborate on that? What didn’t you like and/or find
hard to understand?

Well, all of it really.

I presume that lib/redparse.rb is the “real” parser (I also found
lib/redparse/babyparser.rb and babynodes.rb)

Yes, that’s right. babyparser.rb is an example of a minimal (3 rule)
parser using this system, and can make a good place to start trying to
understand the full-blown parser.

It’s a monolithic file, and it was hard even to see where the grammar
began. I believe it’s here:

Unfortunately, there’s a lot of experimental stuff which shouldn’t be
in there, related to my attempt to write a proper parser compiler.

[
-[UNOP, Expr, lower_op]>>UnOpNode,
-[DEFOP, ParenedNode]>>UnOpNode,
-[Op(/^(?:unary|lhs|rhs)\*$/), ValueNode, lower_op]>>UnaryStarNode,
... etc

These are indeed the rules. The general form is of a rule is
-[ patterns to search for on the top of parse stack ] >>
NodeTypeTheyGetReplacedWith,

These 3 rules deal respectively with (most) unary operators, the
defined? operator, and unary star operators. (What’s found to the
right of the >> is generally a pretty good clue as to what a
particular rule is doing. (But not always in the case of stack
monkeys…)) UNOP, DEFOP, Expr, lower_op and the like are defined
above in the definitions section.

Altho the action to take on finding a pattern on the parse stack is
not always to reduce the matched portion of the stack into a Node. For
instance, here:

    num.offset+=1
  },

That’s not really what I’d call a larger rule, merely (alas) a longer
one… This is an example of one of many relatively unimportant rules
with which the parser must unfortunately be littered. This particular
example fixes up the precedence of expressions like -2**10. ‘-2’ is
normally lexed as one single numeric token, as is normal in most
languages. In this one special case, however, the -@ must actually be
made lower precedence than **. The implementation of this fixup can’t
be neatly shoehorned into a Node constructor, however, so special
imperative code (a ‘stack monkey’) had to be written to fiddle with
the parse stack directly.

I have no idea how to (a) understand, or (b) modify that. I can see
there is quite a lot of Ruby operator abuse going on, but without
defined semantics.

I would call that a ‘DSL’. Most of the special operators and other
unusual syntax are defined in my pattern matching language, Reg, which
is a different project. Reg is moderately well documented, but that’s
in a whole other directory.

At least with racc, it’s extremely well documented, in the sense that I
have a printout of the yacc manual to refer to.

I just couldn’t ever get yacc to do what I wanted it to do,
personally. Lots of other people have had more luck…

I would like some day (if I ever have time) to split out the parser
construction tool aspects or redparse from the actual ruby parser
itself, and package and document the parser compiler/interpreter
better. For now, altho I have made an effort to make the interface to
RedParse fairly clear and well described, the internals I simply
didn’t even try to explain…

So no doubt RedParse is a fine ruby parser, and generates a fine object
tree as its output. But it’s not so good for me as a starting point for
building languages which inherit some of ruby flavour, but are
significantly different.

I can explain more IF you’re interested, but it does seem like you
know where you want to go right now.

I’d still like to hear more about this language(s) you’re trying to
make, if you want to tell.

stickstone · December 4, 2009, 11:53pm

On Dec 4, 2009, at 13:45 , Brian C. wrote:

way?
My default answer to nearly any of these types of questions is going to
be “because that’s how MRI does it”.

In this case, yes, the lexer and parser are keeping a shared variable
called lex_state that knows whether it is in the middle of an expression
(and many other states) so here the trailing ‘+’ keeps the expression
open.

lex_state needs to die. It basically only exists because the language is
a tangled mess and it was designed with LR parsing in mind (AFAICT,
lex_state is a symptom of not really knowing where you are contextually,
because you’re parsing bottom up). I’d like to not have lex_state and
many of the complications that come with it. I’m not sure if it’ll still
be ruby at that point, but I’m giving it a go to see.

I am thinking it ought to be possible to do this in the grammar,
e.g.

expr: expr ‘+’ opt_nl expr
| expr ‘-’ opt_nl expr

opt_nl:
| nl

you’re going to have those EVERYWHERE… but yeah, it should be
possible. Write good tests from the beginning.

Also: can you summarise how you’re handling expressions nested within
string literals, e.g. “abc #{foo} def” ?

There is no way to summarize that. It is horrible and I hate it, but I’m
not in a position to make it work better with the current architecture.

stickstone · December 4, 2009, 10:45pm

Ryan D. wrote:

I think your best bet at this time is ruby_parser.

Question for you: as far as I can make out, the handling of embeeded
linebreaks such as

a = b +
c

is handled via state kept in the lexer. Can I ask why you did it this
way? I am thinking it ought to be possible to do this in the grammar,
e.g.

expr: expr ‘+’ opt_nl expr
| expr ‘-’ opt_nl expr

opt_nl:
| nl

but if you’ve found out the hard way that it isn’t, it could save me
following a dead end.

Also: can you summarise how you’re handling expressions nested within
string literals, e.g. “abc #{foo} def” ?

Thanks,

Brian.