Parsing, BNF, TreeTop, GhostWheel,

Hi,

I’m looking for a parser which can be fed with some (A)BNF-style rules.

Gave TreeTop a try
—cut---------------------------------------------------------
grammar MathGrammar
rule additive
multitive ‘+’ additive / multitive
end
rule multitive
primary '’ multitive / primary
end
rule primary
‘(’ additive ‘)’ / number
end
rule number
[1-9] [0-9]

end
end
—cut---------------------------------------------------------

as well as GhostWheel
—cut---------------------------------------------------------
MathParser = GhostWheel.build_parser {

rule( :additive , alt( :multitive,
seq( :multitive, ‘+’, :additive )
))
rule( :multitive , alt( seq( :primary, ‘*’, :multitive ),
:primary
))
rule( :primary , alt( seq( ‘(’, :additive, ‘)’ ),
:number
))
rule( :number , seq( /[1-9]/ , zplus( /0-9/ ) ))

parser( :main, seq(:additive, eof()) )
}
—cut---------------------------------------------------------

‘1+1’ parses fine.
However when I change the definition of “additive” from
multitive ‘+’ additive / multitive
to
multitive / multitive ‘+’ additive
it fails to parse.

Is this a problem with packrat/PEG parsers in general?
If so, which parser is more appropriate?
It should hand back a parse tree.

Memory consumption is not an issue.

Regards,
Philipp

Philipp K. wrote:

I’m looking for a parser which can be fed with some (A)BNF-style rules.

If so, which parser is more appropriate?
It should hand back a parse tree.

Alternatively I’d appreciate if Regexps could return all
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

Regards,
Philipp

On 12 Aug 2010, at 19:52, Philipp K. wrote:

Regards,
Philipp

I was trying to do the same thing and asked about repeated named
captures.

It seems that using String#scan is the closest anything Ruby has, as the
Oniguruma regex engine doesn’t support it. I think it’s a real shame as
named capture groups are really useful.

Regards,
Iain

On 13.08.2010 00:50, Iain B. wrote:

occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

As I understand you want /f(o)+/ =~ “foo” to return [“o”, “o”] as match
for the group (used normal capturing groups for simplicity).

I was trying to do the same thing and asked about repeated named captures.
http://osdir.com/ml/ruby-talk/2010-07/msg00361.html

It seems that using String#scan is the closest anything Ruby has, as the Oniguruma regex engine doesn’t support it. I think it’s a real shame as named capture groups are really useful.

Regular expression engines generally do only return one match per group

  • regardless of naming or not naming groups. I’m not up to date with
    current Perl’s regular expressions which are the only ones I can imagine
    to be crazy enough to provide such a feature. :slight_smile: Otherwise String#scan
    is indeed the proper tool to get multiple matches. The example above
    could be done like this

if /f(o+)/ =~ s
$1.scan /o/ do |match|
p match
end
end

or this

if /f(o+)/ =~ s
matches = $1.scan /o/
p matches
end

Kind regards

robert

On 13 Aug 2010, at 11:15, Robert K. wrote:

Regular expression engines generally do only return one match per group - regardless of naming or not naming groups. I’m not up to date with current Perl’s regular expressions which are the only ones I can imagine to be crazy enough to provide such a feature. :slight_smile:

.Net definitely offers it, which is ironic as none of the .Net devs I
used to work with ever used regex.

Scan is a really poor alternative (for this kind of task), I think.
Instead of just writing one regex which expresses what I want to do much
better, I end up having to write a lot of extra code to get the same
effect. Scan’s cool for more simple things.

Regards
Iain

Robert K. wrote:

instead of just the last match for each name. Feasible?

As I understand you want /f(o)+/ =~ “foo” to return [“o”, “o”] as match
for the group (used normal capturing groups for simplicity).

FRUIT = ‘(? Apple|Banana|Pear|Orange)’
FRUIT_COLLECTION = '(?<FRUIT_COLLECTION> ’ << FRUIT << ‘*)’

re = Regexp.new( ‘^’ << FRUIT_COLLECTION << ‘$’, Regexp::EXTENDED )
re.match( ‘PearBananaApple’ )[:FRUIT]
=> “Apple”

I would want e.g. matchdata[:FRUIT] to be an Array
[ ‘Pear’, ‘Banana’, ‘Apple’ ]
and not just ‘Apple’ (the last FRUIT).

Actually something similar to a concrete syntax tree / parse tree
would be even better:

ROOT “PearBananaApple”
FRUIT_COLLECTION “PearBananaApple”
FRUIT “Pear”
FRUIT “Banana”
FRUIT “Apple”

the Oniguruma regex engine doesn’t support it. I think it’s a real shame as named capture groups are really useful.

Exactly.
Anyway thanks for your input.

Regards,
Philipp

(Sent previously, but it didn’t appear. Sorry if you get this twice).

Philipp K. wrote:

Is this a problem with packrat/PEG parsers in general?

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It’s actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It’s often necessary
though, in the light of the above.

Also, your approach will yield the wrong results when you add ‘-’ into
the mix, because of associativity. “1-2-3” will calculate as 1-(2-3).
A better way is to use iteration, see

http://github.com/nathansobo/treetop/blob/master/examples/lambda_calculus/arithmetic.treetop

Clifford H…

Christian S. wrote:

You may want to look at: http://kanocc.rubyforge.org/

Actually, I think Citrus does a better job than Treetop (of which I’m
primary maintainer and a serious user). It provides a BNF built on ruby
functions and operators, and also a Treetop/PEG-like parser for grammar
files as an alternative. A brief look at Kanocc indicates Citrus may be
preferable over it also.

Citrus doesn’t memoize by default however, you have to enable that. It
also doesn’t have support for semantic predicates, an important feature
(for me) which I added to Treetop. Otherwise it is preferable in most
ways, and I think Michael J. has done a sterling job. Kudos!

http://github.com/mjijackson/citrus/

Clifford H…

You may want to look at: http://kanocc.rubyforge.org/

http://kanocc.rubyforge.org/</shameless plug>

best regards

Christian S.

2010/8/17 Clifford H. [email protected]

Clifford H. wrote:

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It’s actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It’s often necessary
though, in the light of the above.

What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.
That seems to rule out PEG parsers.

Clifford H. wrote:

Christian S. wrote:

You may want to look at: http://kanocc.rubyforge.org/

Actually, I think Citrus does a better job than Treetop (of which I’m primary maintainer and a serious user). It provides a BNF built on ruby functions and operators, and also a Treetop/PEG-like parser for grammar files as an alternative. A brief look at Kanocc indicates Citrus may be preferable over it also.

Citrus doesn’t memoize by default however, you have to enable that. It also doesn’t have support for semantic predicates, an important feature (for me) which I added to Treetop.

http://github.com/mjijackson/citrus/

Thanks a lot for the pointers Clifford and Christian!
Will have a look as soon as time permits.

Philipp

Philipp K. wrote:

Clifford H. wrote:
What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.

Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven’t been fixed because they’re always
implemented by humans, who paper over the issues, and introduce new
errors instead :wink:

The upshot is that you’re not going to find a direct implementation
that works without thinking.

That seems to rule out PEG parsers.

On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration… which is actually pretty easy
once you get the hang of it… the point here is that the rest of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Clifford H…

Clifford H. wrote:

Philipp K. wrote:

Clifford H. wrote:
What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.
Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven’t been fixed because they’re always
implemented by humans, who paper over the issues, and introduce new
errors instead :wink:

The upshot is that you’re not going to find a direct implementation
that works without thinking.

That seems to rule out PEG parsers.
On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration… which is actually pretty easy
once you get the hang of it… the point here is that the rest of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Actually, while it is true that PEGs as specificied in Bryan Ford’s
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which can support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper “Packrat
Parsers Can Support Left Recursion”, although with the caveat that you
lose the linear time complexity guarantee that makes packrat parsing
so compelling in the first place.

jwm

Jörg W Mittag wrote:

Actually, while it is true that PEGs as specificied in Bryan Ford’s
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which can support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper “Packrat
Parsers Can Support Left Recursion”,

Nice, thanks for the pointer. I’m not brave enough to try to implement
it in Treetop (the builder used for code gen is a bit of a pain), but
it’d certainly be cool to have. Probably easier to add to Citrus, where
there’s no code generator in the way.

Clifford H…