Dhaka-2.1.0 : Fun with lexers

require 'dhaka'
class MyLexerSpec < Dhaka::LexerSpecification
  # Regex operators such as * ( ) and . are escaped to match literally.
  for_pattern('\w+') { puts "word #{current_lexeme.value}" }
  for_pattern('\s+') { puts "whitespace" }
  for_pattern('\*') { puts "bullet point" }
  for_pattern('(\.|,|\'|-|\(|\)|:)') do
    puts "punctuation #{current_lexeme.value}"
  end
  for_pattern('http://\w+(\.\w+)+(/(\w|_)+)*(\.\w+)?') do
    puts "url #{current_lexeme.value}"
  end
end

Dhaka::Lexer.new(MyLexerSpec).lex("

This release of Dhaka adds the ability to generate lexers from a
specification such as the above.

Lexer generation works just like parser generation in Dhaka - you
write a spec, generate a lexer, then 'compile' it to Ruby. The
cool thing about this is that although it looks like a
straightforward application of Ruby regexes, it's not. Under the
covers, the regex engine is Dhaka's own, implemented in Ruby using
Dhaka itself. Having full control of the engine makes certain
wonderful things possible, but a disadvantage is that not all of
Ruby's awesome regex operators are supported. We're going for a
subset of the Ruby regex language - you can see the current
grammar at http://dhaka.rubyforge.org/regex_grammar.html
(apologies if this looks not-so-human-readable). In future
versions, we'll be implementing assertions, curly-brace quantifier
expressions and the lookahead operator (the symbols for these have
been reserved and must be escaped).

The examples on the homepage have been updated to use the lexer
generator where appropriate (the hand-written tokenizer is still
a nice quick-and-dirty solution in some cases).

Other changes:

  * We've added the ability to specify parser actions (code blocks
    that are invoked when a particular syntactic construct is
    recognized). This feature isn't documented yet, but it's available
    for the adventurous and something will be written up about it soon.
  * Compiled parsers are much smaller than they used to be.
  * A couple of bugs relating to the use of escape characters in
    grammar symbol names were fixed.

This release is backwards-compatible. Please let me know if you
find out otherwise.

http://dhaka.rubyforge.org

Mushfeq.

").each {|tok|}

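For readers who have not used a lexer generator, here is a rough sketch of what a spec like the one above boils down to. It does not use Dhaka at all: it leans on Ruby's built-in regexes and StringScanner, and it assumes the usual longest-match convention (ties go to the rule listed first), which may differ from what Dhaka's own engine does. The names RULES and tokenize are made up for this illustration.

```ruby
require 'strscan'

# Hypothetical stand-in for a lexer spec: an ordered list of [name, regexp] rules.
RULES = [
  [:url,         %r{http://\w+(\.\w+)+(/\w+)*}],
  [:word,        /\w+/],
  [:whitespace,  /\s+/],
  [:bullet,      /\*/],
  [:punctuation, /[.,'\-():]/]
].freeze

# Returns [name, lexeme] pairs, choosing the longest match at each position.
# When two rules match the same length, the one listed first wins.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    best_name = nil
    best_length = 0
    RULES.each do |name, pattern|
      # match? is anchored at the scan pointer and returns the match length or nil.
      length = scanner.match?(pattern)
      if length && length > best_length
        best_name = name
        best_length = length
      end
    end
    raise ArgumentError, "unexpected character at offset #{scanner.pos}" unless best_name
    tokens << [best_name, scanner.peek(best_length)]
    scanner.pos += best_length
  end
  tokens
end

tokenize("* see http://dhaka.rubyforge.org, ok").each do |name, lexeme|
  puts "#{name} #{lexeme.inspect}"
end
```

The URL rule beats the word rule on the same input only because its match is longer, which is why no special ordering trick is needed for "http".
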
Hi,

this looks really good. I’m not really into parser generators, but I’ve
been meaning to jump in for many months now. One question that comes to
mind is reusability: is it possible to combine parsers in some way?

Some formats, like URI/URL or IPv4/IPv6, are defined over and over again
and are contained in many other formats. Would it be possible to define
one implementation and reuse it in the subsequent formats?

Hi,

Dhaka doesn’t support combining parsers themselves. However, you can
always refactor the code that generates the parsers (it’s just Ruby code
after all). For example, the declaration of the syntactic elements of a
URL would be Ruby code that is executed in the context of a subclass of
Dhaka::Grammar. You can refactor this code out into a method or even a
module. I’ll see if I can come up with an example page showing how this
would be done.
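
As a rough idea of what "refactor it out into a module" could look like, here is a sketch in plain Ruby. MiniSpec is a hypothetical stand-in that only records declarations; it is not Dhaka::Grammar or Dhaka::LexerSpecification, and UrlPatterns, MailSpec and WikiSpec are names invented for the example. The only assumption carried over is that declarations are class-level method calls, as in the lexer spec at the top of the thread.

```ruby
# Hypothetical stand-in for a Dhaka specification class: it only records
# the patterns declared in the class body.
class MiniSpec
  def self.patterns
    @patterns ||= []
  end

  def self.for_pattern(pattern, &action)
    patterns << [pattern, action]
  end
end

# Shared declarations live in a module. The `included` hook replays them
# in the context of whichever spec class mixes the module in.
module UrlPatterns
  def self.included(spec_class)
    spec_class.for_pattern('http://\w+(\.\w+)+') { :url }
  end
end

class MailSpec < MiniSpec
  include UrlPatterns
  for_pattern('\w+@\w+') { :address }
end

class WikiSpec < MiniSpec
  include UrlPatterns
  for_pattern('\*') { :bullet }
end

puts MailSpec.patterns.map(&:first).inspect
puts WikiSpec.patterns.map(&:first).inspect
```

Each subclass keeps its own list, so the URL declaration is written once and shows up in both specs without them sharing state.
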

As for combining parsers (i.e. the objects that do the parsing), there
are tools called parser combinators which accomplish that. Examples:
Parsec (Haskell), JParsec (Java), RParsec (Ruby). I haven’t used any of
these, so I can’t really say any more about them. Here’s a nice blog
entry by Gilad Bracha (Smalltalk and Java luminary) on the idea behind
them:

Room 101: Parser Combinators

I hope this helps.

Mushfeq.
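
To make the parser-combinator idea a little more concrete, here is a minimal sketch in plain Ruby. It is not the API of Parsec, JParsec or RParsec; the helpers literal, pattern, sequence and either, and the constants built from them, are invented for this illustration. A parser is assumed to be a lambda taking (input, pos) and returning [value, new_pos] on success or nil on failure.

```ruby
# Matches a fixed string at the given position.
def literal(text)
  ->(input, pos) { input[pos, text.length] == text ? [text, pos + text.length] : nil }
end

# Matches a regexp anchored at the given position (\G pins the match to pos).
def pattern(regexp)
  anchored = /\G(?:#{regexp.source})/
  lambda do |input, pos|
    match = anchored.match(input, pos)
    match ? [match[0], match.end(0)] : nil
  end
end

# Runs parsers one after another; fails as soon as any of them fails.
def sequence(*parsers)
  lambda do |input, pos|
    parsers.reduce([[], pos]) do |(values, current), parser|
      value, following = parser.call(input, current)
      break nil if following.nil?
      [values + [value], following]
    end
  end
end

# Tries parsers in order and returns the first successful result.
def either(*parsers)
  lambda do |input, pos|
    result = nil
    parsers.find { |parser| result = parser.call(input, pos) }
    result
  end
end

# IPV4 is defined once and reused inside two larger formats.
OCTET    = pattern(/\d{1,3}/)
DOT      = literal('.')
IPV4     = sequence(OCTET, DOT, OCTET, DOT, OCTET, DOT, OCTET)
HOSTNAME = pattern(/\w+(\.\w+)+/)
URL      = sequence(literal('http://'), either(IPV4, HOSTNAME))
ENDPOINT = sequence(IPV4, literal(':'), pattern(/\d+/))

p URL.call('http://dhaka.rubyforge.org', 0)
p ENDPOINT.call('127.0.0.1:8080', 0)
```

The point of the sketch is only that parsers are ordinary values, so reusing the IPv4 definition in other formats is just passing it to another combinator.
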

2007/3/11, Mushfeq K.:

> module. I’ll see if I can come up with an example page showing how this
> would be done.

Yeah, if I understand correctly, the actions are named and executed by
the lexer/parser in a specific context. So you just have to override or
extend an action to plug in your other parser.

> As for combining parsers (i.e. the objects that do the parsing), there
> are tools called parser combinators which accomplish that. Examples:
> Parsec (Haskell), JParsec (Java), RParsec (Ruby). I haven’t used any of
> these, so I can’t really say any more about them. Here’s a nice blog
> entry by Gilad Bracha (Smalltalk and Java luminary) on the idea behind
> them:
>
> Room 101: Parser Combinators
>
> I hope this helps.

Thanks a lot for that interesting read :)