require 'dhaka'
class MyLexerSpec < Dhaka::LexerSpecification
  for_pattern('\w+') { puts "word #{current_lexeme.value}" }
  for_pattern('\s+') { puts "whitespace" }
  # Operator characters such as * ( ) and . are backslash-escaped to match literally.
  for_pattern('\*') { puts "bullet point" }
  for_pattern('(\.|,|\'|-|\(|\)|:)') do
    puts "punctuation #{current_lexeme.value}"
  end
  for_pattern('http://\w+(\.\w+)+(/(\w|_)+)*(\.\w+)?') do
    puts "url #{current_lexeme.value}"
  end
end
Dhaka::Lexer.new(MyLexerSpec).lex("
This release of Dhaka adds the ability to generate lexers from a
specification such as the above.
Lexer generation works just like parser generation in Dhaka - you
write a spec, generate a lexer, then 'compile' it to Ruby. The
cool thing about this is that although it looks like a
straightforward application of Ruby regexes, it's not. Under the
covers, the regex engine is Dhaka's own, implemented in Ruby using
Dhaka itself. Having full control of the engine makes certain
wonderful things possible, but a disadvantage is that not all of
Ruby's awesome regex operators are supported. We're going for a
subset of the Ruby regex language - you can see the current
grammar at http://dhaka.rubyforge.org/regex_grammar.html
(apologies if this looks not-so-human-readable). In future
versions, we'll be implementing assertions, curly-brace quantifier
expressions and the lookahead operator (the symbols for these have
been reserved and must be escaped).
The examples on the homepage have been updated to use the lexer
generator where appropriate (the hand-written tokenizer is still
a nice quick-and-dirty solution in some cases).
Other changes:
- We've added the ability to specify parser actions (code blocks
that are invoked when a particular syntactic construct is
recognized). This feature isn't documented yet, but it's available
for the adventurous and something will be written up about it soon.
- Compiled parsers are much smaller than they used to be.
- A couple of bugs relating to the use of escape characters in
grammar symbol names were fixed.
This release is backwards-compatible. Please let me know if you
find out otherwise.
Mushfeq.
").each {|tok|}
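
For readers without Dhaka installed, here is a rough stand-in for what the spec above does, written with Ruby's standard StringScanner and built-in Regexp. It is only a sketch and not Dhaka's engine or API: the names RULES and tokenize are made up for illustration, and the longest-match rule (with ties going to the rule listed first) is an assumption about how competing patterns are resolved.

```ruby
require 'strscan'

# Illustrative stand-in only: Ruby's built-in Regexp, not Dhaka's own engine.
# RULES is a hypothetical name; the patterns mirror the spec above.
RULES = [
  [:word,        /\w+/],
  [:whitespace,  /\s+/],
  [:bullet,      /\*/],
  [:punctuation, /[.,'\-():]/],
  [:url,         %r{http://\w+(\.\w+)+(/\w+)*(\.\w+)?}]
].freeze

# Returns [type, text] pairs, taking the longest match at each position.
# Assumed tie-break: the rule listed first wins.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    best = nil
    RULES.each do |type, pattern|
      len = scanner.match?(pattern) # match length at the current position, or nil
      best = [type, len] if len && (best.nil? || len > best[1])
    end
    raise ArgumentError, "unexpected character at offset #{scanner.pos}" unless best
    tokens << [best[0], scanner.peek(best[1])]
    scanner.pos += best[1]
  end
  tokens
end

tokenize("* see http://dhaka.rubyforge.org/regex_grammar.html").each do |type, text|
  puts "#{type} #{text.inspect}"
end
```

Because the url rule matches more characters than the word rule at the same position, the whole address comes out as one url token instead of a word followed by punctuation.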