Scan for Tokens

I am looking for the best way to break an input string into individual
tokens (I do not want to use a lexer library); I found some Ruby
programs that do it by “nibbling” at the string, like this (for
simplicity, the tokens are simply printed):
str = "20 * sin(x) + …"

while str.length > 0
  if    str.sub!(/\A\s*(\d+)/) { |m| puts "nr: #{m}" ; '' }
  elsif str.sub!(/\A\s*(\w+)/) { |m| puts "func: #{m}" ; '' }
  # … (further token patterns)
  end
end

This works, but it is very inefficient, as the string has to be
continuously modified. (A variation is to use str.match and then set
str = post_match; that is probably even worse.)
I was looking for the equivalent of what Perl calls “walking the string”
(if $str =~ /\G …/gcxms), picking up one token at a time, starting at the
point where the previous one ended.
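For comparison, Ruby can get fairly close to the Perl idiom: String#match
accepts a start offset, and \G then anchors the pattern at that offset. What
follows is only a minimal sketch, assuming Ruby 1.9 or later; walk_tokens, the
token labels and the catch-all operator pattern are made up for illustration.

```ruby
# Walk the string with \G and an explicit offset; the string is never modified.
# walk_tokens and the labels :nr / :name / :op are hypothetical names.
def walk_tokens(str)
  tokens = []
  pos = 0
  while pos < str.length
    if (m = str.match(/\G\s*(\d+)/, pos))
      tokens << [:nr, m[1]]
    elsif (m = str.match(/\G\s*(\w+)/, pos))
      tokens << [:name, m[1]]
    elsif (m = str.match(/\G\s*(\S)/, pos))
      tokens << [:op, m[1]]      # any other single non-space character
    else
      break                      # only trailing whitespace is left
    end
    pos = m.end(0)               # resume right after the previous token
  end
  tokens
end

p walk_tokens("20 * sin(x)")
```

Each pattern is tried only at pos, so nothing is copied or rewritten between
tokens.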

I saw in the Pickaxe the mention of \G with scan, but I could not make
scan work “one token at a time”; I had to list all the tokens in the
argument, and then I had to find out which token had hit, i.e.:

str.scan(/\G\s* (\d+ | [*] | [+] | [(] | …)/xm) do |m|
  if    m[0].match(/\A\d+\z/) then puts "number: #{m[0]}"
  elsif m[0].match(/\A[*]\z/) then puts "power: #{m[0]}"
  # … (further token types)
  end
end

It worked perfectly (almost to my surprise!), but it seems odd
(un-Ruby-like) to have to repeat the tokens (even though in my real code
I used regexp variables to avoid hardcoding them twice, it is still a
repetition).
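One way to avoid testing each token a second time is to give every alternative
a named group and ask the match which group fired. This is only a sketch under
assumptions: named groups need Ruby 1.9 or later, and the TOKEN pattern, the
group names and the tokenize helper are invented for the example, not taken
from the real code.

```ruby
# One regexp lists every token type once; the name of the group that matched
# identifies the type. TOKEN, its group names and tokenize are hypothetical.
TOKEN = /\G\s*(?:(?<number>\d+)|(?<power>\*\*)|(?<name>\w+)|(?<op>[-+*\/()]))/

def tokenize(str)
  tokens = []
  str.scan(TOKEN) do
    m = Regexp.last_match                 # MatchData for the current token
    kind = m.names.find { |n| m[n] }      # the only group that is not nil
    tokens << [kind.to_sym, m[kind]]
  end
  tokens
end

p tokenize("2 ** 3 + x")
```

Because of the \G anchor, scan simply stops at the first character that no
alternative accepts, so real code would want to check that the whole input
was consumed.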

I looked at four Ruby books and found only platitudes on the subject (or
references to libraries). I would love to hear an elegant way to solve
this.

Thanks!

Raul

On Nov 10, 6:07 pm, Raul P. wrote:

I am looking for the best way to break an input string into individual
tokens (I do not want to use a lexer library)

Look at the StringScanner library[1] included with Ruby. It’s simple,
and it’s fast. It’s the basis of my TagTreeScanner library[2], which
is specialized for parsing arbitrary text and converting it into
hierarchically nested markup (e.g. XML).

[1] http://ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
[2] http://phrogz.net/RubyLibs/OWLScribble/doc/tts.html
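As a rough illustration of the StringScanner approach, here is a minimal
sketch of a tokenizer for the expression in the original question. The helper
name scan_tokens and the token labels are assumptions made for the example;
scan, skip, matched, getch and eos? are StringScanner methods from strscan.

```ruby
require 'strscan'

# StringScanner keeps an internal position, so the string itself is never
# modified. scan_tokens and the labels :nr / :name / :op are hypothetical.
def scan_tokens(str)
  ss = StringScanner.new(str)
  tokens = []
  until ss.eos?
    ss.skip(/\s+/)                 # drop whitespace between tokens
    break if ss.eos?
    if ss.scan(/\d+/)
      tokens << [:nr, ss.matched]
    elsif ss.scan(/\w+/)
      tokens << [:name, ss.matched]
    else
      tokens << [:op, ss.getch]    # any other single character
    end
  end
  tokens
end

p scan_tokens("20 * sin(x)")
```

Each scan call only matches at the current position and advances it on
success, which is the "one token at a time" behaviour asked for.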

Gavin

I was surprised at first that this basic capability was in a library,
but StringScanner works beautifully, and it is indeed extremely fast.

I will try your TagTreeScanner at the first chance.

Thank you

Raul