Bytecode Compiler (#100)

I’m very glad this quiz came along, because I honestly had no idea what
“bytecode” meant before I read this quiz. I think it was good for
dumbing the
idea down for the masses (or just me, if everyone else already knew this
stuff).
Not all of us port entire virtual machines, as the quiz creator does.

There were a couple of approaches to this week’s quiz. One strategy was
to
convert the expression into valid Ruby that would emit the bytecode and
let Ruby
do the parsing. This works for this parsing problem because Ruby’s
expression
rules are comparable to those specified by the quiz. I think we can
break down
one of those solutions to better understand what was involved with this
problem.

Here’s a chunk of code from Cameron Pope’s solution:

module CreateExpressions
  def +(other) Expr.new(:add, self, other) end
  def -(other) Expr.new(:sub, self, other) end
  def *(other) Expr.new(:mul, self, other) end
  def /(other) Expr.new(:div, self, other) end
  def %(other) Expr.new(:mod, self, other) end
  def **(other) Expr.new(:pow, self, other) end
end

You can see a set of operator definitions here, matching those needed by
the
quiz. Instead of these methods doing what they normally do (math), the
versions
we see here just create and return an Expr object. Let’s have a look at
that
definition now:

class Expr
  include CreateExpressions
  OPCODES = {:add => 0x0a, :sub => 0x0b, :mul => 0x0c, :pow => 0x0d,
    :div => 0x0e, :mod => 0x0f}

  def initialize(op, a, b)
    @op = op
    @first = a
    @second = b
  end

  def to_s
    "(#{@op.to_s} #{@first.to_s} #{@second.to_s})"
  end

  def emit
    @first.emit << @second.emit << OPCODES[@op]
  end
end

Ignoring the to_s() method, which is just for debugging, this object is
trivial.
It stores an operator and operands, and can emit() bytecode. To do
that, it
forwards the emit() call to both operands, which may be nested Expr
objects (say
from parenthetical expressions) or something else we will discuss in
just a
moment. emit() also looks up the operator’s bytecode and appends that
to the
results of the operand emit()s.

Note that this class include()s all of the operator definitions we saw
earlier.
This means you can add, multiply, etc. Expr objects, which yields a new
Expr
with the original Epxrs nested in the operands.

That covers expressions, but what about plain old numbers? For that,
there is
another class:

class Const
  include CreateExpressions
  OPCODES = {2 => 0x01, 4 => 0x02}

  def initialize(i)
    @value = i
  end

  def to_s
    @value
  end

  def emit
    case @value
      when (-32768..32767): bytes = [@value].pack("n").unpack("C*")
      else bytes = [@value].pack("N").unpack("C*")
    end
    bytes.insert 0, OPCODES[bytes.size]
  end
end

We have pretty much the same pattern here, save that emit() converts the
number
to the bytecode for the properly sized constant plus the packed bytes.
Again we
see the arithmetic operators include()ed, so these too can be combined
with
other Const and Expr objects.

Now we need a means to turn the normal numbers of the expression into
Const
objects:

class Fixnum
  def to_const
    Const.new(self)
  end
end

Yep, that will do just that.

All that’s left is to put it to use:

class Compiler
  def self.compile(expr)
    self.mangle(expr).emit.flatten
  end

  def self.explain(expr)
    self.mangle(expr).to_s
  end

private
  def self.mangle(expr)
    eval(expr.gsub(/\d+/) {|s| "#{s}.to_const()"})
  end
end

The mangle() method is the heart of this solution. A simple Regexp is
used to
tack a to_const() call on to each number of the expression and a
hand-off is
made to eval() to build up the indicated combination of Const and Expr
objects.
Once mangle()d, the result is emit()ted and flatten()ed into the final
bytecode
Array in compile().

That’s easy enough to understand and it works just fine in this
instance.
However, what if the rules differed from Ruby’s and you couldn’t lean on
Ruby’s
parser? In that case, you would have to roll your own parser, which
some
decided to do.

Luckily there is already a popular algorithm for unrolling infix
arithmetic
expressions into the RPN order required by bytecode spec:

http://en.wikipedia.org/wiki/Shunting_yard_algorithm

Several people employed this strategy, including Daniel M.:

# This is a solution to Ruby Q. #100
#
# It's basically just a shunting algorithm, but with a twist
# since it needs to distinguish between a "-" that's part of
# a number and a "-" that's an operator.  To do that, I use
# a state machine while parsing to remember if I need next
# an operator or an integer.

require 'strscan'
class Compiler
  # A small class made so that I can use case ... when
  # with a StringScanner
  class Token < Regexp
    def initialize(re)
      super(re)
    end
    # Using is_a? instead of respond_to? isn't very duck-typey,
    # but unfortunately String#scan and StringScanner#scan mean
    # completely different things.
    def ===(s)
      if (s.is_a?(StringScanner))
        s.scan(self)
      else
        super(s)
      end
    end
  end

  # ...

This first class of Daniel’s solution is really just a Regexp with a
custom case
equals method. This sets up an elegant syntax for the solution proper
we will
encounter in a bit.

  # ...

  # The tokens I need
  WSPACE = Token.new(/\s+/)
  LPAREN = Token.new(/\(/)
  RPAREN = Token.new(/\)/)
  OP  = Token.new(/\*\*|[+*%\/-]/)
  NEG = Token.new(/-/)
  INT = Token.new(/\d+/)

  OpValMap = {'+' => 0x0a, '-' => 0x0b, '*' => 0x0c,
              '**' => 0x0d, '/' => 0x0e, '%' => 0x0f}

  # ...

Here you see the class used to build the Tokens for lexxing the
expression
content. These are very basic regular expressions.

Here are the interface methods required by the quiz solution:

  # ...

  def initialize(instring)
    @scanner = StringScanner.new(instring)
    @opstack = Array.new
    @outarr = Array.new
  end

  def compile()
    state = :state_int
    while state != :state_end
      case @scanner
      when WSPACE
        next
      else
        state = send(state)
        raise "Syntax error at index #{@scanner.pos}" if ! state
      end
    end
    while ! @opstack.empty?
      op = @opstack.pop
      raise "Mismatched parens" if LPAREN === op
      @outarr << OpValMap[op]
    end
    @outarr
  end

  # Class method as required by the test harness
  def self.compile(instring)
    new(instring).compile
  end

  # ...

Notice that initialize() sets up the stack and output Arrays needed by
the
Shunting Yard algorithm. The compile() method wraps a state machine we
will
examine the workings of shortly. You can see that is discards
whitespace as
needed, forwards to state methods, pop()s the final operators as the
algorithm
requires, and returns the resulting output Array. The class compile()
is just a
wrapper over the other two methods.

The heart of what is going on here is hidden in the state methods. From
compile() we saw that the initial state for the machine is :state_int,
so let’s
begin with that method:

  # ...

  private

  # ...

  # state where we're expecting an integer or left paren
  def state_int
    case @scanner
    when LPAREN
      @opstack << @scanner.matched
      :state_int
    when INT
      integer(@scanner.matched.to_i)
      :state_op
    when NEG
      :state_neg
    end
  end

  # ...

Here we begin to see the magic of the Token class. Matching a Token
advances
the StringScanner index and matched() can then be used to grab the
element.

The LPAREN when clause is right out of the algorithm description, and
the INT
clause is pretty obviously if you know that integer() handles the
constant
conversion. The NEG clause is needed to distinguish an unary minus from
the
binary operator. Note that each case returns the next expected state
for the
machine.

Here’s the integer() helper we have seen before:

  # ...

  # Handle an integer
  def integer(i)
    if (i <= 32767 and i >= -32768)
      @outarr << 0x01
      @outarr.push(*([i].pack("n").unpack("C*")))
    else
      @outarr << 0x02
      @outarr.push(*([i].pack("N").unpack("C*")))
    end
  end

  # ...

The only difference here is that constants are pushed onto the output
Array as
required by the algorithm.

Let’s return to the state methods:

  # ...

  # Expecting an operator or right paren
  def state_op
    case @scanner
    when RPAREN
      while not LPAREN === @opstack[-1]
        raise "Mismatched parens" if @opstack.empty?
        @outarr << OpValMap[@opstack.pop]
      end
      @opstack.pop
      :state_op
    when OP
      op = @scanner.matched
      while is_lower(@opstack[-1], op)
        @outarr << OpValMap[@opstack.pop]
      end
      @opstack << op
      :state_int
    else
      # I would handle this with an EOS token, but unfortunately
      # StringScanner is broken w.r.t. @scanner.scan(/$/)
      :state_end if @scanner.eos?
    end
  end

  # ...

Again, the RPAREN clause is right out of the algorithm description. The
OP
clause is as well and you can see that it handles the precedence check
via the
is_lower() helper method. The else clause gives us our exit state when
the
expression has been exhausted.

Here’s the is_lower() helper:

  # ...

  # Define the precedence order
  # One thing to note is that for an operator a,
  # is_lower(a,a) being true will make that operator
  # left-associative, while is_lower(a,a) being false
  # makes that operator right-associative.  Note that
  # we want ** to be right associative, but all other
  # operators to be left associative.
  def is_lower(op_on_stack, op_in_hand)
    case op_on_stack
      when nil, LPAREN; false
      when /\*\*|[*\/%]/; op_in_hand =~ /^.$/
      when /[+-]/;        op_in_hand =~ /[+-]/
    end
  end

  # ...

The comment surely explains this better than I can, but the point of
this method
is to resolve whether or not the operator we just matched is lower in
precedence
than the operator on the stack. For example, in the last line, when we
have a
plus or minus on the stack only another plus or minus would trigger the
true
result.

Here’s the final state:

  # ...

  # The state where we've seen a minus and are expecting
  # the rest of the integer
  def state_neg
    case @scanner
    when INT
      integer(-(@scanner.matched.to_i))
      :state_op
    end
  end

  # ...
end

This just reads the constant following a negation operator and ensures
that it
is negated.

Those are two different approaches that pass the quiz tests. It’s
probably
slightly easier to lean on Ruby in cases like this where you know you
can get
away with it. If you can’t though, the custom parser isn’t too much
more work
as Daniel shows.

My thanks to all who taught me many things I didn’t know about bytecode
compilation through their solutions and to Ross B. for the
educational
quiz.

Tomorrow we will play with VCRs and I’m told that dates me…