Bytecode Compiler (#100)

The three rules of Ruby Q.:

  1. Please do not post any solutions or spoiler discussion for this quiz
    until
    48 hours have passed from the time on this message.

  2. Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

  1. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Ross B.

Note: This quiz isn’t really as much work as it might seem!

This quiz involves writing (in Ruby, of course) a compiler for basic
arithmetic
expressions. The output from this compiler should be an array of
unsigned
byte-sized ints, which can be fed into the included interpreter
(http://www.rubyquiz.com/interp.rb) in order to execute the compiled
expression.

The bytecode format is very simple, while the interpreter is also very
simple,
implemented as a stack machine. They have the following general
characteristics:

* Bytecode is stored as an array of unsigned byte-sized Fixnums.
* All stack-bound numbers are signed integers
* The following operations are supported:
  * Addition
  * Subtraction
  * Multiplication
  * Division
  * Raise to power
  * Integer modulo
* Where an operator would return a floating point value,
  the value is truncated to an integer.
* Short CONST and long LCONST instructions allow constants
  to be pushed to the stack. These instructions expect their
  operands to hold a signed short or long, respectively,
  in network byte order.

Your compiler interface should be via a singleton method on a module
‘Compiler’,
taking a string, such that:

Compiler.compile('3+2')

Returns an array of instructions (and operands) that, when fed to the
interpreter, will execute the expression 3+2 and return the result
(hopefully
5). For example, a correct (but non-optimal) result from the above call
would
be:

[2,0,0,0,3,   # LCONST 3
 2,0,0,0,2,   # LCONST 2
 10]          # ADD

Your compiler should support all basic arithmetic operations and
explicit
precedence (parenthesis). As standard, syntax/precedence/
associativity/etc.
should follow Ruby itself. Obviously, specific implementation is
entirely up to
you, though bear in mind that your compiler must be capable of running
inline in
the same Ruby process as the interpreter, without affecting any code
outside
itself.

The quiz also includes a number of tests
(http://www.rubyquiz.com/test_bytecode.rb) that will test your
compiler’s
functionality, with expressions becoming more complex as the tests go
on. To
pass all the tests, a compiler will have to not only generate correct
bytecode,
but it will also need to generate the shortest code it can for a given
expression.

Here is the bytecode spec:

# 0x01: CONST (cbyte1, cbyte2) ... => ..., const
  Push a 15-bit signed integer to the stack.
  The two bytes immediately following the instruction represent the
  constant.

# 0x02: LCONST (cbyte1, cbyte2, cbyte3, cbyte4) ... => ..., const
  Push a 31-bit signed integer to the stack.
  The four bytes immediately following the instruction represent the
  constant.

# 0x0a: ADD () ..., value1, value2 => ..., result
  Pop the top two values from the stack, add them, and push the result
  back onto the stack.

# 0x0b: SUB () ..., value1, value2 => ..., result
  Pop the top two values from the stack, subtract value2 from value1,
  and push the result back onto the stack.

# 0x0c: MUL () ..., value1, value2 => ..., result
  Pop the top two values from the stack, multiply value1 by value2,
  and push the result back onto the stack.

# 0x0d: POW () ..., value1, value2 => ..., result
  Pop the top two values from the stack, raise value1 to the power of
  value2, and push the result back onto the stack.

# 0x0e: DIV () ..., value1, value2 => ..., result
  Pop the top two values from the stack, divide value1 by value2,
  and push the result back onto the stack.

# 0x0f: MOD () ..., value1, value2 => ..., result
  Pop the top two values from the stack, modulo value1 by value2,
  and push the result back onto the stack.

# 0xa0: SWAP () ..., value1, value2 => ..., value2, value1
  Swap the top two stack values.

require ‘interp’

module Compiler

use eval and Value class below to compile

expression into bytecode

def Compiler.compile(s)
s.gsub!(/([0-9]+)/, ‘Value.new(stack, \1)’)
stack = []
eval(s)
stack
end

class Value
attr_reader :number # constant value or nil for on stack
ON_STACK = nil

 def initialize(stack, number)
   @number = number
   @stack = stack
 end

 # generate code for each binary operator (except -@)
 # algorithm:
 # push constants (or don't if already on stack)
 # swap if necessary
 # push bytecode
 # create stack item
 {'+' => Interpreter::Ops::ADD,
  '-' => Interpreter::Ops::SUB,
  '*' => Interpreter::Ops::MUL,
  '**'=> Interpreter::Ops::POW,
  '/' => Interpreter::Ops::DIV,
  '%' => Interpreter::Ops::MOD}.each do |operator, byte_code|
    Value.module_eval <<-FUNC
     def #{operator}(rhs)
       push_const(@number)
       push_const(rhs.number)
       # may need to swap integers on stack for all but plus
       #{
         if operator != "+"
           "@stack << Interpreter::Ops::SWAP if rhs.number == nil &&
                                                @number != nil"
         end
       }
       @stack << #{byte_code}
       Value.new(@stack, ON_STACK)
     end
    FUNC
 end

 def -@
   if @number != ON_STACK
     @number = -@number
     push_const(@number)
   else
     push_const(@number)
     push_const(0)
     @stack << Interpreter::Ops::SWAP
     @stack << Interpreter::Ops::SUB
   end
   Value.new(@stack, ON_STACK)
 end

 def push_const(number)
   if number != ON_STACK
     if (-32768..32767).include?(number)
       @stack << Interpreter::Ops::CONST
     else
       @stack << Interpreter::Ops::LCONST
       @stack << ((number >> 24) & 0xff)
       @stack << ((number >> 16) & 0xff)
     end
     @stack << ((number >> 8) & 0xff)
     @stack << (number & 0xff)
   end
 end

end
end

On 11/3/06, Ruby Q. [email protected] wrote:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Ross B.

Note: This quiz isn’t really as much work as it might seem!

This quiz involves writing (in Ruby, of course) a compiler for basic arithmetic
expressions. The output from this compiler should be an array of unsigned
byte-sized ints, which can be fed into the included interpreter
(http://www.rubyquiz.com/interp.rb) in order to execute the compiled expression.

Well… this took me longer than it probably should have.
I didn’t find out about this:

…until I was pretty much done anyway.

Also, helpful tip I’d like to send back to my past self… Regular
expressions aren’t powerful enough to find matched parentheses.

Anyway, thanks for the quiz. Fun stuff.

class Compiler
def self.compile(input)
@bytecodes = {‘+’ => 0x0a, ‘-’ => 0x0b, ‘*’ => 0x0c, ‘**’ => 0x0d,
‘/’ => 0x0e, ‘%’ => 0x0f}
encode postfix(input)
end

def self.encode(tokens)
tokens.collect do |token|
number = token =~ /-?\d+/ ? token.to_i : nil
if (-32768…32767).include?(number)
[0x01] + [number].pack(‘n’).split(//).collect {|byte| byte[0]}
elsif !number.nil? # long
[0x02] + [number].pack(‘N’).split(//).collect {|byte| byte[0]}
else
@bytecodes[token]
end
end.flatten
end

def self.postfix(infix)
stack, stream, last, op = [], [], nil, nil
tokens = infix.scan(/\d+|**|[-+*/()%]/)
tokens.each_with_index do |token,i|
case token
when /\d+/; stream << token
when *@bytecodes.keys
if token == ‘-’ and last.nil? || (last =~ /\D/ && tokens[i+1] =~
/\d/)
tokens[i+1] = “-#{tokens[i+1]}”
else
stream << stack.pop while stack.any? && preceded?(stack.last,
token)
stack << token
end
when ‘(’; stack << token
when ‘)’; stream << op while (op = stack.pop) && (op != ‘(’)
end
last = token
end
stream << op while op = stack.pop
stream
end

def self.preceded?(last, current)
ops = {‘+’ => 1, ‘-’ => 1, ‘%’ => 2, ‘/’ => 2, ‘*’ => 2, ‘**’ =>
3, ‘(’ => 0, ‘)’ => 0}
ops[last] >= ops[current] rescue true
end
end

The simplest way I found to do this problem was to let Ruby do the
legwork of parsing the expression for me, so I didn’t have to worry
about things like parens or operator precedence.

I defined a Const and an Expr class and defined operators inside of
them to create a parse tree, added a to_const method to Fixum and
created a regular expression to convert an expression 1+1 to
‘1.to_const() + 1.to_const()’, so evaluating that expression would
produce a parse tree for that experssion.

Emitting the bytescodes is then a post-order traversal of the parse
tree.

---- Compiler.rb

Operator overrides to create an expression tree. Mixed into

Const and Expr so:

Const Const => Expr

Const Expr => Expr

Expr Const => Expr

module CreateExpressions
def +(other) Expr.new(:add, self, other) end
def -(other) Expr.new(:sub, self, other) end
def *(other) Expr.new(:mul, self, other) end
def /(other) Expr.new(:div, self, other) end
def %(other) Expr.new(:mod, self, other) end
def **(other) Expr.new(:pow, self, other) end
end

Add a method to fixnum to create a const from an integer

class Fixnum
def to_const
Const.new(self)
end
end

An integer value

class Const
include CreateExpressions

Opcodes to push shorts or longs respectively onto the stack

OPCODES = {2 => 0x01, 4 => 0x02}

def initialize(i)
@value = i
end

def to_s
@value
end

Emits the bytecodes to push a constant on the stack

def emit
# Get the bytes in network byte order
case @value
when (-32768…32767): bytes = [@value].pack(“n”).unpack(“C*”)
else bytes = [@value].pack(“N”).unpack(“C*”)
end
bytes.insert 0, OPCODES[bytes.size]
end
end

A binary expression

class Expr
include CreateExpressions
OPCODES = {:add => 0x0a, :sub => 0x0b, :mul => 0x0c, :pow => 0x0d,
:div => 0x0e, :mod => 0x0f}

def initialize(op, a, b)
@op = op
@first = a
@second = b
end

Emits a human-readable s-expression for testing

(preorder traversal of parse tree)

def to_s
“(#{@op.to_s} #{@first.to_s} #{@second.to_s})”
end

Bytecode emitter for an expression (postorder traversal of parse

tree)
def emit
# emit LHS, RHS, opcode
@first.emit << @second.emit << OPCODES[@op]
end
end

Compile and print out parse tree for expressions

class Compiler

Creates bytecodes from an arithmatic expression

def self.compile(expr)
self.mangle(expr).emit.flatten
end

Prints a representation of the parse tree as an S-Expression

def self.explain(expr)
self.mangle(expr).to_s
end

private

Name-mangles an expression so we create a parse tree when calling

Kernel#eval instead of evaluating the expression:

[number] => [number].to_const()

def self.mangle(expr)
eval(expr.gsub(/\d+/) {|s| “#{s}.to_const()”})
end
end

On Nov 5, 2006, at 9:43 AM, Wilson B. wrote:

Also, helpful tip I’d like to send back to my past self… Regular
expressions aren’t powerful enough to find matched parentheses.

Perl’s regex engine can do this, as can Ruby 1.9’s Oniguruma engine.
Both allow recursive definitions, which is what it takes.

James Edward G. II

On Nov 5, 2006, at 11:18 AM, Dennis Ranke wrote:

Here is my solution, a simple recursive descent parser. It’s a bit
more code than is strictly necessary because it is loosely modeled
after a parser for a real language I have written recently.

Did you write that language in Ruby? I’m just curious.

James Edward G. II

James Edward G. II wrote:

On Nov 5, 2006, at 11:18 AM, Dennis Ranke wrote:

Here is my solution, a simple recursive descent parser. It’s a bit
more code than is strictly necessary because it is loosely modeled
after a parser for a real language I have written recently.

Did you write that language in Ruby? I’m just curious.

I wrote the (fairly simple) compiler in Ruby and the bytecode
interpreter in C++. The whole thing is used as a scripting language for
a Nintendo DS game.

Here is my solution, a simple recursive descent parser. It’s a bit more
code than is strictly necessary because it is loosely modeled after a
parser for a real language I have written recently.

require ‘English’

class Compiler
def self.compile(expr)
return self.new(expr).compile.unpack(‘C*’)
end

def initialize(expr)
# a very simple tokenizer
@tok = []
until expr.empty?
case expr
when /\A\s+/ # skip whitespace
# don’t tokenize ‘1-1’ as ‘1’, ‘-1’
when (@tok.last.is_a? Integer) ? /\A\d+/ : /\A-?\d+/
@tok << $MATCH.to_i
# any other character and ‘**’ are literal tokens
when /\A**|./
@tok << $MATCH
end
expr = $POSTMATCH
end
end

def compile
code = compile_expr(0)
raise “syntax error” unless @tok.empty?
return code
end

private

OPS = {’+’=>0xa, ‘-’=>0xb, ‘*’=>0xc, ‘**’=>0xd, ‘/’=>0xe, ‘%’=>0xf}

def compile_expr(level)
# get the tokens to parse at this precedence level
tok = [[’+’, ‘-’], [’*’, ‘/’, ‘%’], [’’]][level]
if tok
# if we are to actually parse a bi-op, do so
left = compile_expr(level + 1)
# for left-associative ops, find as many ops in a row as possible
while tok.include?(@tok.first)
op = OPS[@tok.shift]
# '
’ is right-associative, so add a special case for that
right = compile_expr(op == OPS[’**’] ? level : level + 1)
left << right + op.chr
end
return left
end
# if we are at a level higher than the ops, try to parse an
# atomic - either a numeral or an expression in paranthesis
tok = @tok.shift
if tok == ‘(’
expr = compile_expr(0)
raise “’)’ expected” unless @tok.shift == ‘)’
return expr
end
raise ‘number expected’ unless tok.is_a? Integer
return (tok < -32768 || tok > 32767) ? [2, tok].pack(‘CN’) :
[1, tok].pack(‘Cn’)
end
end

if $0 == FILE
p Compiler.compile(ARGV[0])
end

This is my first RubyQuiz submission. Like Cameron Pope, I decdied to
let
Ruby’s parser do the heavy lifting by monkey-patching a to_expr onto
Integer
and running a regex on Compiler::compile’s input.

This differes from his entry by using method_missing to handle all
operators, and I didn’t bother with separate number/expression classes.

Also, his byte conversion looks much shorter – I’ll have to see how it
works and wrap my head around Array#pack.

I’ve dabbled in Ruby for a while, but I still feel like I’m at the
hello_world stage; constructive criticism is quite welcome.

Compiler.rb

class Integer

Much easier than properly parsing a string for Exper.new(???)

If I really wanted to avoid monkey-patching, I could define

Expr#-@ and expr#+@ instead.

def to_expr
Compiler::Expr.new self
end
end

module Compiler
CONST = 215
LCONST = 2
31

def Compiler.compile input
# I initially tried Expr.new(\1), but this way lets me use Ruby’s
# sign-parsing.
m = input. gsub /(\d+)(\D*)/, ‘\1.to_expr()\2’
exp = eval m
exp.compile
end

The meat of this module…

Rather than do any parsing, I’m just converting all the expression’s

numbers

into Expr objects, whose ±/*, etc. methods (all method_missing)

just
build

a parse-tree when I run eval.

class Expr
attr_reader :val

OPERATORS = { :+ => 0x0a,
            :- => 0x0b,
            :* => 0x0c,
            :** => 0x0d,
            :/ => 0x0e,
            :% => 0x0f,
            :swap => 0xa0 } # Swap doesn't have an operator, but

whatever.

def initialize *v
  @val = v
end

# Take care of all those operators
def method_missing sym, *args
  if OPERATORS.include? sym
    Expr.new [val, args.first, sym]
  else
    raise "Unknown operator: #{sym}, #{args.inspect}"
  end
end

def to_s
  "Expr: <#{flatten.join ' '}>"
end

def flatten
  # Flatten the array as much as we can, then tackle any Expr 

objects.
# Finally, make sure the result is also flat
# (because the map turns each Expr into an array)
val.flatten.map do |i|
if i.respond_to? :flatten
i.flatten
else
i
end
end.flatten
end

def compile
  # Get a flat copy of our value, then encode each number and 

symbol.
# Finally, flatten all the encoded numbers into our answer.
arr = flatten
arr.map do |i|
if i.is_a? Integer
bytes_for(i)
elsif OPERATORS.include? i
OPERATORS[i]
else
# What’s the preferred method of dealing with this?
# I could raise a different exception, or attempt to call the
same
# method in my superclass…
raise “Unknown operator: #{i.inspect}, #{i.class}”
end
end.flatten
end

# Convert a number to bytes
def bytes_for number
  type = size = 0
  values = []
  if number < CONST and number >= -CONST
    type, size = 1, 2
  elsif number < LCONST and number >= -LCONST
    type, size = 2, 4
  else
    raise "#{number} is too big to encode!"
  end

  size.times do |s|
    number, byte = number.divmod 256
    # I could use << here, but then I'd need to reverse values.
    values.unshift byte
  end
  [type, *values]
end

end #expr

end #Compiler

This is a solution to Ruby Q. #100

It’s basically just a shunting algorithm, but with a twist

since it needs to distinguish between a “-” that’s part of

a number and a “-” that’s an operator. To do that, I use

a state machine while parsing to remember if I need next

an operator or an integer.

require ‘strscan’
class Compiler

A small class made so that I can use case … when

with a StringScanner

class Token < Regexp
def initialize(re)
super(re)
end
# Using is_a? instead of respond_to? isn’t very duck-typey,
# but unfortunately String#scan and StringScanner#scan mean
# completely different things.
def ===(s)
if (s.is_a?(StringScanner))
s.scan(self)
else
super(s)
end
end
end

The tokens I need

WSPACE = Token.new(/\s+/)
LPAREN = Token.new(/(/)
RPAREN = Token.new(/)/)
OP = Token.new(/**|[+*%/-]/)
NEG = Token.new(/-/)
INT = Token.new(/\d+/)

OpValMap = {’+’ => 0x0a, ‘-’ => 0x0b, ‘*’ => 0x0c,
‘**’ => 0x0d, ‘/’ => 0x0e, ‘%’ => 0x0f}

def initialize(instring)
@scanner = StringScanner.new(instring)
@opstack = Array.new
@outarr = Array.new
end

def compile()
state = :state_int
while state != :state_end
case @scanner
when WSPACE
next
else
state = send(state)
raise “Syntax error at index #{@scanner.pos}” if ! state
end
end
while ! @opstack.empty?
op = @opstack.pop
raise “Mismatched parens” if LPAREN === op
@outarr << OpValMap[op]
end
@outarr
end

Class method as required by the test harness

def self.compile(instring)
new(instring).compile
end

private

Expecting an operator or right paren

def state_op
case @scanner
when RPAREN
while not LPAREN === @opstack[-1]
raise “Mismatched parens” if @opstack.empty?
@outarr << OpValMap[@opstack.pop]
end
@opstack.pop
:state_op
when OP
op = @scanner.matched
while is_lower(@opstack[-1], op)
@outarr << OpValMap[@opstack.pop]
end
@opstack << op
:state_int
else
# I would handle this with an EOS token, but unfortunately
# StringScanner is broken w.r.t. @scanner.scan(/$/)
:state_end if @scanner.eos?
end
end

state where we’re expecting an integer or left paren

def state_int
case @scanner
when LPAREN
@opstack << @scanner.matched
:state_int
when INT
integer(@scanner.matched.to_i)
:state_op
when NEG
:state_neg
end
end

The state where we’ve seen a minus and are expecting

the rest of the integer

def state_neg
case @scanner
when INT
integer(-(@scanner.matched.to_i))
:state_op
end
end

Handle an integer

def integer(i)
if (i <= 32767 and i >= -32768)
@outarr << 0x01
@outarr.push(([i].pack(“n”).unpack("C")))
else
@outarr << 0x02
@outarr.push(([i].pack(“N”).unpack("C")))
end
end

Define the precedence order

One thing to note is that for an operator a,

is_lower(a,a) being true will make that operator

left-associative, while is_lower(a,a) being false

makes that operator right-associative. Note that

we want ** to be right associative, but all other

operators to be left associative.

def is_lower(op_on_stack, op_in_hand)
case op_on_stack
when nil, LPAREN; false
when /**|[*/%]/; op_in_hand =~ /^.$/
when /[±]/; op_in_hand =~ /[±]/
end
end
end
END

On 11/5/06, Daniel M. [email protected] wrote:

222 was an unfortunate test case to choose, since 24 == 42.
Good catch. Here’s an updated copy of mine that:
A) Steals your cool C* unpack trick. :slight_smile:
B) Gets rid of a temporary variable and a couple of ‘pop’ loops in
exchange for some more golf.
C) Avoids re-initializing the hashes for improved performance.
D) Passes your new test_02a

class Compiler
def self.compile(input)
@bytecodes ||= {‘+’ => 0x0a, ‘-’ => 0x0b, ‘*’ => 0x0c, ‘**’ =>
0x0d, ‘/’ => 0x0e, ‘%’ => 0x0f}
encode postfix(input)
end

def self.encode(tokens)
tokens.collect do |token|
number = token =~ /-?\d+/ ? token.to_i : nil
if (-32768…32767).include?(number)
[0x01] + [number].pack(‘n’).unpack(‘C*’)
elsif !number.nil? # long
[0x02] + [number].pack(‘N’).unpack(‘C*’)
else
@bytecodes[token]
end
end.flatten
end

def self.postfix(infix)
stack, stream, last = [], [], nil
tokens = infix.scan(/\d+|**|[-+*/()%]/)
tokens.each_with_index do |token,i|
case token
when /\d+/; stream << token
when *@bytecodes.keys
if token == ‘-’ and last.nil? || (last =~ /\D/ && tokens[i+1] =~
/\d/)
tokens[i+1] = “-#{tokens[i+1]}”
else
stream << stack.pop while stack.any? && preceded?(stack.last,
token)
stack << token
end
when ‘(’; stack << token
when ‘)’; (stream += stack.slice!(stack.rindex(‘(’),
stack.size).reverse).pop
end
last = token
end
stream += stack.reverse
end

def self.preceded?(last, current)
@ops ||= {‘+’ => 1, ‘-’ => 1, ‘%’ => 2, ‘/’ => 2, ‘*’ => 2, ‘
=> 3, ‘(’ => 0, ‘)’ => 0}
@ops[last] >= @ops[current] && current != '
’ # right associative
mayhem!
end
end

Note that the creator of this quiz left out one important case from
their tests:

def test_02a
assert_equal [212],
Interpreter.new(Compiler.compile(‘212’)).run
end

This tests that your compiler is properly making **
right-associative. Some solutions already posted fail this test.

222 was an unfortunate test case to choose, since 24 == 42.

My solution is a bit different than all the ones I’ve seen so far. I
had no intention of writing an expression parser :slight_smile: All it does is

  1. Define a method (to_bc) to have a fixnum return its own bytecode
  2. Redefine the array operators to return the appropriate bytecode
    representations
  3. Add .to_bc after every number in the expression string

Here it is:

class Compiler
def self.compile(str)

eval(str.gsub(/(\d+)([^\d])/,’\1.to_bc\2’).gsub(/([^\d])(\d+)$/,’\1\2.to_bc’))
end
end

class Fixnum
def to_bc
return (self >= 0 ? [1,self/256,self%256] :
[1,(self+32768)/256+128,(self+32768)%256]) if self <= 32767 and self >=
-32768
res = [(24…30),(16…23),(8…15),(0…7)].map { |range| range.map {
|x| self[x] } }.map { |byte| byte.inject_with_index(0) { |s,x,i|
s+x*2**i } }
([2] << (self > 0 ? res[0] : res[0]+128) << res[1…3]).flatten
end
end

class Array
{:+ => 10, :- => 11, :* => 12, :** => 13, :confused: => 14, :% => 15}.each do
|op,opcode|
define_method(op) { |x| self.concat(x).concat([opcode]) }
end
def inject_with_index(sum)
each_with_index { |x,i| sum = yield(sum,x,i) }
sum
end
end

On Mon, 2006-11-06 at 11:27 +0900, Daniel M. wrote:

Note that the creator of this quiz left out one important case from
their tests:

def test_02a
assert_equal [212], Interpreter.new(Compiler.compile(‘212’)).run
end

This tests that your compiler is properly making **
right-associative. Some solutions already posted fail this test.

Phew - I expected I’d left out far more than just one important case :).
Seriously though, good catch - thanks. It looks like all the solutions
so far are passing it now.

222 was an unfortunate test case to choose, since 24 == 42.

Ahem. Oops. :slight_smile:

Changed my previous solution to use some of the things others have
used that need to be pounded into my skull like \d in regular
expression and pack/unpack. Also added unary + and cleaned up code.

############
require ‘interp’

module Compiler

compile expression into bytecode array

def Compiler.compile(s)
stack = []
eval(s.gsub(/(\d+)/, ‘Value.new(stack, \1)’))
stack
end

class Value
attr_reader :value # constant value or nil for on stack
ON_STACK = nil

 def initialize(stack, value)
   @stack = stack
   @value = value
 end

 # generate code for each binary operator
 {'+' => Interpreter::Ops::ADD,
  '-' => Interpreter::Ops::SUB,
  '*' => Interpreter::Ops::MUL,
  '**'=> Interpreter::Ops::POW,
  '/' => Interpreter::Ops::DIV,
  '%' => Interpreter::Ops::MOD}.each do |operator, byte_code|
    Value.module_eval <<-OPERATOR_CODE
     def #{operator}(rhs)
       push_if_value(@value)
       push_if_value(rhs.value)
       # swap stack items if necessary
       #{if operator != "+"
           "@stack << Interpreter::Ops::SWAP if rhs.value == nil &&
                                                @value != nil"
         end}
       @stack << #{byte_code}
       Value.new(@stack, ON_STACK)
     end
    OPERATOR_CODE
 end

 def -@
   if @value != ON_STACK
     push_if_value(-@value)
   else
     push_if_value(0)
     @stack << Interpreter::Ops::SWAP << Interpreter::Ops::SUB
   end
   Value.new(@stack, ON_STACK)
 end

 def +@
   push_if_value(@value)
   Value.new(@stack, ON_STACK)
 end

 def push_if_value(value)
   if value != ON_STACK
     if (-32768..32767).include?(value)
       @stack << Interpreter::Ops::CONST
       @stack.concat([value].pack("n").unpack("C*"))
     else
       @stack << Interpreter::Ops::LCONST
       @stack.concat([value].pack("N").unpack("C*"))
     end
   end
 end

end
end

On 03/11/06, Ruby Q. [email protected] wrote:

Note: This quiz isn’t really as much work as it might seem!

Ouch! :slight_smile: Oh well, I don’t think my Ruby is up to some of the more
cunning techniques I see many have employed.

Your compiler should support all basic arithmetic operations and explicit
precedence (parenthesis). As standard, syntax/precedence/ associativity/etc.
should follow Ruby itself.

By associativity does this also mean that “±-+1” - which could be
rewritten as “0+(0-(0-(0+1)))” - should be parsed into a bytecode
which, however well optimised, on execution results in the answer “1”?

I should warn that my solution below is rather long. I went down a
possibly more traditional route of writing a generic tokenizer/lexer.
I don’t know if these are still commonly used but I couldn’t find an
existing implementation in the Ruby Standard Library.

I also tried to document the functions using rdoc so someone else
might make use of it. For those who haven’t tried it yet, just type
‘rdoc’ at the command prompt and it makes a nice doc directory under
the current directory with an index.html to start browsing the
file/classes/methods. Nice!

Back to the task…

My wanting a solution that coped with all difficult expressions that
Ruby itself can deal with (using the lexicon allowed) meant having to
get things like the aforementioned negation with parentheses working:

-(—3) # => 3

…and power precedence (as others have pointed out) combined with
negation and parentheses turned out to be tricky:

64**-(-(-3+5)32) #=> a big number

There’s a big list of test cases such as these in my unit tests
(included).

So having written the lexer class, I now set up the state transition
table and ask the lexer to tokenize the expression. The tokens are
then parsed and the bytecode is generated using a simple mapping.

The code follows; there are two files in total.

Thanks for another fun challenge and congrats all round for reaching
the 100th Ruby Q.!

Marcel

#!/usr/bin/env ruby
##################################################################

= compiler_mw.rb - bytecode compiler

Author:: Marcel W. <wardies ^a-t^ gmaildotcom>

Documentation:: Marcel W.

Last Modified:: Monday, 06 November 2006

require ‘interp’
require ‘lexer_mw’

module Compiler

The lexer needs to know the character sets involved in deciding

which state transition will be fired…

CHAR_SETS = {
:plus => [?+], :minus => [?-],
:digit => /\d/,
:div_mod => [?/, ?%], # matches ‘/’ or ‘%’
:asterisk => [?*],
:open_paren => [?(], :close_paren => [?)]
}

Tell the lexer how to parse a datastream: which tokens to

generate, what state to switch to, etc.

This table was designed according to my vague recollection of

the dragon book on compiler construction by Aho/Sethi/Ullman.

STATE_TRANS_TABLE = {
:s_start => {
:plus => {:next_s_skip => :s_start},
:minus => {:next_s => :s_negate},
:digit => {:next_s => :s_numeric},
:open_paren => {:next_s => :s_start,
:token => :tok_open_paren}
},
:s_negate => {
:plus => {:next_s_skip => :s_negate},
:minus => {:next_s => :s_start},
:digit => {:next_s => :s_numeric},
:open_paren => {:next_s_backtrack => :s_start,
:token => :tok_negate}
},
:s_numeric => {
:plus => {:next_s_backtrack => :s_operator,
:token => :tok_int},
:minus => {:next_s_backtrack => :s_operator,
:token => :tok_int},
:digit => {:next_s => :s_numeric},
:div_mod => {:next_s_backtrack => :s_operator,
:token => :tok_int},
:asterisk => {:next_s_backtrack => :s_operator,
:token => :tok_int},
:close_paren => {:next_s_backtrack => :s_operator,
:token => :tok_int},
:eof => {:next_s_backtrack => :s_operator,
:token => :tok_int},
},
:s_operator => {
:plus => {:next_s => :s_start,
:token => :tok_add},
:minus => {:next_s => :s_start,
:token => :tok_subtract},
:div_mod => {:next_s => :s_start,
:token => :tok_div_mod},
:asterisk => {:next_s => :s_mult_or_power},
:close_paren => {:next_s => :s_operator,
:token => :tok_close_paren},
:eof => {} # when :next_s… is absent, finish
},
:s_mult_or_power => {
:plus => {:next_s_backtrack => :s_start,
:token => :tok_multiply},
:minus => {:next_s_backtrack => :s_start,
:token => :tok_multiply},
:digit => {:next_s_backtrack => :s_start,
:token => :tok_multiply},
:asterisk => {:next_s => :s_start,
:token => :tok_power},
:open_paren => {:next_s_backtrack => :s_start,
:token => :tok_multiply}
}
}

Compiles a string expression sum into bytecode and returns

the bytecode array (as per Ruby Q. 100 requirements).

def self.compile(sum)
lexer = LexerMW.new()
lexer.init_char_sets(CHAR_SETS)
lexer.init_state_transitions(STATE_TRANS_TABLE)

toks = lexer.tokenize(sum)

puts toks.inspect + "\n\n" + toks.map {|a,b| b}.join(' ') \
  if $DEBUG == 1

# Get the mnemonic stack by parsing the tokens.
mnemonic_stack = parse(toks)
puts "\nParsed toks => #{mnemonic_stack.inspect}" if $DEBUG == 1

# Last stage now, we convert our internal mnemonics directly
# to a byte stack in the required bytecode format.
mnemonics_to_bytecode(mnemonic_stack)

end

MNEMONIC_TO_BYTECODE = {
:tok_add => Interpreter::Ops::ADD,
:tok_subtract => Interpreter::Ops::SUB,
:tok_multiply => Interpreter::Ops::MUL,
:tok_divide => Interpreter::Ops::DIV,
:tok_modulo => Interpreter::Ops::MOD,
:tok_power => Interpreter::Ops::POW
}

This exception is raised by the mnemonic-to-bytecode method when

an integer constant cannot be pushed onto the interpreter

bytecode stack because it is too big to fit the

Interpreter::Ops::LCONST instruction.

class OutOfRangeError < StandardError
end

Convert our internal mnemonics directly to a byte array and

return this in the required bytecode format, ready to execute.

def self.mnemonics_to_bytecode(mnemonics)
bc = []
mnemonics.each do
|mnem|
if MNEMONIC_TO_BYTECODE.has_key? mnem
bc << MNEMONIC_TO_BYTECODE[mnem]
else
# Try packing this value as a 2-or 4-byte signed string
# and ensure we get back the same value on unpacking it.
if [mnem] == [mnem].pack(‘s’).unpack(‘s’)
# 2-bytes will be enough
bc << Interpreter::Ops::CONST
bc.concat([mnem].pack(‘n’).unpack(‘C*’))
elsif [mnem] == [mnem].pack(‘l’).unpack(‘l’)
# 4-bytes will be enough
bc << Interpreter::Ops::LCONST
bc.concat([mnem].pack(‘N’).unpack(‘C*’))
else
# It could be dangerous to silently fail when a
# number will not fit in a 4-byte signed int.
raise OutOfRangeError
end
end
end
bc
end

If there is a mismatch in the number of parenthesis, this

exception is raised by the #parse routine.

E.g. “3+(4-2” and “(3-10))” are both considered invalid.

class ParenthesisError < Exception
end

The operator precedence hash helps the #parse method to

decide when to store up operators and when to flush a load

out. The

PAREN_PRECEDENCE = 0
OP_PRECEDENCE = {
:tok_end => -1,
:tok_open_paren => PAREN_PRECEDENCE,
:tok_close_paren => PAREN_PRECEDENCE,
:tok_add => 1, :tok_subtract => 1,
:tok_multiply => 2, :tok_div_mod => 2,
:tok_power => 3,
:tok_negate => 4
}

Parse an array of [token,value] pairs as returned by

LexerMW::tokenize. Returns our own internal quasi-bytecode

mnemonic array.

def self.parse(tokens)
operator_stack = []
ops = []

# Push the bottom-most element with precedence equivalent to that
# of :tok_end so when we see :tok_end all pending operation
# tokens on the stack get popped
precedence_stack = [OP_PRECEDENCE[:tok_end]]

tokens.each do
  |tok, val|
  if tok == :tok_int
    # "--3".to_i => 0 is bad, so use eval("--3") => 3 instead.
    ops << eval(val)
  else
    precedence = OP_PRECEDENCE[tok]
    if not tok == :tok_open_paren
      while precedence <= precedence_stack.last &&
              precedence_stack.last > PAREN_PRECEDENCE
        # Workaround for the fact that the ** power operation
        # is calculated Right-to-left,
        # i.e. 2**3**4 == 2**(3**4) /= (2**3)**4
        break if tok == :tok_power &&
          precedence_stack.last == OP_PRECEDENCE[:tok_power]

        precedence_stack.pop
        ops << operator_stack.pop
      end
    end

    # Divide and modulo come out of the lexer as the same token
    # so override tok according to its corresponding value
    tok == :tok_div_mod && \
      tok = (val == '/') ? :tok_divide : :tok_modulo

    case tok
    when :tok_close_paren
      precedence_stack.pop == PAREN_PRECEDENCE \
        or raise ParenthesisError
    when :tok_negate
      # val contains just the minuses ('-', '--', '---', etc.)
      # Optimise out (x) === --(x) === ----(x), etc.
      if val.size % 2 == 1
        # No negate function for -(x) so simulate using 0 - (x)
        precedence_stack.push precedence
        operator_stack.push :tok_subtract
        ops << 0
      end
    when :tok_end
      raise ParenthesisError if precedence_stack.size != 1
    else
      precedence_stack.push precedence
      operator_stack.push tok unless tok == :tok_open_paren
    end
  end
end
ops

end
end

if $0 == FILE
eval DATA.read, nil, $0, LINE+4
end

END

require ‘test/unit’

class TC_Compiler < Test::Unit::TestCase
def test_simple
@test_data = [
‘8’, ‘124’, ‘32767’, # +ve CONST
‘-1’, ‘-545’, ‘-32768’, # -ve CONST
‘32768’, ‘294833’, ‘13298833’, # +ve LCONST
‘-32769’, ‘-429433’, ‘-24892810’, # -ve LCONST
‘4+5’, ‘7-3’, ‘30+40+50’, ‘14-52-125’, # ADD, SUB
‘512243+1877324’, ‘40394-12388423’, # LCONST, ADD, SUB
‘36’, '-42-90’, ‘94332*119939’, # MUL
‘8/3’, ‘-35/-15’, ‘593823/44549’, # DIV
‘8%3’, ‘243%-59’, ‘53%28%9’, # MOD
‘531%-81%14’, ‘849923%59422’, #
‘-2147483648–2147483648’, # SUB -ve LCONST
‘214’, '-413+2’ # POW
]
@test_data.each do
|sum|
assert_equal [eval(sum)],
Interpreter.new(Compiler.compile(sum)).run,
“whilst calculating ‘#{sum}’”
end
end

def test_advanced
@test_data = [
‘-(423)’, ‘-(-52332)', ‘—0’,
‘-(-(-(16**–++2)))’,
'3**(9%5-1)/3+1235349%319883+24
-3’,
‘+42’, ‘((2*-4-15/3)%16)’, ‘43((2*-4-15/3)%16)’,
‘64**-(-(-3+5)32)’, ‘4165%41341/7/2/15%15%13’,
‘–(—(43((2*-4-15/3)%16))++±410–4)’
]
@test_data.each do
|sum|
assert_equal [eval(sum)],
Interpreter.new(Compiler.compile(sum)).run,
“whilst calculating ‘#{sum}’”
end
end
end

#!/usr/bin/env ruby
##################################################################

= lexer_mw.rb - generic lexical analyser

Author:: Marcel W. <wardies ^a-t^ gmaildotcom>

Documentation:: Marcel W.

Last Modified:: Monday, 06 November 2006

Solution for Ruby Q. number 100 - http://www.rubyquiz.com/

$DEBUG = 0

If the lexer fails to find an appropriate entry in the state

transition table for the current character and state, it

raises this exception.

class LexerFailure < StandardError
end

If the lexer encounters a character for which no matching charset

has been supplied then it raises this exception.

This exception will never be raised if #init_state_transitions

has been called with an appropriate catch-all charset id.

class InvalidLexeme < StandardError
end

class LexerMW

Creates an instance of the lexer class.

lexer_eof_ascii::

defines the ASCII byte value that the lexer considers as

end-of-file when it is encountered. When #tokenize is called,

the supplied datastream is automatically appended with this

character.

def initialize(lexer_eof_ascii = 0)
@s_trans = {}
@columns = {}
@lex_eof = lexer_eof_ascii
end

Initialize the character set columns to be used by the lexer.

cs_defs::

a hash containing entries of the form id => match,

where match defines the characters to be matched and id

is the id that will be passed to the finite state machine

to inidicate the character grouping encountered.

eof_charset_id::

defines the character set identifier which the lexer will

attempt to match in the state machine table when the

end-of-file character defined in #new is encountered.

The content of match falls into one of two main categories:

regexp:: e.g. /\d/ will match any digit 0…9; or

enum:: an enumeration that describes the set of allowed

character byte values, e.g.

the array [?*, ?/, ?%] matches

*, / or %, while the range

(?a…?z) matches lowercase alphas.

e.g.

init_char_sets({

:alphanum => /[A-Z0-9]/,

:underscore => [?_],

:lower_vowel => [?a, ?e, ?i, ?o, ?u],

:special => (0…31)

},

:end_line)

It is the responsibility of the caller to ensure that the

match sets for each column are mutually exclusive.

If a ‘catch-all’ set is needed then it is not necessary

to build the set of all characters not already matched.

Instead, see #init_state_transitions parameter list.

Note, the contents of the hash is duplicated and stored

internally to avoid any inadvertent corruption from outside.

def init_char_sets(cs_defs, eof_charset_id = :eof)
@charsets = {}
# Make a verbatim copy of the lexer charset columns
cs_defs.each_pair do
|charset_id, match|
@charsets[charset_id] = match.dup # works for array/regexp
end
# Add an end-of-file charset column for free
@charsets[eof_charset_id] = [@lex_eof]
puts “@charsets =\n#{@charsets.inspect}\n\n” if $DEBUG == 1
end

Initialize the state transition table that will be used by the

finite state machine to convert incoming characters to tokens.

st::

a hash that defines the state transition table to be used

(see below).

start_state::

defines the starting state for the finite state machine.

catch_all_charset_id::

defines an optional charset id to be tried if the character

currently being analysed matches none of the charsets

in the charset table. The default +nil+ ensures that the

InvalidLexeme exception is raised if no charsets match.

The state transition table hash st maps each valid original

state to a hash containing the rules to match when in that

state.

Each hash entry rule maps one of the character set ids

(defined in the call to #init_char_sets) to the actions to be

carried out if the current character being analysed by the lexer

matches.

The action is a hash of distinct actions to be carried out for

a match. The following actions are supported:

:next_s => state::

sets the finite state machine next state to be state and

appends the current character to the lexeme string being

prepared, absorbing the current character in the datastream.

:next_s_skip => state::

as above but the lexeme string being prepared remains static.

:next_s_backtrack => state::

as for next_s_skip above but does not absorb the current

character (it will be used for the next state test).

:token => tok::

appends a hash containing a single entry to the array of

generated tokens, using tok as the key and a copy of the

prepared lexeme string as the value.

When the end of the datastream is reached, the lexer looks for

a match against charset :eof.

When the performed actions contain no +next_s+… action, the

lexer assumes that a final state has been reached and returns

the accumulated array of tokens up to that point.

e.g.

init_state_transitions({

:s1 => {:alpha => {next_s = :s2},

:period => {:token => :tok_period}},

:s2 => {:alphanum => {next_s = :s2},

:underscore => {next_s_skip == :s2},

:period => {next_s_backtrack = :s1}

:eof => {}}, // final state, return tokens

}, :s1, :other_chars)

Note, the contents of the hash is duplicated and stored

internally to avoid any inadvertent corruption from outside.

def init_state_transitions(st, start_state = :s_start,
catch_all_charset_id = nil)
@start_state = start_state
@others_key = catch_all_charset_id
@s_trans = {}
# Make a verbatim copy of the state transition table
st.each_pair do
|orig_state, lexer_rules|
@s_trans[orig_state] = state_rules = {}
lexer_rules.each_pair do
|lexer_charset, lexer_actions|
state_rules[lexer_charset] = cur_actions = {}
lexer_actions.each_pair do
|action, new_val|
cur_actions[action] = new_val
end
end
end
puts “@s_trans =\n#{@s_trans.inspect}\n\n” if $DEBUG == 1
end

Tokenize the datastream in str according to the specific

character set and state transition table initialized through

#init_char_sets and #init_state_transitions.

Returns an array of token elements where each element is

a pair of the form:

[:token_name, “extracted lexeme string”]

The end token marker [:tok_end, nil] is appended to the end

of the result on success, e.g.

tokenize(str)

# => [[:tok_a, “123”], [:tok_b, “abc”], [:tok_end, nil]]

Raises the LexerFailure exception if no matching state

transition is found for the current state and character.

def tokenize(str)
state = @start_state
lexeme = ‘’
tokens = []
# Append our end of file marker to the string to be tokenized
str += “%c” % @lex_eof
str.each_byte do
|char|
char_as_str = “%c” % char
loop do
match = @charsets.find {
|id, match|
(match.kind_of? Regexp) ?
(match =~ char_as_str) : (match.include? char)
} || [@others_key, @charsets[@others_key]] or
raise InvalidLexeme

    # Look for the action matching our current state and the
    # character set id for our current char.
    action = @s_trans[state][match.first] or raise LexerFailure

    # If found, action contains our hash of actions, e.g.
    # {:next_s_backtrack => :s_operator, :token => :tok_int}
    puts "#{char==@lex_eof?'<eof>':char_as_str}: " \
      "#{state.inspect} - #{action.inspect}" if $DEBUG == 1

    # Build up the lexeme unless we're backtracking or skipping
    lexeme << char_as_str if action.has_key? :next_s

    tokens << [action[:token], lexeme.dup] && lexeme = '' if \
      action.has_key? :token

    # Set the next state, or - when there is no specified next
    # state - we've finished, so return the tokens.
    state = action[:next_s] || action[:next_s_skip] ||
      action[:next_s_backtrack] or
         return tokens << [:tok_end, nil]

    break unless action.has_key? :next_s_backtrack
  end
end
tokens

end
end

if $0 == FILE
eval DATA.read, nil, $0, LINE+4
end

END

require ‘test/unit’

class TC_LexerMW < Test::Unit::TestCase
def test_simple
@lexer = LexerMW.new()

@char_sets = {
    :letter => (?a..?z),
    :digit => (/\d/),
    :space => [?\s, ?_]
  }

@lexer.init_char_sets(@char_sets)

@st = {
    :extract_chars => {
      :letter =>  {:next_s => :extract_chars},
      :digit =>   {:next_s => :extract_chars},
      :space =>   {:next_s_skip => :extract_chars,
                   :token => :tok_text},
      :eof =>     {:token => :tok_text}
      },
    :extract_alpha => {
      :letter =>  {:next_s => :extract_alpha},
      :digit =>   {:next_s_backtrack => :extract_num,
                   :token => :tok_alpha},
      :space =>   {:next_s_skip => :extract_alpha,
                   :token => :tok_alpha},
      :other =>   {:next_s_skip => :extract_alpha},
      :eof_exit => {}
      },
    :extract_num => {
      :letter =>  {:next_s_backtrack => :extract_alpha,
                   :token => :tok_num},
      :digit =>   {:next_s => :extract_num},
      :space =>   {:next_s_skip => :extract_num},
      :others =>  {:next_s_skip => :extract_alpha,
                   :token => :tok_num}
      }
  }
@lexer.init_state_transitions(@st, :extract_chars)
assert_equal [
    [:tok_text, "123"], [:tok_text, "45"],
    [:tok_text, "6"], [:tok_text, "78"],
    [:tok_text, "abcd"], [:tok_text, "efghi"],
    [:tok_text, "jklmn"], [:tok_end, nil]
  ], @lexer.tokenize("123 45 6_78 abcd efghi_jklmn")

@lexer = LexerMW.new(?$)
@lexer.init_char_sets(@char_sets, :eof_exit)
@lexer.init_state_transitions(@st, :extract_num, :others)
assert_equal [
    [:tok_num, "12345678"], [:tok_alpha, "abcd"],
    [:tok_alpha, "efghi"], [:tok_num, "445"],
    [:tok_alpha, ""], [:tok_num, "1222"], [:tok_end, nil]
  ], @lexer.tokenize("123 45 6_78 abcd efghi445!12_22!ab$45")

end
end

Some people started doing the Ruby quiz problems using Haskell, and
this was a perfect opportunity for me to learn some Haskell. So here’s
my solution below, in Haskell. It’s hard to test the byte code
interpretation but all the expression do evaluate to the correct
values.

If anyone has questions about the Haskell code, please let me know.
I’m just learning it and its really cool!

BTW, I too spent way more time on this than I should have!

(this solution, along with others, can be found on
Haskell Quiz/Bytecode Compiler - HaskellWiki)

This solution should work correctly. I was unable to test the byte
codes generated, for obvious reasons. However, all test strings from
the quiz do evaluate to the correct values.

To see the (symbolic) byte codes generated, run generate_tests. To see
the actual byte codes, run compile_tests. To see that the values
produced by each expression match those expected, run eval_tests. The
tests are contained in the variables test1,test2, …, test6, which
correspond to the six “test_n” methods fouind in the quiz’s test
program.

The byte codes aren’t optimized. For example, SWAP is never used.
However, they should produce correct results (even for negative and
LCONST/CONST values).

The code below is literate Haskell.

\begin{code}
import Text.ParserCombinators.Parsec hiding (parse)
import qualified Text.ParserCombinators.Parsec as P (parse)
import Text.ParserCombinators.Parsec.Expr
import Data.Bits

– Represents various operations that can be applied
– to expressions.
data Op = Plus | Minus | Mult | Div | Pow | Mod | Neg
deriving (Show, Eq)

– Represents expression we can build - either numbers or expressions
– connected by operators.
data Expression = Statement Op Expression Expression
| Val Integer
| Empty
deriving (Show)

– Define the byte codes that can be generated.
data Bytecode = NOOP | CONST Integer | LCONST Integer
| ADD
| SUB
| MUL
| POW
| DIV
| MOD
| SWAP
deriving (Show)

– Using imported Parsec.Expr library, build a parser for expressions.
expr :: Parser Expression
expr =
buildExpressionParser table factor

<?> "expression"

where
– Recognizes a factor in an expression
factor =
do{ char ‘(’
; x ← expr
; char ‘)’
; return x
}
<|> number
<?> "simple expression" -- Recognizes a number number :: Parser Expression number = do{ ds <- many1 digit ; return (Val (read ds)) } <?> “number”
– Specifies operator, associativity, precendence, and constructor to
execute
– and built AST with.
table =
[[prefix “-” (Statement Mult (Val (-1)))],
[binary “^” (Statement Pow) AssocRight],
[binary “*” (Statement Mult) AssocLeft, binary “/” (Statement
Div) AssocLeft, binary “%” (Statement Mod) AssocLeft],
[binary “+” (Statement Plus) AssocLeft, binary “-” (Statement
Minus) AssocLeft]
]
where
binary s f assoc
= Infix (do{ string s; return f}) assoc
prefix s f
= Prefix (do{ string s; return f})

– Parses a string into an AST, using the parser defined above
parse s = case P.parse expr “” s of
Right ast → ast
Left e → error $ show e

– Take AST and evaluate (mostly for testing)
eval (Val n) = n
eval (Statement op left right)
| op == Mult = eval left * eval right
| op == Minus = eval left - eval right
| op == Plus = eval left + eval right
| op == Div = eval left div eval right
| op == Pow = eval left ^ eval right
| op == Mod = eval left mod eval right

– Takes an AST and turns it into a byte code list
generate stmt = generate’ stmt []
where
generate’ (Statement op left right) instr =
let
li = generate’ left instr
ri = generate’ right instr
lri = li ++ ri
in case op of
Plus → lri ++ [ADD]
Minus → lri ++ [SUB]
Mult → lri ++ [MUL]
Div → lri ++ [DIV]
Mod → lri ++ [MOD]
Pow → lri ++ [POW]
generate’ (Val n) instr =
if abs(n) > 32768
then instr ++ [LCONST n]
else instr ++ [CONST n]

– Takes a statement and converts it into a list of actual bytes to
– be interpreted
compile s = toBytes (generate $ parse s)

– Convert a list of byte codes to a list of integer codes. If LCONST or
CONST
– instruction are seen, correct byte representantion is produced
toBytes ((NOOP):xs) = 0 : toBytes xs
toBytes ((CONST n):xs) = 1 : (toConstBytes (fromInteger n)) ++ toBytes
xs
toBytes ((LCONST n):xs) = 2 : (toLConstBytes (fromInteger n)) ++ toBytes
xs
toBytes ((ADD):xs) = 0x0a : toBytes xs
toBytes ((SUB):xs) = 0x0b : toBytes xs
toBytes ((MUL):xs) = 0x0c : toBytes xs
toBytes ((POW):xs) = 0x0d : toBytes xs
toBytes ((DIV):xs) = 0x0e : toBytes xs
toBytes ((MOD):xs) = 0x0f : toBytes xs
toBytes ((SWAP):xs) = 0x0a : toBytes xs
toBytes [] = []

– Convert number to CONST representation (2 element list)
toConstBytes n = toByteList 2 n
toLConstBytes n = toByteList 4 n

– Convert a number into a list of 8-bit bytes (big-endian/network byte
order).
– Make sure final list is size elements long
toByteList :: Bits Int => Int → Int → [Int]
toByteList size n =
if (length bytes) < size
then (replicate (size - (length bytes)) 0) ++ bytes
else bytes
where
bytes = reverse $ toByteList’ n
– for negative, and with signed bit and remove negative. Then
continue recursion.
toByteList’ 0 = []
toByteList’ a | a < 0 = (a .&. 511) : toByteList’ (abs(a) shiftR
8)
| otherwise = (a .&. 255) : toByteList’ (a shiftR
8)

– All tests defined by the quiz, with the associated values they
should evaluate to.
test1 = [(2+2, “2+2”), (2-2, “2-2”), (22, "22"), (2^2, “2^2”), (2
div 2, “2/2”),
(2 mod 2, “2%2”), (3 mod 2, “3%2”)]

test2 = [(2+2+2, “2+2+2”), (2-2-2, “2-2-2”), (222, “222”), (2^2^2,
“2^2^2”), (4 div 2 div 2, “4/2/2”),
(7mod2mod1, “7%2%1”)]

test3 = [(2+2-2, “2+2-2”), (2-2+2, “2-2+2”), (22+2, "22+2"), (2^2+2,
“2^2+2”),
(4 div 2+2, “4/2+2”), (7mod2+1, “7%2+1”)]

test4 = [(2+(2-2), “2+(2-2)”), (2-(2+2), “2-(2+2)”), (2+(22),
"2+(2
2)"), (2*(2+2), “2*(2+2)”),
(2^(2+2), “2^(2+2)”), (4 div (2+2), “4/(2+2)”), (7mod(2+1),
“7%(2+1)”)]

test5 = [(-2+(2-2), “-2+(2-2)”), (2-(-2+2), “2-(-2+2)”), (2+(2 * -2),
“2+(2*-2)”)]

test6 = [((3 div 3)+(8-2), “(3/3)+(8-2)”), ((1+3) div (2 div
2)(10-8), "(1+3)/(2/2)(10-8)“),
((13)4(56), “(13)4(56)”), ((10mod3)(2+2),
"(10%3)
(2+2)”), (2^(2+(3 div 2)^2), “2^(2+(3/2)^2)”),
((10 div (2+3)4), "(10/(2+3)4)"), (5+((54)mod(2+1)),
"5+((5
4)%(2+1))")]

– Evaluates the tests and makes sure the expressions match the expected
values
eval_tests = map eval_tests [test1, test2, test3, test4, test5, test6]
where
eval_tests ((val, stmt):ts) =
let eval_val = eval $ parse stmt
in
if val == eval_val
then “True” : eval_tests ts
else (stmt ++ " evaluated incorrectly to " ++ show eval_val ++
" instead of " ++ show val) : eval_tests ts
eval_tests [] = []

– Takes all the tests and displays symbolic bytes codes for each
generate_tests = map generate_all [test1,test2,test3,test4,test5,test6]
where generate_all ((val, stmt):ts) = generate (parse stmt) :
generate_all ts
generate_all [] = []

– Takes all tests and generates a list of bytes representing them
compile_tests = map compile_all [test1,test2,test3,test4,test5,test6]
where compile_all ((val, stmt):ts) = compile stmt : compile_all ts
compile_all [] = []

\end{code}

Here’s my effort. I really enjoyed this quiz but spent way more than
the 1 hour on it (and it’s still not correct)!

I tokenized the expression with a regex and then used the Shunting
Yard Algorithm to do the parsing. However, I fail some of tests -

  1. Unary minus [-2+(2-2)]

  2. Order of evaluation in this expression is right–>left as opposed
    to left–>right and gives wrong result [(1+3)/(2/2)*(10-8)]

  3. The optional efficient storage test.

I’ll try to fix these if I have more time.

Thanks

Bob

require “Interpreter.rb”

class Compiler

@PRECEDENCE = {
‘(’ => -1,
‘)’ => -1,
‘+’ => 0,
‘-’ => 0,
‘*’ => 1,
‘/’ => 1,
‘**’ => 3,
‘%’ => 3
}

@BYTECODE = {
‘CONST’ => 0x01,
‘LCONST’ => 0x02,
‘+’ => 0x0a,
‘-’ => 0x0b,
‘*’ => 0x0c,
‘**’ => 0x0d,
‘/’ => 0x0e,
‘%’ => 0x0f,
‘SWAP’ => 0xa0
}

def Compiler.compile(expression)
te = self.tokenize(expression)
be = self.parse(te)
self.to_bytecode(be)
end

def Compiler.tokenize(expression)
tokenized_expression = []

 expression.gsub!(/\s+/, "")
 expression.scan(/(\d+|\(|\)|\+|-|\*\*|\*|\/|\%)/) { |e|

tokenized_expression.push(e) }
tokenized_expression
end

def Compiler.parse(tokenized_expression)
output_queue, operator_stack = [], []
operator = nil

 tokenized_expression.each { |token|

   # If token is a number, place on output queue
   if (token[0] =~ /^\d+$/)
     output_queue.push(token[0].to_i)
   elsif (token[0] == '(')
     operator_stack.push(token[0])
   elsif (token[0] == ')')
     # Pop operators off stack and onto output queue until left
     # bracket encountered
     while (operator != '(')
       if ((operator = operator_stack.pop) != '(')
           output_queue.push(operator)
       end
     end
   else
     # If there are any operators, check precedence of current token
     # against last operator on queue.  If the operator on queue is
     # more important, add it to the output before pushing the

current
# operator on
if (operator_stack.any? && (@PRECEDENCE[token[0]] <=
@PRECEDENCE[operator_stack.last]))
output_queue.push(operator_stack.pop)
end
operator_stack.push(token[0])
end
}

 # Add the remaining operators to end of the output queue
 operator_stack.reverse_each { |operator|
   output_queue.push(operator)
 }

 output_queue

end

def Compiler.to_bytecode(bnf_expression)
stack = []

 bnf_expression.delete("(")
 bnf_expression.each { |token|
   case token
     when Integer
        # If number is small enough, use smaller 2 byte storage

       if ((token >= -32768) && (token <= 32767))
         stack.push(@BYTECODE['CONST'])
         stack.push(token >> 8, token)
       else
         stack.push(@BYTECODE['LCONST'])
         stack.push(token >> 24, token >> 16, token >> 8, token)
       end
     else
       stack.push(@BYTECODE[token])
   end
 }
 stack

end

end

require “TestByteCode.rb”

This quiz has turned out to have some interesting solutions, I’m looking
forward to writing up the summary :slight_smile: For the record, here’s the sample
solution I made when I suggested the quiz to James - it takes a similar
approach to some of the other solutions in letting Ruby’s parser do the
heavy lifting. I pulled _why’s Sandbox into the mix again to let me keep
dangerously modified core classes on a tight rein.

— compiler.rb

require ‘sandbox’

class Compiler
class << self
def sb
unless @sb
@sb = Sandbox.new

    @sb.eval <<-EOC
      class Object
        private
        def ldconsts(o)
          if (o > -32769) && (o < 32768)
            [].push(0x01, *[o].pack('n').unpack('C*'))
          else
            [].push(0x02, *[o].pack('N').unpack('C*'))
          end
        end
      end

      class Fixnum
        def +(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0x0a)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0a)
          end
        end

        def -(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0xa0, 0x0b)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0b)
          end
        end

        def *(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0x0c)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0c)
          end
        end

        def **(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0xa0, 0x0d)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0d)
          end
        end

        def /(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0xa0, 0x0e)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0e)
          end
        end

        def %(o)
          if o.is_a? Array
            o.push(*ldconsts(self)).push(0xa0, 0x0f)
          else
            ldconsts(self).push(*ldconsts(o)).push(0x0f)
          end
        end
      end

      class Array
        def +(o)
          if o.is_a? Array
            o.push(*self).push(0x0a)
          else
            self.push(*ldconsts(o)).push(0x0a)
          end
        end

        def -(o)
          if o.is_a? Array
            o.push(*self).push(0xa0, 0x0b)
          else
            self.push(*ldconsts(o)).push(0x0b)
          end
        end

        def *(o)
          if o.is_a? Array
            o.push(*self).push(0x0c)
          else
            self.push(*ldconsts(o)).push(0x0c)
          end
        end

        def **(o)
          if o.is_a? Array
            o.push(*self).push(0xa0, 0x0d)
          else
            self.push(*ldconsts(o)).push(0x0d)
          end
        end

        def /(o)
          if o.is_a? Array
            o.push(*self).push(0xa0, 0x0e)
          else
            self.push(*ldconsts(o)).push(0x0e)
          end
        end

        def %(o)
          if o.is_a? Array
            o.push(*self).push(0xa0, 0x0f)
          else
            self.push(*ldconsts(o)).push(0x0f)
          end
        end
      end
    EOC
  end

  @sb
end

def compile(code)
  [*sb.eval(code)]
end

end
end

My 2nd solution, it’s the same as the first except i stole the
pack/unpack stuff.

class Compiler
def self.compile(s)
eval(s.gsub(/(\d+)([^\d])/,’\1.bc\2’).gsub(/([^\d])(\d+)$/,’\1\2.bc’))
end
end

class Fixnum
def bc
lead,pt = ( (-215…215)===self ? [1,‘n’] : [2,‘N’] )
[lead].concat([self].pack(pt).unpack(‘C*’))
end
end

class Array
{:+ => 10,:- => 11,:* => 12,:** => 13,:confused: => 14,:% => 15}.each do
|op,code|
define_method(op) { |x| self.concat(x).concat([code]) }
end
end