We saw a large variety of solutions for this week’s problem. Many of
them used
a parser generator to construct their parser. You do that by defining a
grammar
that describes the syntax you need to read. The parser generator then
translates your grammar into parsing code that will match the described
syntax.
Of the generators used, Treetop was definitely the most popular and is
surely
worth a look if you want to do some grammar based parsing.
I’m not going to show a grammar based solution here, not because I don’t
like
them, but because I want to show a few of the ideas behind simple
parsing. To
do that, we will need to examine a hand-rolled solution. Just to be
clear
though, using grammar based parsers can often be a more robust choice if
you can
find an official grammar for the content you need to parse.
Eric M. sent in a hand-rolled solution that has quite a few
advantages.
First, it is trivial to adapt so that the entire content to be parsed
doesn’t
need to be read into memory. It’s also some very efficient code.
Unfortunately, that makes it a touch less obvious if you aren’t already
familiar
with parsing techniques.
Given that, I’m going to show my own hand-rolled recursive descent
parser. It’s
not as cool as Eric’s but it’s a little easier to take in. We will say
it’s a
good introduction to Eric’s code, which you should be able to follow
after I
break this down.
Here’s the beginning of my parser:
#!/usr/bin/env ruby -wKU
require “strscan”
class JSONParser
AST = Struct.new(:value)
def parse(input)
@input = StringScanner.new(input)
parse_value.value
ensure
@input.eos? or error("Unexpected data")
end
private
# ...
One of the first concerns when parsing is the need to manage where you
currently
are in the input. If you treat the input as an IO object, you can read
input
piece by piece and the IO object itself naturally keeps track of where
you are.
For String input though, it’s often easier to use Ruby’s standard
StringScanner
class. It wraps the input and allows you to test regular expression
matches
against the head of that input. It will tell you when they match or
don’t and
when they do, your position automatically advances forward beyond that
match.
You can see that I set this up in the code above.
The only public method for my class is parse(). It prepares the
StringScanner
as I just described, tries to parse a JSON value out of the input, then
makes
sure we consumed all of the input. Note that my use of ensure here
isn’t very
standard. I’m just using it to run some code at the end of the method
without
changing the return value of the method.
The AST definition also merits a bit of discussion. It would have been
nice to
just have each method return the Ruby objects for the JSON it parsed.
However,
false and nil (null in JSON) are legal JSON parses in some places. If I
return
those, I won’t be able to tell if my parse succeeded or failed. To get
around
that, all parsed JSON values are wrapped in a trivial abstract syntax
tree
object. Returning this object is always true and, after I’ve verified
that the
parse worked, it’s just one more method call to retrieve the actual
value it
parsed.
It’s worth mentioning that this parser is based on the not quite correct
definition of JSON I put forth in the quiz tests. Only objects and
arrays are
allowed to be top-level JSON values. An easy fix is to replace this
line
# ...
parse_value.value
# ...
with:
# ...
if top_level = parse_object || parse_array
top_level.value
else
error("Illegal top-level JSON object")
end
# ...
Now let’s look at the main parser:
# ...
def parse_value
trim_space
parse_object or
parse_array or
parse_string or
parse_number or
parse_keyword or
error("Illegal JSON value")
ensure
trim_space
end
# ...
This method really sums up the basic strategy of recursive descent
parsing. At
each point, try to read out one of the legal values that can occur
there. You
can do that just by drilling down into more specialized methods that
know how to
read that one thing. If at any time you can’t read a legal value, you
have an
error.
Let’s dig into the specialized parsers a bit more to see how this works:
# ...
def parse_object
if @input.scan(/\{\s*/)
object = Hash.new
more_pairs = false
while key = parse_string
@input.scan(/\s*:\s*/) or error("Expecting object separator")
object[key.value] = parse_value.value
more_pairs = @input.scan(/\s*,\s*/) or break
end
error("Missing object pair") if more_pairs
@input.scan(/\s*\}/) or error("Unclosed object")
AST.new(object)
else
false
end
end
# ...
This code reads JSON objects. It’s pretty linear, so let’s digest it in
order.
First, we have to have an opening brace or we don’t have an object at
all. We
can see here that I try a regular expression on the StringScanner to see
if that
is indeed what’s up next. If it is scan() will return true and @input
will
advance past that brace for our future matches. If it’s false, we can’t
read an
object and the whole attempt is a bust.
When we know we’re inside an object, we create the Ruby equivalent (a
Hash),
fill it with all of the string/value pairs we can read, then make sure
we find a
closing brace. Reading the pairs is the trickiest part because we have
to match
a string, followed by a colon, and finally a value. Then, if we find a
comma,
we know another pair is expected. If not, we’ve read the whole object.
Note
that I verify these assumptions at every step and toss error()s if any
of them
fail. For parsing strings and values, we just reuse the parse_string()
method
we first saw called in parse_value() and parse_value() itself.
You can see that I’m constantly trimming space around the JSON syntax.
I could
have also done that with repeated calls to the trim_space() helper we
saw used
in parse_value(), but that fattens up the code a bit and slows things
down with
more tests. For these simple cases, I opted for the shortcut.
Having deciphered parse_object(), parse_array() is trivial:
# ...
def parse_array
if @input.scan(/\[\s*/)
array = Array.new
more_values = false
while contents = parse_value rescue nil
array << contents.value
more_values = @input.scan(/\s*,\s*/) or break
end
error("Missing value") if more_values
@input.scan(/\s*\]/) or error("Unclosed array")
AST.new(array)
else
false
end
end
# ...
This is identical to the process we just examined save that it’s pulling
individual values in the middle instead of string/value pairs. This
simplifies
the code a bit, as you can see. We also throw these objects into a Ruby
Array
instead of a Hash.
Now we are ready for parse_string() and it has a couple of helpers:
# ...
def parse_string
if @input.scan(/"/)
string = String.new
while contents = parse_string_content || parse_string_escape
string << contents.value
end
@input.scan(/"/) or error("Unclosed string")
AST.new(string)
else
false
end
end
def parse_string_content
@input.scan(/[^\\"]+/) and AST.new(@input.matched)
end
def parse_string_escape
if @input.scan(%r{\\["\\/]})
AST.new(@input.matched[-1])
elsif @input.scan(/\\[bfnrt]/)
AST.new(eval(%Q{"#{@input.matched}"}))
elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
else
false
end
end
# ...
Whenever a structure you need to read gets more complicated, you want to
break
it down into smaller parsers that read more specialized pieces. Some
probably
would have broken down the the string/value pairs from object (into a
parse_object_pair()), but you don’t gain much for that and it was just
simple
enough that I opted for the easier approach. Here though we need to
handle
content and escapes differently and there may be any combination of them
in any
order inside the string. Now the gain is worth it.
Content is easy enough to handle, since we can pass it through
unaltered. It’s
already content in a Ruby String object. Escapes we have to work on a
little
more. Some we just unescape, but others need to be converted. I used
pack() to
handle Unicode, but you can see that I was lazy and used eval() on the
special
string escapes. All of these have the same meaning in Ruby and thanks
to the
match I know it’s safe to eval() the contents without worrying about
embedded
Ruby nastiness.
With those defined, parse_string() is similar to parse_array(). Find
the start
of a JSON string, pull content and escapes as long as we can, then find
the end
of the string.
The last two parsers are the easiest of all:
# ...
def parse_number
@input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
AST.new(eval(@input.matched))
end
def parse_keyword
@input.scan(/\b(?:true|false|null)\b/) and
AST.new(eval(@input.matched.sub("null", "nil")))
end
# ...
These are just match and eval() as you can plainly see. Again the
evals() are
safe because the match ensures we aren’t facing any dangerous content.
Some feel that using regular expressions like this isn’t true parsing.
We
certainly could have chopped the number rule down into a bunch of
smaller rules.
However, the number definition is squarely in the domain of what regular
expressions do well and I’m more of a practical kind of guy. I have
access to
regular expressions with this setup, the needed expression isn’t really
all that
complex, and I like easy jobs. Thus I use them.
All that is left are the two helpers I used, though the implementations
won’t be
any surprise:
# ...
def trim_space
@input.scan(/\s+/)
end
def error(message)
if @input.eos?
raise "Unexpected end of input."
else
raise "#{message}: #{@input.peek(@input.string.length)}"
end
end
end
First, trim_space() can just try a match to advance us pass any
whitespace. It
may fail, because there wasn’t any whitespace to skip, but that doesn’t
affect
us any. We know that we aren’t about to read whitespace after it is
called,
either way.
My error() wrapper just raise()s exceptions. It adds the content left
to parse
so you can see where you had trouble or replaces the message altogether
to warn
you that all of the content was exhausted.
That’s all it takes to build a simple JSON parser. Take some time to go
look
through the other hand-rolled solutions now and I bet you’ll be
surprised by how
similar they work. Then you can look into grammars and how they
simplify the
process of defining new grammars.
The final Ruby Q. will take us back into the world of finance…