Parsing JSON (#155)

We saw a large variety of solutions for this week’s problem. Many of them used a parser generator to construct their parser. You do that by defining a grammar that describes the syntax you need to read. The parser generator then translates your grammar into parsing code that will match the described syntax. Of the generators used, Treetop was definitely the most popular and is surely worth a look if you want to do some grammar-based parsing.

I’m not going to show a grammar-based solution here, not because I don’t like them, but because I want to show a few of the ideas behind simple parsing. To do that, we will need to examine a hand-rolled solution. Just to be clear though, using grammar-based parsers can often be a more robust choice if you can find an official grammar for the content you need to parse.

Eric M. sent in a hand-rolled solution that has quite a few advantages. First, it is trivial to adapt so that the entire content to be parsed doesn’t need to be read into memory. It’s also some very efficient code. Unfortunately, that makes it a touch less obvious if you aren’t already familiar with parsing techniques.

Given that, I’m going to show my own hand-rolled recursive descent parser. It’s not as cool as Eric’s, but it’s a little easier to take in. We will say it’s a good introduction to Eric’s code, which you should be able to follow after I break this down.

Here’s the beginning of my parser:

#!/usr/bin/env ruby -wKU

require "strscan"

class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input)
    parse_value.value
  ensure
    @input.eos? or error("Unexpected data")
  end

  private

  # ...

One of the first concerns when parsing is the need to manage where you currently are in the input. If you treat the input as an IO object, you can read input piece by piece and the IO object itself naturally keeps track of where you are. For String input though, it’s often easier to use Ruby’s standard StringScanner class. It wraps the input and allows you to test regular expression matches against the head of that input. It will tell you when they match or don’t and, when they do, your position automatically advances forward beyond that match. You can see that I set this up in the code above.
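As a quick illustration (separate from the parser itself), here is the scan-and-advance behavior just described:

```ruby
require "strscan"

scanner = StringScanner.new("true, false")
scanner.scan(/true/)  # matches and advances past "true"
scanner.scan(/true/)  # no match at the new position; returns nil
scanner.scan(/,\s*/)  # consumes the separator
scanner.rest          # => "false"
```

Each successful scan() moves the scanner forward, so the same pattern can match or fail depending on where you are in the input.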

The only public method for my class is parse(). It prepares the StringScanner as I just described, tries to parse a JSON value out of the input, then makes sure we consumed all of the input. Note that my use of ensure here isn’t very standard. I’m just using it to run some code at the end of the method without changing the return value of the method.

The AST definition also merits a bit of discussion. It would have been nice to just have each method return the Ruby objects for the JSON it parsed. However, false and nil (null in JSON) are legal JSON parses in some places. If I return those, I won’t be able to tell if my parse succeeded or failed. To get around that, all parsed JSON values are wrapped in a trivial abstract syntax tree object. The returned object is always true in boolean terms and, after I’ve verified that the parse worked, it’s just one more method call to retrieve the actual value it parsed.
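A tiny illustration of the problem the wrapper solves, using the same Struct:

```ruby
AST = Struct.new(:value)

# A bare false result is indistinguishable from a failed parse:
bare = false
bare ? "parsed" : "failed"     # => "failed", even though the parse succeeded

# Wrapped, the result is always true in boolean terms:
wrapped = AST.new(false)
wrapped ? "parsed" : "failed"  # => "parsed"
wrapped.value                  # => false
```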

It’s worth mentioning that this parser is based on the not-quite-correct definition of JSON I put forth in the quiz tests. Only objects and arrays are allowed to be top-level JSON values. An easy fix is to replace this line

  # ...

  parse_value.value

  # ...

with:

  # ...

  if top_level = parse_object || parse_array
    top_level.value
  else
    error("Illegal top-level JSON object")
  end

  # ...

Now let’s look at the main parser:

# ...

def parse_value
  trim_space
  parse_object  or
  parse_array   or
  parse_string  or
  parse_number  or
  parse_keyword or
  error("Illegal JSON value")
ensure
  trim_space
end

# ...

This method really sums up the basic strategy of recursive descent parsing. At each point, try to read out one of the legal values that can occur there. You can do that just by drilling down into more specialized methods that know how to read that one thing. If at any time you can’t read a legal value, you have an error.

Let’s dig into the specialized parsers a bit more to see how this works:

# ...

def parse_object
  if @input.scan(/\{\s*/)
    object     = Hash.new
    more_pairs = false
    while key = parse_string
      @input.scan(/\s*:\s*/) or error("Expecting object separator")
      object[key.value] = parse_value.value
      more_pairs = @input.scan(/\s*,\s*/) or break
    end
    error("Missing object pair") if more_pairs
    @input.scan(/\s*\}/) or error("Unclosed object")
    AST.new(object)
  else
    false
  end
end

# ...

This code reads JSON objects. It’s pretty linear, so let’s digest it in order. First, we have to have an opening brace or we don’t have an object at all. We can see here that I try a regular expression on the StringScanner to see if that is indeed what’s up next. If it is, scan() will return a true value and @input will advance past that brace for our future matches. If it’s not, we can’t read an object and the whole attempt is a bust.

When we know we’re inside an object, we create the Ruby equivalent (a Hash), fill it with all of the string/value pairs we can read, then make sure we find a closing brace. Reading the pairs is the trickiest part because we have to match a string, followed by a colon, and finally a value. Then, if we find a comma, we know another pair is expected. If not, we’ve read the whole object. Note that I verify these assumptions at every step and toss error()s if any of them fail. For parsing strings and values, we just reuse the parse_string() method we first saw called in parse_value() and parse_value() itself.

You can see that I’m constantly trimming space around the JSON syntax. I could have also done that with repeated calls to the trim_space() helper we saw used in parse_value(), but that fattens up the code a bit and slows things down with more tests. For these simple cases, I opted for the shortcut.

Having deciphered parse_object(), parse_array() is trivial:

# ...

def parse_array
  if @input.scan(/\[\s*/)
    array       = Array.new
    more_values = false
    while contents = parse_value rescue nil
      array << contents.value
      more_values = @input.scan(/\s*,\s*/) or break
    end
    error("Missing value") if more_values
    @input.scan(/\s*\]/) or error("Unclosed array")
    AST.new(array)
  else
    false
  end
end

# ...

This is identical to the process we just examined, save that it’s pulling individual values in the middle instead of string/value pairs. This simplifies the code a bit, as you can see. We also throw these objects into a Ruby Array instead of a Hash.

Now we are ready for parse_string() and it has a couple of helpers:

# ...

def parse_string
  if @input.scan(/"/)
    string = String.new
    while contents = parse_string_content || parse_string_escape
      string << contents.value
    end
    @input.scan(/"/) or error("Unclosed string")
    AST.new(string)
  else
    false
  end
end

def parse_string_content
  @input.scan(/[^\\"]+/) and AST.new(@input.matched)
end

def parse_string_escape
  if @input.scan(%r{\\["\\/]})
    AST.new(@input.matched[-1])
  elsif @input.scan(/\\[bfnrt]/)
    AST.new(eval(%Q{"#{@input.matched}"}))
  elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
    AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
  else
    false
  end
end

# ...

Whenever a structure you need to read gets more complicated, you want to break it down into smaller parsers that read more specialized pieces. Some probably would have broken down the string/value pairs from object parsing (into a parse_object_pair()), but you don’t gain much from that and it was just simple enough that I opted for the easier approach. Here though, we need to handle content and escapes differently and there may be any combination of them in any order inside the string. Now the gain is worth it.

Content is easy enough to handle, since we can pass it through unaltered. It’s already content in a Ruby String object. Escapes we have to work on a little more. Some we just unescape, but others need to be converted. I used pack() to handle Unicode, but you can see that I was lazy and used eval() on the special string escapes. All of these have the same meaning in Ruby and, thanks to the match, I know it’s safe to eval() the contents without worrying about embedded Ruby nastiness.
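Here is a small sketch of the three conversions, fed with hypothetical matched text (the same strings @input.matched would hold at each branch):

```ruby
simple = '\\"'                # matched text for a simple escape
simple[-1]                    # => "\"" (just drop the backslash)

special = '\\n'               # matched text for \b, \f, \n, \r, or \t
eval(%Q{"#{special}"})        # => "\n" (Ruby shares these escape meanings)

unicode = '\\u263a'           # matched text for a Unicode escape
[Integer("0x#{unicode[2..-1]}")].pack("U")  # => the U+263A character
```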

With those defined, parse_string() is similar to parse_array(). Find the start of a JSON string, pull content and escapes as long as we can, then find the end of the string.

The last two parsers are the easiest of all:

# ...

def parse_number
  @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
  AST.new(eval(@input.matched))
end

def parse_keyword
  @input.scan(/\b(?:true|false|null)\b/) and
  AST.new(eval(@input.matched.sub("null", "nil")))
end

# ...

These are just match-and-eval(), as you can plainly see. Again, the eval()s are safe because the match ensures we aren’t facing any dangerous content.
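To make the safety argument concrete, here is a sketch using the same number pattern (the helper name is made up for illustration): only text the pattern matches in full ever reaches eval(), so arbitrary Ruby can’t sneak through.

```ruby
NUMBER = /-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?/

def safe_number(text)
  # eval() only runs on input the pattern fully matched
  text =~ /\A#{NUMBER}\z/ ? eval(text) : nil
end

safe_number("-12.5e3")      # => -12500.0
safe_number("0")            # => 0
safe_number("system('ls')") # => nil (rejected, never eval'd)
```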

Some feel that using regular expressions like this isn’t true parsing. We certainly could have chopped the number rule down into a bunch of smaller rules. However, the number definition is squarely in the domain of what regular expressions do well and I’m more of a practical kind of guy. I have access to regular expressions with this setup, the needed expression isn’t really all that complex, and I like easy jobs. Thus I use them.

All that is left are the two helpers I used, though the implementations won’t be any surprise:

# ...

def trim_space
  @input.scan(/\s+/)
end

def error(message)
  if @input.eos?
    raise "Unexpected end of input."
  else
    raise "#{message}:  #{@input.peek(@input.string.length)}"
  end
end

end

First, trim_space() can just try a match to advance us past any whitespace. It may fail, because there wasn’t any whitespace to skip, but that doesn’t affect us any. Either way, we know that we aren’t about to read whitespace after it is called.

My error() wrapper just raise()s exceptions. It adds the content left to parse so you can see where you had trouble, or replaces the message altogether to warn you that all of the content was exhausted.
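For completeness, here are all of the snippets above assembled into one file (lightly condensed, same behavior), with a quick run at the bottom:

```ruby
require "strscan"

class JSONParser
  AST = Struct.new(:value)

  def parse(input)
    @input = StringScanner.new(input)
    parse_value.value
  ensure
    @input.eos? or error("Unexpected data")
  end

  private

  def parse_value
    trim_space
    parse_object or parse_array or parse_string or
      parse_number or parse_keyword or error("Illegal JSON value")
  ensure
    trim_space
  end

  def parse_object
    return false unless @input.scan(/\{\s*/)
    object, more_pairs = Hash.new, false
    while key = parse_string
      @input.scan(/\s*:\s*/) or error("Expecting object separator")
      object[key.value] = parse_value.value
      more_pairs = @input.scan(/\s*,\s*/) or break
    end
    error("Missing object pair") if more_pairs
    @input.scan(/\s*\}/) or error("Unclosed object")
    AST.new(object)
  end

  def parse_array
    return false unless @input.scan(/\[\s*/)
    array, more_values = Array.new, false
    while contents = (parse_value rescue nil)
      array << contents.value
      more_values = @input.scan(/\s*,\s*/) or break
    end
    error("Missing value") if more_values
    @input.scan(/\s*\]/) or error("Unclosed array")
    AST.new(array)
  end

  def parse_string
    return false unless @input.scan(/"/)
    string = String.new
    while contents = parse_string_content || parse_string_escape
      string << contents.value
    end
    @input.scan(/"/) or error("Unclosed string")
    AST.new(string)
  end

  def parse_string_content
    @input.scan(/[^\\"]+/) and AST.new(@input.matched)
  end

  def parse_string_escape
    if    @input.scan(%r{\\["\\/]}) then AST.new(@input.matched[-1])
    elsif @input.scan(/\\[bfnrt]/)  then AST.new(eval(%Q{"#{@input.matched}"}))
    elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
      AST.new([Integer("0x#{@input.matched[2..-1]}")].pack("U"))
    else
      false
    end
  end

  def parse_number
    @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\b/) and
      AST.new(eval(@input.matched))
  end

  def parse_keyword
    @input.scan(/\b(?:true|false|null)\b/) and
      AST.new(eval(@input.matched.sub("null", "nil")))
  end

  def trim_space
    @input.scan(/\s+/)
  end

  def error(message)
    raise @input.eos? ? "Unexpected end of input." :
          "#{message}:  #{@input.peek(@input.string.length)}"
  end
end

p JSONParser.new.parse('{"name": "Ruby", "list": [1, 2.5, true, null]}')
# prints the equivalent Ruby Hash: {"name"=>"Ruby", "list"=>[1, 2.5, true, nil]}
```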

That’s all it takes to build a simple JSON parser. Take some time to go look through the other hand-rolled solutions now and I bet you’ll be surprised by how similarly they work. Then you can look into grammars and how they simplify the process of defining new parsers.

The final Ruby Q. will take us back into the world of finance…

On Feb 7, 2008 8:08 AM, Ruby Q. [email protected] wrote:

We saw a large variety of solutions for this week’s problem. Many of them used a parser generator to construct their parser. You do that by defining a grammar that describes the syntax you need to read. The parser generator then translates your grammar into parsing code that will match the described syntax. Of the generators used, Treetop was definitely the most popular and is surely worth a look if you want to do some grammar based parsing.

But not nearly the fastest… My very old Grammar class already generates parsers 5-10X faster than Treetop, and my next release will give another 5-10X, plus using ruby2cext will give another 5-10X (it generates code that optimizes well with ruby2cext). Only a pure-C parser would beat it. I’ve had this stuff sitting around locally for too long and need to release.

… i’ll get off my (biased) soapbox now …

Eric M. sent in a hand-rolled solution that has quite a few advantages. First, it is trivial to adapt so that the entire content to be parsed doesn’t need to be read into memory. It’s also some very efficient code.

Computationally, I believe this is the simplest (most efficient) type: LL(1). It parses “L”eft-to-right and derives the “L”eftmost part of the grammar using (1) token/character of lookahead. When the token is a character, it becomes very easy to parse a file (you just need a variable to hold the next character in the file, previously read with IO.getc).

Unfortunately, that makes it a touch less obvious if you aren’t already familiar with parsing techniques.

Yep. When rolling by hand, Regexp makes it a bit easier to do recursive descent parsers, since you can easily do more than just a character at a time. If no Regexp has unlimited match length, you’ll have an LL(k) parser where k is that max match length. You could adapt to reading from a file by keeping a string buffer of the next (k) characters in the file. If you have a Regexp that can have unlimited match length, I think you might call that an LL(*) parser. You’ll have more problems parsing from a file in this case, unless possibly you can keep all matches contained in a line (IO.gets would act as a mini-parser looking for newline).

Look here if you want to see more about LL parsers:

http://en.wikipedia.org/wiki/LL_parser

…after I’ve verified that the parse worked, it’s just one more method call to retrieve the actual value it parsed.

The approach I like to take is to have the caller pass in the AST to be modified. I just use a sequence (Array or even String) to represent the sequence of branches at that level. Then each grammar method can put none to an arbitrary number of branches in the AST. There is much more flexibility. The grammar method just returns true or false for whether it matches, independent of the AST. Some parsers don’t even need an AST (or only a sparse one) since things might be done on the fly as parsing occurs. Passing the AST separately works in all these cases.
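A minimal sketch of that caller-passed AST approach (hypothetical names, not Eric’s actual code): each grammar method appends whatever branches it produces to a supplied sequence and returns only true/false for whether it matched.

```ruby
require "strscan"

# A grammar method in this style: it may add none to many branches
# to the caller's AST, and its return value is only match success.
def parse_digits(scanner, ast)
  scanner.scan(/\d+/) or return false
  ast << Integer(scanner.matched)  # push a branch into the caller's AST
  true
end

ast = []
parse_digits(StringScanner.new("42 rest"), ast)  # => true
ast                                              # => [42]
```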

On Feb 8, 2008, at 12:27 PM, Eric M. wrote:

translates your grammar into parsing code that will match the described syntax. Of the generators used, Treetop was definitely the most popular and is surely worth a look if you want to do some grammar based parsing.

But not nearly the fastest…

That’s true, but I think the need for raw speed in parsers is seldom
the top priority. Large XML can often cause speed issues, but outside
of that I don’t personally need many super fast parsers. Maybe I just
don’t run into those problems a lot though.

James Edward G. II

James G. wrote:

On Feb 8, 2008, at 12:27 PM, Eric M. wrote:

Treetop was definitely the most popular
But not nearly the fastest…
That’s true, but I think the need for raw speed in parsers is seldom
the top priority.

I was frankly amazed how much of the discussion was about speed.
Besides, there’s an awful lot of useful languages that you can’t
do with LL(1). Including Ruby and C…

That said, Treetop is very slow, and we need to improve that.
Fortunately, it’s not very hard to.

On Feb 8, 2008 5:59 PM, Clifford H. [email protected] wrote:

James G. wrote:

On Feb 8, 2008, at 12:27 PM, Eric M. wrote:

Treetop was definitely the most popular
But not nearly the fastest…
That’s true, but I think the need for raw speed in parsers is seldom
the top priority.

I was frankly amazed how much of the discussion was about speed.
Besides, there’s an awful lot of useful languages that you can’t
do with LL(1). Including Ruby and C…

There are techniques to convert LL(k) and LR(k) grammars to LL(1) and LR(1) (factoring, left-recursion removal), but they can make the grammar get large and can make the actions difficult. C shouldn’t be an issue. Ruby is definitely a beast to parse. Starting with the YACC spec, (almost) any LL/recursive-descent/packrat parser will have a hard time dealing with the left-recursion, not to mention all of the lexer states. Starting from scratch is probably a better choice for many LL/recursive-descent/packrat parsers when handling LR grammars (with left-recursion). That being said, I have a way of handling left-recursion directly in my development LL(1/*) parser generator. I haven’t seen any other LL parsers do it.

Eric M. wrote:

There are techniques to convert LL(k) and LR(k) grammars to LL(1) and LR(1)
(factoring, left-recursion removal), but they can make the grammar get large

That’s an understatement! They must magnify the number of rules
by something like M^(k-1), where M is the number of terminals…
or something like that. It’s not reasonable to consider, and still
doesn’t do LL(*) like a packrat does.

C labels require 2 levels of lookahead.

On Feb 8, 2008, at 5:59 PM, Clifford H. wrote:

James G. wrote:

On Feb 8, 2008, at 12:27 PM, Eric M. wrote:

Treetop was definitely the most popular
But not nearly the fastest…
That’s true, but I think the need for raw speed in parsers is
seldom the top priority.

I was frankly amazed how much of the discussion was about speed.

Don’t be. It’s a common pastime for little challenges like this. It’s the easiest metric to quantify, so it makes for fun comparisons. :-)

James Edward G. II

I was frankly amazed how much of the discussion was about speed.

I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

Another aspect would of course be how long it takes to create such a
parser in comparison to the other solutions. Unfortunately, we don’t
have timings for that.

That said, Treetop is very slow, and we need to improve that.

My main concern with Treetop isn’t so much speed but rather the polyglot approach. While the idea per se is pretty cool, it seems to preclude a programmatic generation/extension of a grammar definition. E.g. store the rules as an array of lambdas, programmatically add a new rule on demand, recreate the parser, etc. If I understand it right, this isn’t easily possible with Treetop. I scanned the source code a little bit and it seems Treetop requires the parser definition to be saved on disk. I assume you’re now going to tell me I’m wrong? Please do so. I’d be glad to hear it’s otherwise.

Regards,
Thomas.

ThoML wrote:

I was frankly amazed how much of the discussion was about speed.
I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

Yes, it was interesting. I just thought that a few other things were interesting too ;-)

My main concern with Treetop isn’t so much speed but rather the polyglot approach.

Well, I’m the author of the Polyglot gem, and that doesn’t preclude the idea of executing Ruby blocks during the parse. I’d be much in favor of that actually, but it’s done Nathan’s way as he created Treetop. I like the way ANTLR and grammar let you build the parse tree you want. Or not. But remember, packrat needs parse tree nodes that contain enough information to enable the backtracking.

Since Treetop has the ability to request a custom node (using the
<my_syntax_node_class> annotation) perhaps you can do what you want
by adding code to the constructor in that class? Just remember though,
the node might get discarded during backtracking, so don’t do anything
too permanent :-).

Eg store the rules as an array of lambdas, programmatically add a new
rule on demand, recreate the parser etc. If I understand it right,
this isn’t easily possible with treetop.

I think it would be a sin against the gods of good taste to do that,
especially when the thing that did it might get discarded!

I scanned the source code a little bit and it seems Treetop requires the parser definition to be saved on disk. I assume you’re now going to tell me I’m wrong? Please do so. I’d be glad to hear it’s otherwise.

It’s otherwise. The compiler uses a Builder pattern, and then either
saves that or eval’s it. When Polyglot requires a grammar, it first
passes through the normal (or rubygems+normal) require, which will
load a .rb file if one exists in the path. Otherwise it compiles
the grammar file and eval’s it.

I think, without testing it, that multiple require’s in the 2nd
situation will compile the grammar multiple times, which would be
a bug… I’ll test for that tomorrow… bed now ;-).

Clifford H…

On Feb 9, 2008 12:59 AM, ThoML [email protected] wrote:

I was frankly amazed how much of the discussion was about speed.

I personally found it quite interesting to see how well a hand-crafted
parser could perform. I initially assumed the hackish RE-based
solution would be fastest.

I would have thought that too. I knew my LL(1) hand solution would be near the top, but was surprised that it came out on top (at least on my machine/benchmark), especially considering the RE+eval solutions (especially yours). I think object creation was a big factor. My LL(1) parser created basically nothing but the AST. The match objects likely slowed down the RE solutions (StringScanner was probably best off since the match object is just a string). When you turn off GC, some of the RE solutions beat mine.

Another aspect would of course be how long it takes to create such a parser in comparison to the other solutions. Unfortunately, we don’t have timings for that.

I’m guessing that you could crank out an initial RE solution the fastest. But those solutions were also the ones that failed as more tests were added, since they were hacky. The solutions that took no shortcuts and followed the spec to a T from the beginning may have taken longer initially, but they had fewer issues.

That said, Treetop is very slow, and we need to improve that.

My main concern with Treetop isn’t so much speed but rather the polyglot approach. While the idea per se is pretty cool, it seems to preclude a programmatic generation/extension of a grammar definition. E.g. store the rules as an array of lambdas, programmatically add a new rule on demand, recreate the parser, etc.

My Grammar classes definitely give you that ability. One of the many powers of using Ruby as the DSL.

For example, if you deal with a “,”-separated list terminated by “;” for a variety of item types, you could do something like this:

comma_list = lambda { |item| item.list1(E(?;), E(?,)) }

then use it later like this

string_list = comma_list[string] # string defined earlier
number_list = comma_list[number] # number defined earlier

One of the powers is creating your own generic grammar constructs on the
fly.

Clifford H. wrote:

That said, Treetop is very slow, and we need to improve that.
Fortunately, it’s not very hard to.

I was reading about PEG parsers on the Wikipedia page [1], and wandered along to a site about packrat parsers [2].

One article there [3] seems to imply that while a packrat parser’s
memoization gives linear theoretical time, for almost all real grammars
(and inputs, I presume) it’s actually not necessary.

They suggest that for the cases where there is an exponential time problem, it’s sensible to rely on user-supplied memoization directives in the grammar, or to add memoization selectively and automatically, based on profiling information.

I’m not sure if that’s any help, or if you’ve already come across these
articles. Hopefully they’re of some interest though.

Cheers,
Benjohn

[1]http://en.wikipedia.org/wiki/Parsing_expression_grammar
[2]http://pdos.csail.mit.edu/~baford/packrat/
[3]http://www.mercury.csse.unimelb.edu.au/information/papers/packrat.pdf

Benjohn B. wrote:

One article there [3] seems to imply that while a packrat parser’s
memoization gives linear theoretical time, for almost all real grammars
(and inputs, I presume) it’s actually not necessary.

There’s a bit of a chicken/egg problem here.

Existing computer languages have been created to be easy to parse,
which means they require minimum lookahead- one or two tokens.

In part, that’s because such languages are easier for humans to
parse as well, and in part because parser generators couldn’t
produce efficient parsers for them… until now. Now that packrat
is on the scene, we’re free to create more natural looking grammars.

They suggest that for the cases where there is an exponential time problem, it’s sensible to rely on user-supplied memoization directives in the grammar,

That requires a lot more user understanding (of where to memoize),
or a whole lot more analysis. ANTLR takes the 2nd approach, but it’s
a difficult task that still causes many surprises. Treetop is slow,
but it’s easily accessible to Joe Average without special skills and
without major surprises. I think it fills an important niche.

or to add memoization selectively and automatically,
based on profiling information.

That would be cool. Write an exponential-time parser, run it over
your test cases in learning mode, and then regenerate it to get a
fast parser.

Clifford H…

Clifford H. wrote:

In part, that’s because such languages are easier for humans to
parse as well, and in part because parser generators couldn’t
produce efficient parsers for them… until now. Now that packrat
is on the scene, we’re free to create more natural looking grammars.

That’s an interesting suggestion - that languages are as they are to a
great extent because of the difficulty of implementing their parsers, so
existing languages don’t need memoized parsing. Certainly a compelling
point. Do you see newer parsing opening up language design then?

I bumped into a few suggestions of pluggable language syntax while reading about PEG, and I definitely like that idea.

Cheers,
B