Parsing JSON (#155)

Clifford H. wrote:

Your example parsers are much harder to read than Treetop however,
compare:
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard

Surely you can sugar that up a bit, something like

  space = Grammar.charset(" \t").discard

What do you mean it mismatches? Also what do you mean by self-checking?
It passes all the unit tests for me.

I think he meant that it doesn't throw exceptions on error. This is the
wrapper I used to make it conform with the provided tests:

class JSONParser

    def initialize
        @parser = JsonParser.new
    end

    def parse(text)
        rv = @parser.parse(text)
        if rv
            return rv.value
        else
            raise RuntimeError
        end
    end

end
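
For illustration, a hedged sketch of how the wrapper behaves (this
assumes the Treetop-generated JsonParser from the solution is loaded;
it is not part of the original post):

    p JSONParser.new.parse('{"key": [1, 2]}')   # => {"key"=>[1, 2]}
    JSONParser.new.parse('not json')            # raises RuntimeError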

Regards,
Thomas.

Joel VanderWerf wrote:

Clifford H. wrote:

Your example parsers are much harder to read than Treetop however,
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
Surely you can sugar that up a bit, something like
space = Grammar.charset(" \t").discard

Well, maybe. I just grabbed the first incomprehensible line from
Eric’s Tcl example grammar, in order to point out a difference in
our meanings for “clean and pure” :-).

Clifford H…

assert_equal('["#{ls -r}"]', @parser.parse(%q{["#{ls -r}"]}))

People should really test their test case. At least I should test mine
before sending them to the list. This line should be:

assert_equal(['#{ls -r}'], @parser.parse(%q{["#{ls -r}"]}))

end
I was slightly surprised to find this in a full-blown parser. I think
this shortcut is a really bad choice here since it makes the parser
execute code on input like ["#{ls -r /}"]. This should definitely be
handled in the char rule. Justin E.'s solution suffers from the same
problem, but he included a comment on this problem in his code.

Here are some random test cases to check this:

assert_raise(RuntimeError) { @parser.parse(%{p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[], p "Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{[p "Busted"]}) }
assert_raise(RuntimeError) { @parser.parse(%{{1 => STDOUT.puts("Busted")}}) }
assert_raise(RuntimeError) { @parser.parse(%{"\u0022; p 123; \u0022Busted"}) }
assert_raise(RuntimeError) { @parser.parse(%{"" p 123; ""}) }
assert_equal("\u0022; p 123; \u0022Busted",
             @parser.parse(%{"\u0022; p 123; \u0022Busted"}))
assert_equal('#{p 123}', @parser.parse(%q{"#{p 123}"}))
assert_equal(['#{ls -r}'], @parser.parse(%q{["#{ls -r}"]}))
assert_equal('#{p 123}', @parser.parse(%q{"\u0023{p 123}"}))
assert_equal('#{p 123}', @parser.parse(%q{"\u0023{p 123}"}))

Regards,
Thomas.

Here is my try using regexes. I use the "copy-on-write trick" from
the suffix tree quiz: the regex is always anchored to the beginning of
the string using \A, and the matched text is discarded using
post_match. In some places where I don't want to discard, I use (?=...).
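
To illustrate the trick (a minimal sketch, not Paolo's actual code;
Scanner and accept are made-up names): anchor with \A so a match can
only happen at the front of the remaining input, then keep only
post_match.

    class Scanner
        def initialize(text)
            @rest = text
        end

        # Consume +regex+ at the start of the remaining input; return
        # the matched text, or nil if it doesn't match there.
        def accept(regex)
            if m = /\A(?:#{regex.source})/.match(@rest)
                @rest = m.post_match   # discard what we consumed
                m[0]
            end
        end
    end

    s = Scanner.new("123, abc")
    s.accept(/\d+/)    # => "123"
    s.accept(/,\s*/)   # => ", "
    s.accept(/\d+/)    # => nil (input now starts with "abc")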

Using Eric’s benchmark I get 36kb/sec, but I haven’t benchmarked any
other solution.

http://pastie.caboo.se/147201

Paolo

tho_mica_l wrote:

end

I was slightly surprised to find this in a full-blown parser. I think
this shortcut is a really bad choice here since it makes the parse
execute code on input like ["#{ls -r /}"].

At least I admitted to cheating :-).

tho_mica_l wrote:

end

def parse(text)
    rv = @parser.parse(text)
    if rv
        return rv.value
    else
        raise RuntimeError
    end
end

end

If you check my version, you’ll see that you can raise a meaningful
RuntimeError by raising @parser.failure_reason - this produces an
error message that says how far the parse progressed, and which tokens
might have allowed it to progress further.
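
A hedged sketch of the wrapper Clifford describes (it assumes the
Treetop-generated JsonParser; failure_reason is Treetop's diagnostic
accessor on the parser instance):

    class JSONParser
        def initialize
            @parser = JsonParser.new
        end

        def parse(text)
            rv = @parser.parse(text)
            raise RuntimeError, @parser.failure_reason unless rv
            rv.value
        end
    end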

Clifford H…

On Feb 4, 11:01 am, Paolo B. [email protected] wrote:

Here is my try using regexes. I use the “copy-on-write trick” from
the suffix tree quiz: the regex is always anchored to the beginning of
the string using \A, and the matched text is discarded using
post_match. In some places where I don’t want to discard I use (?
=…).

Using Eric’s benchmark I get 36kb/sec, but I haven’t benchmarked any
other solution.

http://pastie.caboo.se/147201

For what it’s worth, I get 144kb/sec from tho_mica_l’s solution, after
converting it to Ruby 1.8 like this:

class JSONParser

RXE = /
    \[|\]|
    \{|\}|
    (:)|
    (,\s*[}\]])|
    ,|
    ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
    -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
    true|
    false|
    (null)|
    (?>[[:space:][:cntrl:]]+)|
    ((?>.+))
    /xmu

def parse(json)
    ruby = json.gsub(RXE) do |t|
        if !$5.nil? || !$2.nil? then invalid($5.nil? ? $2 : $5)
        elsif !$4.nil?          then 'nil'
        elsif !$1.nil?          then '=>'
        elsif !$3.nil?          then $3.gsub(/#/, '\\\\#')
        else
            t
        end
    end
    begin
        return eval(ruby)
    rescue Exception => e
        invalid(json)
    end
end

def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
end

end
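
For reference, a quick hypothetical check of the converted parser (not
from the original post):

    p JSONParser.new.parse('{"a": [1, 2.5, true, null]}')
    # => {"a"=>[1, 2.5, true, nil]}
    JSONParser.new.parse('p "Busted"')   # raises RuntimeError: Invalid JSON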

Just an FYI solution: JSON is a subset of YAML, so

data = YAML.load(json)

T.

Just an FYI solution, JSON is a subset of YAML. So

data = YAML.load(json)

Interesting suggestion, but ruby seems to disagree:

irb(main):006:0> YAML.load_file 'test.json'
ArgumentError: syntax error on line 0, col 99:
`{"a":2,"b":3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",3.14],"COUNT":666,"e":{"foo":"bar"},"foo":"B\u00e4r","g":"\u677e\u672c\u884c\u5f18","h":1000.0,"bar":"\u00a9 \u2260 \u20ac!","i":0.001,"j":"\ud840\udc01"}'
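
The failure fits a known YAML/JSON gap: Syck, the YAML engine in Ruby
1.8, wants a space after each ":" in a flow mapping, which JSON does
not require. A hedged sketch of the difference:

    require 'yaml'

    YAML.load('{"a": 1, "b": [2, 3]}')   # => {"a"=>1, "b"=>[2, 3]}
    YAML.load('{"a":1}')                 # ArgumentError under Syck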

On Feb 4, 2008 4:29 AM, Paolo B. [email protected] wrote:


Thank you. Now we can compare apples to apples with this on 1.8.6. Here
are the results with these two new benchmarks on my machine:

   ch/s  author/gem
   ----  ----------
         Pawel R. (RE, mismatch)
   3214  Justin E. (RE lexer + ruby eval, fixed number parsing)
   4054  Eric M. (Grammar0, no lexer, no parser generation)
   4078  Eric I (Treetop, unicode broken)
   6534  oksteev (Treetop, mismatches in benchmark)
   8313  Clifford H. (Treetop, had to remove handling of "/")
  54586  Eric M. (Grammar, no lexer, v0.5)
 137989  Paolo B. (RE)
 166041  Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
 220289  json
 223486  Eric M. (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 333368  Thomas Link & Paolo B. (RE + eval, unicode broken)
 553081  Eric M. (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)

Looks like the above solution is the fastest pure-ruby solution out
there. It's interesting that the performance is better in 1.8.6 than
1.9. The "eval" solution is a nice trick as it uses ruby's C parser to
act as the parser. But this isn't viable for most other languages you
might parse; maybe it's fine for this JSON case though. The "eval" does
seem a bit dangerous though.

On Feb 3, 2008 11:45 PM, Clifford H. [email protected] wrote:

Eric M. wrote:

I’m biased,

First up, it sounds like your package is a significant achievement,
and I don’t want anything here to imply I’m not commending you on
it. But different approaches suit different goals…

Thanks!

   skip space
     [ \t]
   end

I agree. It is ugly. The ugliness can be easily fixed though:

  1. When making a parser, derive from Grammar so that namespace is
     available.
  2. E = Element, so that you can have a shorthand for the most common
     atomic Grammar.
  3. Performance-wise, I found that using a Set for matching was slower
     than just having a few (ranged) alternatives.
  4. Grammar::Element.[] is simply an alias for Grammar::Element.new. My
     plan is that in the next release, I'll make all the classes
     "callable" by defining methods of their name that call new. This is
     better when .new can take a block (like Recurse).
  5. I made a bad choice of #== for the matching method for the pattern
     argument. #=== would be better and is what I'm using in my dev code
     (and the Grammar0 I posted). This would allow it to work out of the
     box with Ranges too. So, you could do this:
     alpha = E(?a..?z) | E(?A..?Z)

So, doing 1-3, in the released grammar v0.5, you'd have this instead:

space = (E[?\s] | E[?\t]).discard
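
For context, #=== is Ruby's general pattern hook, which is why switching
to it buys Range (and Regexp, and Class) patterns for free:

    (?a..?z) === ?q    # => true; on 1.8, ?q is a Fixnum inside the Range
    /\d+/ === "42"     # => true; Regexp#=== is the same hook
    Integer === 42     # => true; so is Class#===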

(ok, so TT doesn’t have a skip keyword yet, soon…)

That’s what I mean by clean and pure anyhow. Not pure Ruby,
but pure grammar.

Nathan’s concept is that “grammar” should become a Ruby keyword,
and should introduce the grammar meta-grammar, rather than using
existing Ruby syntax. I think that polyglot approach is much better
than the DSL approach, since the Ruby grammar isn’t well-suited to
many important languages.

I don’t think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.

The point is, if/when Ruby uses a parser whose grammar is composable

has good things about it also - like your filters etc…

Ruby DSL is far too awkward to express CQL - which is my primary
project at present.

Don't know anything about CQL to answer. I would like to make what I
have general enough.

  • integration of lexer/parser with other code is seamless since
    everything is ruby

Nice, but it won’t help me when I need to target C# or Java.
Most important grammars need to be compilable to more than
one target language. Treetop’s not there yet, but it’s easy
to see how to do it, whereas it won’t ever happen with yours
AFAICS.

It definitely could happen with mine. I do ruby code generation now, but
could have another Grammar-like class (or another mechanism) to generate
code for another language. I would have to add a new method for dumping
out the generated code (right now it is only eval'ed immediately). In my
dev code, I'm also using ruby2cext to compile it on the fly. I'll try to
have a general language infrastructure in the next release.

  • infinite backtracking (not by default because of performance and
    functionality, you specify what needs it)

Does this memoize, or give exponential performance? Or is it optional?

In general, the parsers it generates are LL(1). It makes decisions one
character/token at a time. But, if you use the #lookahead method on any
grammar, it will give a new grammar that handles the case when the
grammar fails in the middle (backtracks to the position where the
grammar started). For most languages I've seen this is rarely needed,
but it is in your back pocket to be used when needed. I haven't done
anything to try to optimize it as it isn't needed that often.

With 'grammar' you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything.

Yes, this is excellent, and something I’ve always wanted from a parser
generator. You should add your json parsers to the examples in the gem!

Yes, next release.

On Feb 3, 2008, at 10:51 AM, Eric M. wrote:

On Feb 1, 2008 7:55 AM, Ruby Q. [email protected] wrote:

  http://json.org/

This doc is missing two things: 1) exactly what is allowed for the
top-level json,

I guess it does hint at this when it says:

“JSON is built on two structures:”

and goes on to describe object and array. I just didn’t interpret it
well.

Thanks for clarifying.

and 2) where can whitespace appear.

Good point.

This is more complete:

http://www.ietf.org/rfc/rfc4627.txt

Good resource. Thanks for making me aware of it.

James Edward G. II

This works because the “null” token does not start with any letter
in “true” or “false”. “fnull” would be happily converted to “fnil”,
but eval catches that luckily.

I'm not sure I understand your argument. f will always be considered
illegal. It will never reach the eval(). Only text matching a clause not
defined illegal in the rx gets fed to eval(). You could also wrap the
atoms between \b (e.g. \bnull\b), but it seemed unnecessary. Show me an
example of malicious code that passes through.

The “eval” does seem a bit dangerous though.

Especially if you take it a bit to the edge like this (plug into the
other solution):

RXE = /
    [\[\]\{\}[:space:][:cntrl:]truefals]+|
    (:)|
    (?:,(?>\s*)(?![}\]]))|
    ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
    -?(?=\d)(?>0|[1-9]\d*)(?>\.\d+)?(?>[Ee][+-]?\d+)?(?=\D|$)|
    (null)|
    (.+)
/xmu

def parse(json)
    ruby = json.gsub(RXE) do |t|
        if !$4.nil?         then invalid($4)
        elsif !$3.nil?      then 'nil'
        elsif !$1.nil?      then '=>'
        elsif !$2.nil?      then $2.gsub(/#/, '\\\\#')
        else
            t
        end
    end
    begin
        return eval(ruby)
    rescue Exception => e
        invalid(json)
    end
end

This works because the “null” token does not start with any letter
in “true” or “false”. “fnull” would be happily converted to “fnil”,
but eval catches that luckily.
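
A quick illustration of why that last line of defense holds: "fnull"
gsubs to "fnil", which eval rejects as an undefined identifier instead
of executing anything.

    eval("fnil")   # NameError, turned into invalid(json) by the rescue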

As ugly as it is, it cranks in 200kb/sec on my machine, which should
be more than 400kb/sec on yours. 25-30% of C speed is quite
impressive (I had to say this since I was bitching about Ruby’s speed
last week…).

Paolo

This works because the “null” token does not start with any letter
in “true” or “false”. “fnull” would be happily converted to “fnil”,
but eval catches that luckily.

Well, I missed the changes in the regexp and thus missed your point.

The differences between ruby18 and ruby19 are quite interesting
though. Can somebody with deeper knowledge of ruby19 affirm that the
“copy on write” strategy is still in use?

Eric M. wrote:

compare:
space = Grammar::Element[Set[?\ ,?\t].duck!(:==,:include?)].discard
So, doing 1-3, in the released grammar v0.5, you'd have this instead:
space = (E[?\s] | E[?\t]).discard

Definitely an improvement!

Nathan’s concept is that “grammar” should become a Ruby keyword,
I don’t think ruby needs any new keywords. It already has more than it
needs in my opinion. There is enough power with just classes, methods,
blocks, and operator overloading.

Part of the point of using a packrat-style parser is that there are
no true keywords, meaning words that must only be used in their KW
places. Many of Ruby’s words are like that, of course. But my point
was that once you say “grammar”, you’re now talking a different
language, which can use as much and only as much of the Ruby
grammar as it needs. And when inside a rule you say {, you’re back
in Ruby grammar… etc.

The normal lexer/parser split makes that fairly hard to do, as the
lexer needs to know whether to return the KW meaning or a general
identifier.

Ruby DSL is far too awkward to express CQL - which is my primary
project at present.
Don’t know anything about CQL to answer. I would like to make what I have
general enough.

Take a look at the ORM diagrams under the images directory, and
then at the CQL version of the same models, under
http://activefacts.rubyforge.org/svn/examples/. You’ll see a lot
of what looks like unparsable plain English… but it’s not. In
fact, the language is much harder to parse than any of those examples;
I don't have any published examples of queries (these are just
declarations). The CQL parser is in
http://activefacts.rubyforge.org/svn/lib/activefacts/cql/ - you
might be able to see where it’s hard. You can be many tokens into
recognising a fact_clause before realising you’re looking at a
“condition”.

It’s possible that with your lookahead it would be possible though.

to see how to do it, whereas it won’t ever happen with yours AFAICS.
It definitely could happen with mine. I do ruby code generation now, but
could have another Grammar-like class

Cool! I didn’t realize you were generating code from this.

But, if you use the #lookahead method on any
grammar, it will give a new grammar that handles the case when the grammar
fails in the middle (backtracks to the position where the grammar started).

You mean it backtracks to the start of the file? Or to where you called
lookahead?

Clifford H…

On Feb 4, 2008 12:14 PM, Clifford H. [email protected] wrote:

I don’t think ruby needs any new keywords. It already has more than it

I definitely need to go learn about packrat/PEG stuff. Sounds
interesting after looking at Wikipedia. Still don't really understand
LR/LALR parsers. My main focus has been LL/recursive-descent, and PEG
sounds like it's recursive-descent.

The normal lexer/parser split makes that fairly hard to do, as the
lexer needs to know whether to return the KW meaning or a general
identifier.

Fortunately, in my grammar package the lexer and parser can be specified
in exactly the same way. Also, you can use anything for tokens. If you
are using a pattern to match them, it will just use the expression
(pattern===next_token) to see if it is a match. So, in the lexer, you
might generate a String for any identifier/keyword. The lexer wouldn't
care whether it is a keyword or not. In the parser, you then could use
this to match an arbitrary identifier:

E(String) # (String===next_token) is true if next_token is a String

or if you wanted to match a keyword, you'd have this type of Grammar:

E("begin") # ("begin"===next_token) is true if next_token=="begin"

You could also have some arbitrary matching function by defining your
own pattern class (with a #== (v0.5) or #=== (dev) method).
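
For instance, a hypothetical pattern class (made-up name, following the
#=== convention of the dev code) that matches any token from a fixed
keyword set:

    class KeywordSet
        def initialize(*words)
            @words = words
        end

        def ===(token)
            @words.include?(token)
        end
    end

    kw = KeywordSet.new("begin", "end", "def")
    kw === "begin"   # => true
    kw === "foo"     # => false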

fact, the language is much harder to parse than any of those examples,
I don’t have any published examples of queries (these are just
declarations). The CQL parser is in
http://activefacts.rubyforge.org/svn/lib/activefacts/cql/ - you
might be able to see where it’s hard. You can be many tokens into
recognising a fact_clause before realising you’re looking at a
“condition”.

It’s possible that with your lookahead it would be possible though.

At one time I made a mini English parser, so this seems quite doable.

fails in the middle (backtracks to the position where the grammar
started).

You mean it backtracks to the start of the file? Or to where you called
lookahead?

Just back to where that lookahead Grammar started parsing. For other
Grammars, there is only a one char/token lookahead. If it matches the
first char/token, it is committed to that and will give an exception if
a mismatch occurs later. When you apply lookahead to a Grammar, it
starts buffering the input and will rescue these exceptions. It simply
rewinds the input to the point where this lookahead Grammar started
parsing when an exception is found. That's all there is to it.
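
A minimal sketch of that rewind-on-failure idea (not Eric's actual API;
ParseError and lookahead are made-up names): remember the stream
position, try the sub-parser, and restore the position on a mismatch.

    require 'stringio'

    ParseError = Class.new(StandardError)

    def lookahead(io)
        start = io.pos
        begin
            yield io            # run the sub-grammar against the stream
        rescue ParseError
            io.pos = start      # rewind to where this lookahead began
            nil
        end
    end

    io = StringIO.new("abc")
    lookahead(io) { |s| raise ParseError unless s.getc == ?a && s.getc == ?z }
    io.pos   # => 0 -- the failed attempt consumed nothing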

When a lexer is used, I don't think this backtracking is needed that
often. I guess that is why this memoization is so important with
packrat/PEG parsers - no lexer.

I have to think about whether memoization would apply or not. It may
work since my stuff carries around very little state (mainly just with
regards to the input and output streams).

On Feb 1, 2008 7:55 AM, Ruby Q. [email protected] wrote:

In honor of that, this week’s Ruby Q. is to write a parser for JSON.

Here is another solution of mine:

http://pastie.caboo.se/147505

In this one, I just made a fast hand-built recursive-descent/LL(1)
parser. This is the kind of parser that I'm trying to get my 'grammar'
package to approach (using lots of optimizations). It uses no Regexp or
ruby eval (both of which have compiled C to help speed). And yet, it is
the fastest pure-ruby JSON parser we've seen (see the recursive descent
line below):

   ch/s  author/gem
   ----  ----------
         Pawel R. (RE, mismatch)
   3214  Justin E. (RE lexer + ruby eval, fixed number parsing)
   4054  Eric M. (Grammar0, no lexer, no parser generation)
   4078  Eric I (Treetop, unicode broken)
   6534  oksteev (Treetop, mismatches in benchmark)
   8313  Clifford H. (Treetop, had to remove handling of "/")
  17320  Alexander Stedile (RE)
  54586  Eric M. (Grammar, no lexer, v0.5)
 137989  Paolo B. (RE)
 166041  Thomas Link (RE lexer + ruby eval, ruby 1.9 results)
 220289  json
 223486  Eric M. (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 333368  Thomas Link & Paolo B. (RE + eval, unicode broken)
 388670  Eric M. (hand-built recursive descent)
 553081  Eric M. (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)

JSON hand-built recursive descent/LL(1) parser, by Eric M.

require 'stringio'

class JSONParser

def parse(s)
    @next = (@io=StringIO.new(s)).getc
    ws
    value(out=[])
    ws
    raise("EOF expected") if @next
    raise(out.inspect) unless out.length==1
    out[0]
end

def error(expected, found)
    raise("expected #{expected}, found #{found ? ("'" << found << ?\') : 'EOF'}")
end

def value(out)
    if ?\[.equal?(@next)
        # array
        @next = @io.getc
        ws
        a = []
        unless ?\].equal?(@next)
            value(a)
            ws
            until ?\].equal?(@next)
                ?\,.equal?(@next) ? (@next = @io.getc) : error("','", @next)
                ws
                value(a)
                ws
            end
        end
        @next = @io.getc
        out << a
    elsif ?{.equal?(@next)
        # object
        @next = @io.getc
        ws
        h = {}
        unless ?}.equal?(@next)
            ?".equal?(@next) ? string(kv = []) : error("a string", @next)
            ws
            ?:.equal?(@next) ? (@next = @io.getc) : error("':'", @next)
            ws
            value(kv)
            ws
            h[kv[0]] = kv[1]
            until ?}.equal?(@next)
                ?,.equal?(@next) ? (@next = @io.getc) : error("','", @next)
                ws
                ?".equal?(@next) ? string(kv.clear) : error("a string", @next)
                ws
                ?:.equal?(@next) ? (@next = @io.getc) : error("':'", @next)
                ws
                value(kv)
                ws
                h[kv[0]] = kv[1]
            end
        end
        @next = @io.getc
        out << h
    elsif (?a..?z) === @next
        # boolean or null
        (s = "") << @next
        @next = @io.getc
        while (?a..?z) === @next
            s << @next; @next = @io.getc
        end
        out << case s
            when "true" then true
            when "false" then false
            when "null" then nil
            else error("'true' or 'false' or 'null'", s)
        end
    elsif ?".equal?(@next)
        string(out)
    else
        # number
        n = ""
        (n << @next; @next = @io.getc) if ?-.equal?(@next)
        ?0.equal?(@next) ? (n << @next; @next = @io.getc) : digits(n)
        (?\..equal?(@next) ?
            (n << @next; @next = @io.getc; digits(n); exp(n); true) :
            exp(n)) ?
            (out << n.to_f) :
            (out << n.to_i)
    end
end

# Flattening any of the methods below will improve performance further

def ws
    @next = @io.getc while (case @next; when ?\s, ?\t, ?\n, ?\r; true; end)
end

def digits(out)
    (?0..?9) === @next ? (out << @next; @next = @io.getc) : error("a digit", @next)
    while (?0..?9) === @next; (out << @next; @next = @io.getc); end
    true
end

def exp(out)
    (case @next; when ?e, ?E; true; end) ? (out << @next; @next = @io.getc) :
        return
    (out << @next; @next = @io.getc) if (case @next; when ?-, ?+; true; end)
    digits(out)
end

def string(out)
    # we've already verified the starting "
    @next = @io.getc
    s = ""
    until ?\".equal?(@next)
        if ?\\.equal?(@next)
            @next = @io.getc
            case @next
            when ?\", ?\\, ?\/ then (s << @next; @next = @io.getc)
            when ?b then (s << ?\b; @next = @io.getc)
            when ?f then (s << ?\f; @next = @io.getc)
            when ?n then (s << ?\n; @next = @io.getc)
            when ?r then (s << ?\r; @next = @io.getc)
            when ?t then (s << ?\t; @next = @io.getc)
            when ?u
                @next = @io.getc
                u = ""
                4.times {
                    case @next
                    when ?0..?9, ?a..?f, ?A..?F
                        u << @next; @next = @io.getc
                    else
                        error("a hex character", @next)
                    end
                }
                s << u.to_i(16)
            else
                error("a valid escape", @next)
            end
        else
            error("a character", @next) unless @next
            s << @next; @next = @io.getc
        end
    end
    @next = @io.getc
    out << s
end

end
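
For reference, a quick hypothetical run (the character-literal tricks
above assume Ruby 1.8, where IO#getc and ?x both yield Fixnums):

    p JSONParser.new.parse('[1, {"a": true}, "x"]')
    # => [1, {"a"=>true}, "x"] under Ruby 1.8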