Parsing JSON (#155)

bbazzarrakk · February 6, 2008, 3:47pm

On Feb 6, 2008 2:05 AM, tho_mica_l [email protected] wrote:

I think the benefit of Regexp goes down the more of them you are using.

They are a convenient solution for a wide range of problems.

You are taking my statement out of context. I was referring to
performance. If you are using a bunch of little regex’s, you won’t see
much
performance benefit compared to matching a character at a time with a 1
character lookahead (LL(1) parsing). I think you can write a pure ruby
LL(1) lexer/parser that usually meets or beats a regex version.

But yes, using regex’s are are convenient for a wide range of problems.

With respect to extending ruby’s regexp, I’d rather be interested in

embedded code snippets that are evaluated when a clause matches. I
didn’t know perl (perlre - Perl regular expressions - Perldoc Browser) does this until
I read about it in the ragel manual. It seems there are two kinds of
embedded code: (1) evaluate the code and match with null width; (2)
use the return value of the code as regexp. This sound quite
interesting to me. It’s currently listed under “A-3. Lacked features
compare with perl 5.8.0” in the ruby RE manual.

Yep. I used this in perl 5+ years ago (before coming to ruby) to write
recursive regex’s (i.e. paren matching).

But, this doesn’t address the issue I was referring to - applying a
Regexp
to something other than a String.

bbazzarrakk · February 6, 2008, 9:05am

I think the benefit of Regexp goes down the more of them you are using.

They are a convenient solution for a wide range of problems.

With respect to extending ruby’s regexp, I’d rather be interested in
embedded code snippets that are evaluated when a clause matches. I
didn’t know perl (perlre - Perl regular expressions - Perldoc Browser) does this until
I read about it in the ragel manual. It seems there are two kinds of
embedded code: (1) evaluate the code and match with null width; (2)
use the return value of the code as regexp. This sound quite
interesting to me. It’s currently listed under “A-3. Lacked features
compare with perl 5.8.0” in the ruby RE manual.

Regards,
Thomas.

bbazzarrakk · February 6, 2008, 5:17pm

On Feb 5, 2008, at 5:27 PM, Eric M. wrote:

ch/s F E gzip author/gem

186042 5 0 685 James Edward G. II (RE, recursive descent)

Here’s a revised version of this parser that passes all of the tests:

#!/usr/bin/env ruby -wKU

require “strscan”

class JSONParser
AST = Struct.new(:value)

def parse(input)
@input = StringScanner.new(input.strip)
parse_value.value
ensure
@input.eos? or error(“Unexpected data”)
end

private

def parse_value
trim_space
parse_object or
parse_array or
parse_string or
parse_number or
parse_keyword or
error(“Illegal JSON value”)
ensure
trim_space
end

def parse_object
if @input.scan(/{\s*/)
object = Hash.new
more_pairs = false
while key = parse_string
@input.scan(/\s*:\s*/) or error(“Expecting object separator”)
object[key.value] = parse_value.value
more_pairs = @input.scan(/\s*,\s*/) or break
end
error(“Missing object pair”) if more_pairs
@input.scan(/\s*}\s*/) or error(“Unclosed object”)
AST.new(object)
else
false
end
end

def parse_array
if @input.scan(/[\s*/)
array = Array.new
more_values = false
while contents = parse_value rescue nil
array << contents.value
more_values = @input.scan(/\s*,\s*/) or break
end
error(“Missing value”) if more_values
@input.scan(/\s*]\s*/) or error(“Unclosed array”)
AST.new(array)
else
false
end
end

def parse_string
if @input.scan(/"/)
string = String.new
while contents = parse_string_content || parse_string_escape
string << contents.value
end
@input.scan(/"\s*/) or error(“Unclosed string”)
AST.new(string)
else
false
end
end

def parse_string_content
@input.scan(/[^\"]+/) and AST.new(@input.matched)
end

def parse_string_escape
if @input.scan(%r{\["\/]})
AST.new(@input.matched[-1])
elsif @input.scan(/\[bfnrt]/)
AST.new(eval(%Q{"#{@input.matched}"}))
elsif @input.scan(/\u[0-9a-fA-F]{4}/)
AST.new([Integer(“0x#{@input.matched[2…-1]}”)].pack(“U”))
else
false
end
end

def parse_number
@input.scan(/-?(?:0|[1-9]\d*)(?:.\d+)?(?:[eE][±]?\d+)?\b/) and
AST.new(eval(@input.matched))
end

def parse_keyword
@input.scan(/\b(?:true|false|null)\b/) and
AST.new(eval(@input.matched.sub(“null”, “nil”)))
end

def trim_space
@input.scan(/\s+/)
end

def error(message)
if @input.eos?
raise “Unexpected end of input.”
else
raise “#{message}: #{@input.peek(@input.string.length)}”
end
end
end

END

James Edward G. II

bbazzarrakk · February 7, 2008, 12:33pm

On Feb 6, 2008 9:42 AM, James G. [email protected] wrote:

Here’s a revised version of this parser that passes all of the tests:

#!/usr/bin/env ruby -wKU

Hi James,

I applied these fixes to my optimization of your StringScanner recursive
descent parser and applied a few more. I also got it working in
ruby1.9.
It is at the end of this message. I couldn’t figure out why yours
wasn’t
working with 1.9.

Also, I did some more benchmarking. This time I ran 4 passes for each
dataset and I picked the best bandwidth (ch/s) among the 2 longest
running
datasets. I limited the depth to 10 because only the fastest 2 parsers
need
deeper (to not parse <1s) and ruby seems to do strange things with
garbage
collection. All of the fastest parsers are shown with results from the
depth=10 dataset.

I ran everything I could easily on 1.9 (didn’t load any gems for 1.9).

Here are the results with this benchmarking that I think is more
reliable:

1.8 1.9 F E gzip author/gem

 -      - 5 0  545 Pawel R. (RE, recursive descent)
NL      - 6 0  796 James K. (RE+StringIO, recursive descent)
NL     NL 5 1  683 Justin E. (RE lexer, ruby eval, fixes)
NL      - 3 2 1074 James Edward G. II (peggy)

4662 SF 0 0 608 Eric M. (Grammar0, no code-gen)
5264 - 6 2 588 ghostwheel (ghostwheel, fixes)
5764 - 2 0 706 Eric I (Treetop, unicode broken)
8246 - 2 0 631 Steve (Treetop, mismatches in benchmark)
11856 - 1 1 545 Clifford H. (Treetop)
25841 - 0 0 842 Alexander Stedile (RE, recursive descent)
57217 - 0 0 723 Eric M. (Grammar, v0.5)
152205 NL 2 1 660 Paolo B. (RE, recursive descent)
- 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
202090 - 0 0 774 James Edward G. II (RE, recursive descent)
233646 - 0 0 653 Eric M. (Grammar, unreleased)
270508 211266 1 0 - json
299259 - 6 0 - fjson (w/ C ext)
351163 235726 0 0 607 James & Eric M (RE, recursive)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)
430251 106285 0 0 837 Eric M. (LL(1), recursive descent)
2079190 - 0 0 653 Eric M. (Grammar, unreleased, ruby2cext)
8534416 3217453 0 0 - json (w/ C ext)

NL: non-linear, SF: seg-fault

Here are some things to note:

on closer inspection, some of the slower solutions are nonlinear
(usually
quadratic runtime). ch/s is not meaningful since it just gets worse
with
larger data sets.
ruby1.9 killed my LL(1) parser performance. The reason for this is
that
characters in 1.9 are full objects, where in 1.8 they are immediate
Fixnum’s. 4X slower is not good. I need to go ask if we can get some
immediate 1-character Strings in ruby 1.9 to solve this. Anything that
worked with characters in 1.8 is going to see the same problem. My
Grammar
classes generate similiar LL(1) parsers and will be affected in the same
way. It works so well in 1.8 because there is so little object creation
(using Regexp’s creates more objects).
Dominick Baton’s ruby2cext is giving my dev Grammar class a 10X
performance boost. Woohoo! Only 4X away from the best hand-crafted C
(not
bad considering the ruby and then the C were autogenerated). I generate
low-level stuff that optimizes well. High-level Regexp’s will have
little
benefit from ruby2cext.

Here is the benchmark code I used (along with the previously posted
RandomJSON class):

parser = JSONParser.new
generator = RandomJSON.new((ARGV[1]||0).to_i)
bandwidth = 0
bandwidth0 = 0
t0 = 0
l = nil
t = nil
max_depth = 10
(max_depth+1).times { |depth|
tree = generator.value(depth, depth%2)
s = tree.inspect
s.gsub!(/=>/, ‘:’)
s.gsub!(/nil/, ‘null’)
tree2 = nil
l = s.length
t = nil
4.times {
Benchmark.bm { |b|
GC.start
t1 = b.report(“#{depth} #{l} “) { tree2 = parser.parse(s) }
GC.start
raise(”#{tree2.inspect}!=#{tree.inspect}”) if tree2!=tree
GC.start
if (!t or t1.real<t.real)
t = t1
end
}
}
bandwidth = l/t.real
puts “#{bandwidth.to_i} chars/second”
break if (t.real>=(ARGV[0]||1).to_f or depth>=max_depth)
if (t.real>t0)
bandwidth0 = bandwidth
t0 = t.real
end
}
bandwidth = bandwidth0 if (bandwidth0>bandwidth)

puts “\n#{bandwidth.to_i} chars/second”

Here is another optimization of James’ solution:

require “strscan”

class JSONParser

def parse(input)
@input = StringScanner.new(input)
@input.scan(/\s*/)
parse_value(out=[])
@input.eos? or error(“Unexpected data”)
out[0]
end

private

def parse_value(out)
if @input.scan(/{\s*/)
object = {}
kv = []
until @input.scan(/}\s*/)
object.empty? or @input.scan(/,\s*/) or error(“Expected ,”)
kv.clear
@input.scan(/“/) or error(“Expected string”)
parse_string(kv)
@input.scan(/:\s*/) or error(“Expecting object separator”)
parse_value(kv)
object[kv[0]] = kv[1]
end
out << object
elsif @input.scan(/[\s*/)
array = []
until @input.scan(/]\s*/)
array.empty? or @input.scan(/,\s*/) or error(“Expected ,”)
parse_value(array)
end
out << array
elsif @input.scan(/”/)
parse_string(out)
elsif @input.scan(/-?(?:0|[1-9]\d*)(?:.\d+)?(?:[eE][±]?\d+)?\s*/)
out << eval(@input.matched)
elsif @input.scan(/true\s*/)
out << true
elsif @input.scan(/false\s*/)
out << false
elsif @input.scan(/null\s*/)
out << nil
else
error(“Illegal JSON value”)
end
end

def parse_string(out)
string = “”
while true
if @input.scan(/[^\“]+/)
string.concat(@input.matched)
elsif @input.scan(%r{\[”\/]})
string << @input.matched[-1]
elsif @input.scan(/\[bfnrt]/)
string << eval(%Q{“#{@input.matched}”})
elsif @input.scan(/\u[0-9a-fA-F]{4}/)
string << @input.matched[2…-1].to_i(16)
else
break
end
end
@input.scan(/"\s*/) or error(“Unclosed string”)
out << string
end

def error(message)
if @input.eos?
raise “Unexpected end of input.”
else
raise “#{message}: #{@input.peek(@input.string.length)}”
end
end
end

bbazzarrakk · February 7, 2008, 1:31pm

Hi,

 - 191529 2 1  445 Thomas Link (RE lexer, ruby eval)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)

May I ask which test are missing here, since it passes all my test.

2079190 - 0 0 653 Eric M. (Grammar, unreleased, ruby2cext)

This made me quite curious. Unfortunately it seems ruby2cext doesn’t
support ruby19 yet. Anyway, will this coming-up version of grammar be
a drop-in replacement/update for grammar 0.5, i.e. is it safe to use
grammar 0.5 now?

BTW I took Paolo’s hacked version of my humble submission and
converted it for use with StringScanner which has better performance
it seems. Since Paolo uses numbered groups, StringScanner is fine. I
slightly modified the regexp to catch something like ‘“a” “b”’.

Regards,
Thomas.

require ‘strscan’

class JSONParser

def initialize
    @rxe = /
        \[|\]|
        \{|\}|
        (:)|
        (,[ \t\r\n]*[}\]])|
        ,|
        ("(?>[^"\\]+|\\(?:u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*"(?![ \t\r

def parse(json)
    scanner = StringScanner.new(json)
    out = []
    until (scanner.skip(/[[:space:][:cntrl:]]*/); scanner.eos?)
        scan = scanner.scan(@rxe)
        if scanner[5] or scanner[2]
            invalid(scanner[2] || scanner[5])
        elsif scanner[1]
            out << '=>'
        elsif scanner[3]
            out << scanner[3].gsub(/#/, '\\\\#')
        elsif scanner[4]
            out << 'nil'
        else
            out << scan
        end
    end
    begin
        return eval(out.join(' '))
    rescue Exception => e
        invalid(json)
    end
end

def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
end

end

bbazzarrakk · February 7, 2008, 4:39pm

On Feb 7, 2008 6:30 AM, tho_mica_l [email protected] wrote:

Hi,
 - 191529 2 1  445 Thomas Link (RE lexer, ruby eval)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)
May I ask which test are missing here, since it passes all my test.

I’m running all tests from this thread. Here are the ones that the
above
solutions failed:

assert_raise(RuntimeError) { @parser.parse(%Q{“a” “b”}) }
assert_equal( “a”, @parser.parse(%Q{“\u#{”%04X" % 97}“}) ) # 97 was
?a,
1.9 compat.
assert_equal(‘#{p 123}’, @parser.parse(%q{”\u0023{p 123}"}))

Maybe the versions of these I have are old.

2079190 - 0 0 653 Eric M. (Grammar, unreleased, ruby2cext)

This made me quite curious. Unfortunately it seems ruby2cext doesn’t
support ruby19 yet. Anyway, will this coming-up version of grammar be
a drop-in replacement/update for grammar 0.5, i.e. is it safe to use
grammar 0.5 now?

I’ll have a compatibility layer that will use the new stuff. You’ll
still
get all of the performance benefits, but this will ease any transition.
Most of the API changes will be minor. It will be a cleaner API.