On Feb 6, 2008 9:42 AM, James G. [email protected] wrote:
Here’s a revised version of this parser that passes all of the tests:
#!/usr/bin/env ruby -wKU
Hi James,
I applied these fixes to my optimization of your StringScanner recursive
descent parser and applied a few more. I also got it working in
ruby1.9.
It is at the end of this message. I couldn’t figure out why yours
wasn’t
working with 1.9.
Also, I did some more benchmarking. This time I ran 4 passes for each
dataset and I picked the best bandwidth (ch/s) among the 2 longest
running
datasets. I limited the depth to 10 because only the fastest 2 parsers
need
deeper (to not parse <1s) and ruby seems to do strange things with
garbage
collection. All of the fastest parsers are shown with results from the
depth=10 dataset.
I ran everything I could easily on 1.9 (didn’t load any gems for 1.9).
Here are the results with this benchmarking that I think is more
reliable:
1.8 1.9 F E gzip author/gem
- - 5 0 545 Pawel R. (RE, recursive descent)
NL - 6 0 796 James K. (RE+StringIO, recursive descent)
NL NL 5 1 683 Justin E. (RE lexer, ruby eval, fixes)
NL - 3 2 1074 James Edward G. II (peggy)
4662 SF 0 0 608 Eric M. (Grammar0, no code-gen)
5264 - 6 2 588 ghostwheel (ghostwheel, fixes)
5764 - 2 0 706 Eric I (Treetop, unicode broken)
8246 - 2 0 631 Steve (Treetop, mismatches in benchmark)
11856 - 1 1 545 Clifford H. (Treetop)
25841 - 0 0 842 Alexander Stedile (RE, recursive descent)
57217 - 0 0 723 Eric M. (Grammar, v0.5)
152205 NL 2 1 660 Paolo B. (RE, recursive descent)
- 191529 2 1 445 Thomas Link (RE lexer, ruby eval)
202090 - 0 0 774 James Edward G. II (RE, recursive descent)
233646 - 0 0 653 Eric M. (Grammar, unreleased)
270508 211266 1 0 - json
299259 - 6 0 - fjson (w/ C ext)
351163 235726 0 0 607 James & Eric M (RE, recursive)
359665 279539 3 0 405 Thomas & Paolo (RE + eval)
430251 106285 0 0 837 Eric M. (LL(1), recursive descent)
2079190 - 0 0 653 Eric M. (Grammar, unreleased, ruby2cext)
8534416 3217453 0 0 - json (w/ C ext)
NL: non-linear, SF: seg-fault
Here are some things to note:
-
on closer inspection, some of the slower solutions are nonlinear
(usually
quadratic runtime). ch/s is not meaningful since it just gets worse
with
larger data sets.
-
ruby1.9 killed my LL(1) parser performance. The reason for this is
that
characters in 1.9 are full objects, where in 1.8 they are immediate
Fixnum’s. 4X slower is not good. I need to go ask if we can get some
immediate 1-character Strings in ruby 1.9 to solve this. Anything that
worked with characters in 1.8 is going to see the same problem. My
Grammar
classes generate similiar LL(1) parsers and will be affected in the same
way. It works so well in 1.8 because there is so little object creation
(using Regexp’s creates more objects).
-
Dominick Baton’s ruby2cext is giving my dev Grammar class a 10X
performance boost. Woohoo! Only 4X away from the best hand-crafted C
(not
bad considering the ruby and then the C were autogenerated). I generate
low-level stuff that optimizes well. High-level Regexp’s will have
little
benefit from ruby2cext.
Here is the benchmark code I used (along with the previously posted
RandomJSON class):
parser = JSONParser.new
generator = RandomJSON.new((ARGV[1]||0).to_i)
bandwidth = 0
bandwidth0 = 0
t0 = 0
l = nil
t = nil
max_depth = 10
(max_depth+1).times { |depth|
tree = generator.value(depth, depth%2)
s = tree.inspect
s.gsub!(/=>/, ‘:’)
s.gsub!(/nil/, ‘null’)
tree2 = nil
l = s.length
t = nil
4.times {
Benchmark.bm { |b|
GC.start
t1 = b.report(“#{depth} #{l} “) { tree2 = parser.parse(s) }
GC.start
raise(”#{tree2.inspect}!=#{tree.inspect}”) if tree2!=tree
GC.start
if (!t or t1.real<t.real)
t = t1
end
}
}
bandwidth = l/t.real
puts “#{bandwidth.to_i} chars/second”
break if (t.real>=(ARGV[0]||1).to_f or depth>=max_depth)
if (t.real>t0)
bandwidth0 = bandwidth
t0 = t.real
end
}
bandwidth = bandwidth0 if (bandwidth0>bandwidth)
puts “\n#{bandwidth.to_i} chars/second”
Here is another optimization of James’ solution:
require “strscan”
class JSONParser
def parse(input)
@input = StringScanner.new(input)
@input.scan(/\s*/)
parse_value(out=[])
@input.eos? or error(“Unexpected data”)
out[0]
end
private
def parse_value(out)
if @input.scan(/{\s*/)
object = {}
kv = []
until @input.scan(/}\s*/)
object.empty? or @input.scan(/,\s*/) or error(“Expected ,”)
kv.clear
@input.scan(/“/) or error(“Expected string”)
parse_string(kv)
@input.scan(/:\s*/) or error(“Expecting object separator”)
parse_value(kv)
object[kv[0]] = kv[1]
end
out << object
elsif @input.scan(/[\s*/)
array = []
until @input.scan(/]\s*/)
array.empty? or @input.scan(/,\s*/) or error(“Expected ,”)
parse_value(array)
end
out << array
elsif @input.scan(/”/)
parse_string(out)
elsif @input.scan(/-?(?:0|[1-9]\d*)(?:.\d+)?(?:[eE][±]?\d+)?\s*/)
out << eval(@input.matched)
elsif @input.scan(/true\s*/)
out << true
elsif @input.scan(/false\s*/)
out << false
elsif @input.scan(/null\s*/)
out << nil
else
error(“Illegal JSON value”)
end
end
def parse_string(out)
string = “”
while true
if @input.scan(/[^\“]+/)
string.concat(@input.matched)
elsif @input.scan(%r{\[”\/]})
string << @input.matched[-1]
elsif @input.scan(/\[bfnrt]/)
string << eval(%Q{“#{@input.matched}”})
elsif @input.scan(/\u[0-9a-fA-F]{4}/)
string << @input.matched[2…-1].to_i(16)
else
break
end
end
@input.scan(/"\s*/) or error(“Unclosed string”)
out << string
end
def error(message)
if @input.eos?
raise “Unexpected end of input.”
else
raise “#{message}: #{@input.peek(@input.string.length)}”
end
end
end