Parsing JSON (#155)

I liked this quiz because it made me look into Treetop, Ragel and some
other libraries I had wanted to examine more closely for quite a while
now. Anyway, for my (official) solution I took the easy road and relied
on Ruby to do the actual work.

My solution is Ruby 1.9 only, since I used the opportunity to explore
some of the new regexp features. Since it uses eval(), there is a
possibility of Ruby code injection, like the ultimately relieving
"#{sudo rm -rf /}". I think my solution catches such attacks though.
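
To illustrate (a sketch only; the expression inside the braces is a harmless stand-in), escaping '#' before the transformed text reaches eval() is what defuses the interpolation:

# Illustrative sketch -- the gsub is the same substitution the parser
# below applies to string tokens before eval().
input   = '"#{Time.now}"'              # a JSON string smuggling interpolation
escaped = input.gsub(/#/, '\\\\#')     # => "\#{Time.now}"
p eval(escaped)                        # prints "\#{Time.now}" -- literal text, no code ran
# p eval(input)                        # would actually evaluate Time.now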

BTW, what do you all think the canonical output of the following JSON
snippet should be:

json1 = <<JSON
{"a":2,"b":3.141,"TIME":"2007-03-14T11:52:40","c":"c","d":[1,"b",3.14],
"COUNT":666,"e":{"foo":"bar"},"foo":"B\u00e4r","g":"\u677e\u672c\u884c\u5f18",
"h":1000.0,"bar":"\u00a9 \u2260 \u20ac!","i":0.001,"j":"\ud840\udc01"}
JSON

I get conflicting results between various versions of my solution and
the official Ruby 1.9 parser with respect to these Unicode escapes. This
snippet is taken (IIRC) from the ruby-json parser.
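
For comparison, here is a minimal sketch (the helper name is mine, not from any posted solution) of decoding those escapes with pack('U'), including the surrogate pair at the end of the snippet; this is exactly where implementations seem to diverge:

# Combine surrogate pairs first (\ud840\udc01 => U+20001), then decode
# plain BMP escapes. Assumes Ruby 1.9 string semantics.
def decode_unicode_escapes(str)
  str = str.gsub(/\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})/i) do
    hi, lo = $1.hex, $2.hex
    [0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)].pack('U')
  end
  str.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
end

p decode_unicode_escapes('B\u00e4r')               # => "Bär"
p decode_unicode_escapes('\u00a9 \u2260 \u20ac!')  # => "© ≠ €!"
p decode_unicode_escapes('\ud840\udc01')           # => the single character U+20001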

Regards,
Thomas.

#!/usr/bin/env ruby19
# Author::  Thomas Link (micathom AT gmail com)
# Created:: 2008-02-01.
#
# The string (in JSON format) is tokenized and pre-validated. Minor
# replacements are made in order to transform the JSON into valid ruby
# input. The transformed string is then evaluated by ruby, which will
# throw an exception on syntactic errors.
#
# PROBLEMS:
# - The "parser" doesn't per se detect something like {"foo": 1,} or
#   [1,2,] since this is valid in ruby. I'm not sure about JSON. Anyway, I
#   included another "invalid" clause in order to catch these cases, since
#   I'm not sure how they should be handled. If you want the parser to be
#   more permissive, remove the first "invalid" clause.
#
# REFERENCES:
# http://json.org
# http://www.ietf.org/rfc/rfc4627.txt

class JSONParser

RXE = /
    \[|\]|
    \{|\}|
    (?<name_sep>:)|
    (?<invalid>,\s*[}\]])|
    ,|
    (?<string>"([^"\\]++|\\(u[0-9a-fA-F]{4}|[bfnrt"\/\\]))*")|
    -?(0|[1-9]\d*+)(\.\d++)?([Ee][+-]?\d++)?(?=\D|$)|
    true|
    false|
    (?<null>null)|
    [[:space:][:cntrl:]]++|
    (?<invalid>.++)
    /xmu

def parse(json)
    ruby = json.gsub(RXE) do |t|
        m = $~
        if m['invalid']     then invalid(m['invalid'])
        elsif m['null']     then 'nil'
        elsif m['name_sep'] then '=>'
        elsif m['string']   then m['string'].gsub(/#/, '\\\\#')
        else
            t
        end
    end
    begin
        return eval(ruby)
    rescue Exception => e
        invalid(json)
    end
end

def invalid(string)
    raise RuntimeError, 'Invalid JSON: %s' % string
end

end

if __FILE__ == $0
a = ARGV.join
p a
p JSONParser.new.parse(a)
end
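
A quick way to exercise it from irb (the file name is the one used in the patch further down; the sample values are arbitrary):

require_relative 'quiz155i'

p JSONParser.new.parse('{"a": [1, 2.5, null], "b": "c"}')
# => {"a"=>[1, 2.5, nil], "b"=>"c"}

JSONParser.new.parse('[1, 2,]')
# raises RuntimeError -- the "invalid" clause catches the trailing comma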

On Feb 1, 2008 11:30 AM, ara howard [email protected] wrote:

On Feb 1, 2008, at 9:09 AM, Eric M. wrote:

Maybe just have a little ruby
script that generates a stream of repeatable random (but valid) JSON.

cfp2:~ > cat a.rb
require 'rubygems'
require 'json'

The benchmark below has no gem dependencies, generates tests with deep
structures, and is self-checking. It found a bug that the original unit
tests didn't. By default, the benchmark keeps increasing the depth until
the runtime is >= 1s.

Also, I made some wrappers for the json and fjson gems to see how they
perform. Here are the results on my machine:

fjson: 224823 chars/second
json(pure): 220289 chars/second
json(ext): 1522250 chars/second

require 'benchmark'

class RandomJSON
  def initialize(value=0)
    @number = -1
    @string = -1
    @boolean = -1
    @constant = -1
    @value = value-1
  end
  def number
    case (@number=(@number+1)%3)
    when 0 : 0
    when 1 : 1234
    when 2 : 3.75e+1
    end
  end
  def string
    case (@string=(@string+1)%3)
    when 0 : ""
    when 1 : "JSON"
    when 2 : "\"\\/\b\f\r\t"
    end
  end
  def boolean
    case (@boolean=(@boolean+1)%3)
    when 0 : false
    when 1 : true
    when 2 : nil
    end
  end
  def constant
    case (@constant=(@constant+1)%3)
    when 0 : number
    when 1 : string
    when 2 : boolean
    end
  end
  def array(depth)
    a = []
    depth.times {
      a << value(depth-1)
    }
    a
  end
  def object(depth)
    o = {}
    depth.times {
      o[string] = value(depth-1)
    }
    o
  end
  def value(depth, v=nil)
    case (v||(@value=(@value+1)%3))
    when 0 : array(depth)
    when 1 : object(depth)
    else constant
    end
  end
end

generator = RandomJSON.new((ARGV[1]||0).to_i)

parser = JSONParser.new
Benchmark.bm { |b|
  l = nil; t = nil
  13.times { |depth|
    tree = generator.value(depth, depth%2)
    s = tree.inspect
    #puts s
    s.gsub!(/=>/, ':')
    s.gsub!(/nil/, 'null')
    tree2 = nil
    #puts s
    l = s.length
    t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
    raise if tree2!=tree
    break if (t.real>=(ARGV[0]||1).to_f)
  }
  puts "#{(l/t.real).to_i} chars/second"
}
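
(For anyone running it: as far as I can tell from the code, ARGV[0] is the per-depth time limit in seconds, defaulting to 1, and ARGV[1] seeds the generator's value sequence, defaulting to 0.)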

Here is a patch to make my solution check the return value for
conformance with the RFC:
--- quiz155ir0.rb	2008-02-03 18:07:19.186889600 +0100
+++ quiz155i.rb	2008-02-03 18:08:58.008988800 +0100
@@ -48,10 +48,16 @@
         end
     end
     begin
-        return eval(ruby)
+        value = eval(ruby)
+        case value
+        when Array
+            return value
+        when Hash
+            return value if value.all? {|k, v| k.kind_of?(String)}
+        end
     rescue Exception => e
         invalid(json)
     end
+    invalid(json)
 end

 def invalid(string)
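
If I read the patch right, the practical effect is that bare scalars are rejected at the top level, in line with RFC 4627:

parser = JSONParser.new
parser.parse('[1, 2, 3]')    # => [1, 2, 3]
parser.parse('{"a": 1}')     # => {"a"=>1}
parser.parse('42')           # now raises RuntimeError: not an array or object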

On Feb 1, 2008 7:55 AM, Ruby Q. [email protected] wrote:

   http://json.org/

This doc is missing two things: 1) exactly what is allowed as the
top-level JSON value, and 2) where whitespace can appear. This is more
complete:

http://www.ietf.org/rfc/rfc4627.txt

     def test_keyword_parsing
       assert_equal(true,  @parser.parse("true"))
       assert_equal(false, @parser.parse("false"))
       assert_equal(nil,   @parser.parse("null"))
     end

     def test_number_parsing
       assert_equal(42,      @parser.parse("42"))

     end

     def test_string_parsing
       assert_equal(String.new,          @parser.parse(%Q{""}))

     end

The above isn't legal JSON. An array or object (hash) should be at the
top level. Surround these with brackets and they will be legal.
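
For example, the keyword and scalar assertions above become strictly legal once wrapped (same @parser fixture as in the quiz tests):

     assert_equal([true],  @parser.parse("[true]"))
     assert_equal([false], @parser.parse("[false]"))
     assert_equal([nil],   @parser.parse("[null]"))
     assert_equal([42],    @parser.parse("[42]"))
     assert_equal([""],    @parser.parse(%Q{[""]}))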

And here's my solution. It passes all the given tests, but there are
still some tricky cases that aren't handled well (see the code
comments). I used regular expressions extensively, but that may not be
the best idea in terms of performance.

Thanks for the quiz!!

#!/usr/bin/env ruby
# Solution to Ruby Q. #155 (see Ruby Quiz - Parsing JSON (#155))
# by Paweł Radecki ([email protected]).

$KCODE = 'UTF-8'
require 'jcode'

class JSONParser

  def parse(input)
    case input
    # TODO: in every case we need to check that the pattern matches the
    # input thoroughly and nothing is left; ex. "[3, 5] Pablos" is still
    # not handled well, it passes through instead of raising an exception

    when '' : raise RuntimeError

    # TODO: There needs to be some smart way of choosing whether we found
    # an object or an array; for now an object has priority and it may be
    # found instead of an array

    # object
    when /\{(".+"):(.+)\s*(,\s*(".+"):(.+))+\}/ then
      h = Hash.new
      $&[1...-1].split(/(.*:.*)?\s*,\s*(.*:.*)?/).each do |e|
        a = e.split(/:\s*(\{.*\}\s*)?/)
        h[parse(a.first)] = parse(a.last) unless (a.first.nil? && a.last.nil?)
      end
      h
    when /{\s*(".+")\s*:\s*(.+)\s*}/ then { parse($1) => parse($2) }
    when /{\s*}/ : Hash.new

    # array
    when /\[.+\]/ then
      $&[1...-1].split(/(\[.*\])?\s*,\s*(\[.*\])?/).collect { |e| parse(e) }
    when /\[\s*\]/ then []

    # constants
    when /true/ then
      if ($`.strip.empty? && $'.strip.empty?) then true else raise RuntimeError end
    when /false/ then
      if ($`.strip.empty? && $'.strip.empty?) then false else raise RuntimeError end
    when /null/ then
      if ($`.strip.empty? && $'.strip.empty?) then nil else raise RuntimeError end

    # string
    when /"([A-Za-z]|(\s)|(\\")|(\\\\)|(\\\/)|(\\b)|(\\f)|(\\n)|(\\r)|(\\t)|(\\u[0-9a-fA-F]{4,4}))+"/ then
      $&[1...-1].gsub(/\\"/, '"').gsub(/\\n/, "\n").
        gsub(/\\u([0-9a-fA-F]{4,4})/u) { ["#$1".hex].pack('U*') }
    when /""/ then ""

    # number
    when /-?(0|([1-9][0-9]*))(\.[0-9]+)?([e|E][+|-]?[0-9]+)?/ then
      if ($`.strip.empty? && $'.strip.empty?) then eval($&) else raise RuntimeError end
    else
      raise RuntimeError
    end
  end
end

#puts JSONParser.new.parse(ARGV.first)

On Feb 3, 2008 8:14 AM, steve [email protected] wrote:

File.open("json.treetop", "w") {|f| f.write GRAMMAR }

BEGIN {

GRAMMAR = %q{

grammar Json

Steve,

I tried this parser, but couldn't get it to work with the String
interface of the quiz testbench. Using an IO/stream is how a parser
should work, but you should also be able to wrap a String in a StringIO.
It complained about StringIO#index not being defined, but IO#index isn't
defined either. Couldn't find any place where IO#index was
monkey-patched.

Eric

On Feb 1, 2008 7:55 AM, Ruby Q. [email protected] wrote:

In honor of that, this week’s Ruby Q. is to write a parser for JSON.

My first solution uses a parser generator API similar to my rubyforge
"grammar" package. The parser generator is only 89 lines (excluding
comment/blank lines). The API is fairly complete for building a variety
of parsers. The same API can be used to build a parser, lexer,
preprocessor, etc. (and you could multi-thread them). The reason this is
so simple is that I'm using Ruby as a DSL to specify the language
grammar. The file specifying the JSONParser class is only 58 lines.

Here is the simple Grammar0 parsing DSL class and the JSONParser class
using
it:

http://pastie.caboo.se/147074

http://pastie.caboo.se/147075

On Feb 3, 2008 6:25 PM, Clifford H. [email protected] wrote:

neither does mine (emailed to JEG before the 24 hours - I’ll post
it soon). The adapter is needed because the parse() method expected
is different from the one that Treetop’s parsers provide.

I was expecting this to work as a proxy to the Treetop parser:

class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(s)
    @parser.parse(StringIO.new(s)).value
  end
end

Eric M. wrote:

I tried this parser, but couldn't get it to work with the String
interface of the quiz testbench. Using an IO/stream is how a parser should work,

Not a packrat parser. If you have memory to memoize each grammar rule
attempted at each byte-offset of the input, you have memory to have all
the input in memory at once, and it’s more efficient to do so.

Obviously a simple adapter can fix that, but steve’s doesn’t, and
neither does mine (emailed to JEG before the 24 hours - I’ll post
it soon). The adapter is needed because the parse() method expected
is different from the one that Treetop’s parsers provide.

Clifford H…

Eric M. wrote:

I was expecting this to work as a proxy to the Treetop parser:

class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(s)
    @parser.parse(StringIO.new(s)).value
  end
end

You need to pass the string directly, not as a StringIO.
Treetop parses a String, not an IO.
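
So a minimal working proxy (assuming the root node answers .value, which steve's grammar provides) would simply be:

class JSONParser
  def initialize
    @parser = JsonParser.new
  end
  def parse(s)
    @parser.parse(s).value   # pass the String itself; Treetop does not take an IO
  end
end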

On Feb 3, 2008 6:39 PM, Clifford H. [email protected] wrote:

end

You need to pass the string directly, not as a StringIO.
Treetop parses a String, not an IO.

I guess my eyes missed the ".read" after the STDIN. The "STDIN" jumped
out and I thought it took an IO-like object. I was able to just remove a
couple of lines of Steve's code and rename the class to get it to work
with the tests (unit and random performance benchmark). It mismatches in
both the unit tests and the performance benchmark, but if I removed the
self-checking I was able to run the performance benchmark. It was only
about 6800 chars/second, which is near the bottom of the heap. Not sure
how valid that is since the parser has problems, though.

Eric

On Feb 1, 2008 7:55 AM, Ruby Q. [email protected] wrote:

In honor of that, this week’s Ruby Q. is to write a parser for JSON.

Here is another solution that uses my (somewhat) optimized 'grammar'
package directly. The JSON grammar is similar to before. The next
release will use an API closer to the simple Grammar0 package.

http://pastie.caboo.se/147078

Also, I ported this to my development code (which is not checked in to
CVS yet) to see the performance. I grabbed all of the current
submissions along with the fjson and json gems to see how they compare
in terms of performance. I used the previous benchmark that I posted. It
also revealed bugs in these submissions. Here is the performance I found
on my machine with ruby 1.8.6:

   ch/s  author/gem
      -  oksteev (Treetop, couldn't get it to parse a string)
      -  Pawel R. (RE, mismatch)
      -  Justin E. (RE lexer + parser, 71: Invalid number 0)
   4054  Eric M. (Grammar0, no lexer, no parser generation)
  54586  Eric M. (Grammar, no lexer, v0.5)
 166041  Thomas Link (RE, ruby 1.9 results)
 220289  json
 223486  Eric M. (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 553081  Eric M. (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)

Note that the Grammar variants don't have the advantage of using RegExp,
where you get some C performance. But, in my dev code, I'm using
ruby2cext to get a little more performance. You could integrate a RegExp
lexer with a Grammar parser, also.

Here’s mine, done with Treetop. It also includes a
Readline-based interpretive checker. The generated
parser from Treetop has a slightly different interface,
so I’ve included JEG’s test program with an adapter at
the top.

The nicest thing about using Treetop is how close the
grammar is to the JSON spec :-).

I prefer to convert hash keys to symbols, but the test
cases don’t allow that so I stripped out my .to_sym’s.

Note that the test cases are rather limited in things
like white-space handling (and in fact the JSON spec is
actually incorrect, in that it doesn’t define which rules
constitute tokens that may be separated by whitespace!)
Whitespace in Treetop must be handled explicitly, and it’s
easy to miss a spot where it should be skipped, so the
tests should cover that.
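
A couple of whitespace assertions along those lines (my own sketch, not part of the quiz suite) might look like:

def test_whitespace_handling
  assert_equal([1, 2],     @parser.parse("[ 1 ,\n\t2 ]"))
  assert_equal({"a" => 1}, @parser.parse(%Q{ { "a" :\n 1 } }))
end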

I welched on full Unicode support as commented in my code,
but there's another test case you should apply, to parse
the string "\u1234", which should throw an exception.
You'll see that my code is missing that exception, and
will misbehave instead :-).

It wasn’t clear from the quiz or the JSON spec whether an
integer is valid JSON. I elected to accept any value, not
just an object or array.

Treetop now uses Polyglot, which loads the generated .rb
file if you’ve generated it, or the .treetop file if not.

Clifford H…

First, the interactive test program:

require 'treetop'
require 'json'      # Note that we can require the Treetop file directly.
require 'readline'

parser = JsonParser.new
while line = Readline::readline("? ", [])
  begin
    tree = parser.parse(line)
    if tree
      p tree.obj
    else
      puts parser.failure_reason
    end
  rescue => e
    puts e
    p e.backtrace
    p tree if tree
  end
end
puts

Now, my test adapter:

class JSONParser
  def parse(text)
    parser = JsonParser.new
    p = parser.parse(text)
    raise parser.failure_reason unless p
    p.obj
  end
end

Finally, the grammar itself:

# Treetop grammar for JSON for Ruby Q. #155 by Clifford H.

grammar Json
  rule json
    value
  end

  rule object
    '{' s pairs:pairs? s '}' s
    { def obj
        pairs.empty? ? {} : pairs.obj
      end
    }
  end

  rule pairs
    member rest:(s ',' s member)*
    { def obj
        rest.elements.inject({eval(member.k.text_value) => member.value.obj}) { |h, e|
          h[eval(e.member.k.text_value)] = e.member.value.obj
          h
        }
      end
    }
  end

  rule member   # key/value pair of an object
    k:string s ':' s value
  end

  rule array
    '[' s e:elements? s ']'
    { def obj
        e.empty? ? [] : e.obj
      end
    }
  end

  rule elements # elements of an array
    value rest:(s ',' s value)*
    { def obj
        rest.elements.inject([value.obj]) { |a, e|
          a << e.value.obj
        }
      end
    }
  end

  rule value
    s alt:(string / number / object / array
      / 'true'  { def obj; true; end }
      / 'false' { def obj; false; end }
      / 'null'  { def obj; nil; end }
      )
    { def obj; alt.obj; end }
  end

  rule string
    '"' char* '"'
    { def obj
        eval(
          # Strip Unicode characters down to the chr equivalent.
          # Note that I'm cheating here: '"\u4321"' should assert,
          # and there are cases that will succeed but corrupt the data.
          # This should be handled in the "char" rule.
          text_value.gsub(/\\u..../) { |unicode|
            eval("0x"+unicode[2..-1]).chr
          }
        )
      end
    }
  end

  rule char
    '\\' ["\\/bfnrt]
    / '\\u' hex hex hex hex
    / (![\"] .)
  end

  rule hex
    [0-9A-Fa-f]
  end

  rule number
    int frac? exp? { def obj; eval(text_value); end }
  end

  rule int      # Any integer
    '-'? ([1-9] [0-9]* / '0')
    { def obj; eval(text_value); end }
  end

  rule frac     # The fractional part of a floating-point number
    '.' [0-9]+
  end

  rule exp      # An exponent
    [eE] [-+]? [0-9]+
  end

  rule s        # Any amount of whitespace
    [ \t\n\r]*
  end
end

Eric M. wrote:

It was only about 6800
chars/second which is near the bottom of the heap. Not sure how valid that
is since the parser has problems though.

I'm not surprised. Treetop has been designed to be super-clean and pure,
not fast. It'll get faster. Myself, I think it should have real Regexes
at the leaves, and that'd make a heap of difference.

Also, MRI is glacially slow at creating objects, which Treetop does a
lot of (in this case, for every char in a string, for example). If it
used Structs, or if it used Rubinius, things would be a lot quicker.

One other addition we're considering is to add skip rules, which don't
build nodes, for things like white-space and punctuation, which would
basically double the speed.

Consider also that the backtracking allows you to handle grammars that
are simply impossible with traditional techniques… like I'm doing with
CQL in ActiveFacts.

Clifford H…

Not sure how valid that is since the parser has problems though.

Hey Eric,

What do you mean it mismatches? Also, what do you mean by self-checking?
It passes all the unit tests for me.

I just found out why you probably had problems with my solution on the
benchmark. You need to call .value on the object returned by the parser
to get the actual parsed JSON value.

t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
-->
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s).value }

- steve

Here is my solution, which, like steve’s, uses Treetop. This is my
first attempt at using Treetop, so I’m not certain how good the
grammar file that I created is. steve’s grammar file is smaller than
mine is.

When I benchmarked my code, I got a parsing performance of around 10,000
chars/second, much lower than many of the other solutions.

The two components of the solution are below: a small Ruby class (that
acts as an interface between James's unit testing and what Treetop
produces) and a Treetop grammar file.

Eric

====

Here’s the small Ruby class:

# A solution to RubyQuiz #155.
#
# Takes a JSON string and parses it into an equivalent Ruby value.
#
# See Ruby Quiz - Parsing JSON (#155) for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-155.shtml .

require 'rubygems'
require 'treetop'

Treetop.load 'json'

class JSONParser
  def initialize
    @parser = JSONHelperParser.new
  end

  def parse(input)
    result = @parser.parse(input)
    raise "could not parse" if result.nil?
    result.resolve
  end
end
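
For what it's worth, this is how I'd expect the wrapper to behave, assuming the json.treetop grammar below has been loaded (the sample input is mine):

parser = JSONParser.new
p parser.parse('{"nums": [1, 2.5e3], "ok": true}')
# => {"nums"=>[1, 2500.0], "ok"=>true}

parser.parse('not json')   # raises "could not parse"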

====

And here’s the Treetop grammar:

# Treetop grammar for JSON. Note: it doesn't handle Unicode very well.

grammar JSONHelper
  rule value
    spaces v:(simple_literal / string / number / array / object) spaces {
      def resolve
        v.resolve
      end
    }
  end

  rule spaces
    [ \t\n\r]*
  end

  # SIMPLE LITERALS

  rule simple_literal
    ("true" / "false" / "null") {
      def resolve
        case text_value
        when "true"  : true
        when "false" : false
        when "null"  : nil
        end
      end
    }
  end

  # NUMBERS

  rule number
    integer fractional:fractional? exponent:exponent? {
      def resolve
        if fractional.text_value.empty? && exponent.text_value.empty?
          integer.text_value.to_i
        else
          text_value.to_f
        end
      end
    }
  end

  rule integer
    "-"? [1-9] digits
    /
    # single digit
    "-"? [0-9]
  end

  rule fractional
    "." digits
  end

  rule exponent
    [eE] [-+]? digits
  end

  rule digits
    [0-9]*
  end

  # STRINGS

  rule string
    '""' {
      def resolve
        ""
      end
    }
    /
    '"' characters:character* '"' {
      def resolve
        characters.elements.map { |c| c.resolve }.join
      end
    }
  end

  rule character
    # regular characters
    (!'"' !'\\' .)+ {
      def resolve
        text_value
      end
    }
    /
    # escaped: \, ", and /
    '\\' char:('"' / '\\' / '/') {
      def resolve
        char.text_value
      end
    }
    /
    # escaped: \b, \f, \n, \r, and \t
    '\\' char:[bfnrt] {
      def resolve
        case char.text_value
        when 'b' : "\b"
        when 'f' : "\f"
        when 'n' : "\n"
        when 'r' : "\r"
        when 't' : "\t"
        end
      end
    }
    /
    # for Unicode that overlays ASCII values, we just use the ASCII value
    '\\u00' digits:(hex_digit hex_digit) {
      def resolve
        str = " "
        str[0] = digits.text_value.hex
        str
      end
    }
    /
    # for all other Unicode values use the null character
    '\\u' digits1:(hex_digit hex_digit) digits2:(hex_digit hex_digit) {
      def resolve
        str = "  "
        str[0] = digits1.text_value.hex
        str[1] = digits2.text_value.hex
        str
      end
    }
  end

  rule hex_digit
    [0-9a-fA-F]
  end

  # ARRAYS

  rule array
    "[" spaces "]" {
      def resolve
        Array.new
      end
    }
    /
    "[" spaces value_list spaces "]" {
      def resolve
        value_list.resolve
      end
    }
  end

  rule value_list
    value !(spaces ",") {
      def resolve
        [ value.resolve ]
      end
    }
    /
    value spaces "," spaces value_list {
      def resolve
        value_list.resolve.unshift(value.resolve)
      end
    }
  end

  # OBJECTS

  rule object
    "{" spaces "}" {
      def resolve
        Hash.new
      end
    }
    /
    "{" spaces pair_list spaces "}" {
      def resolve
        pair_list.resolve
      end
    }
  end

  rule pair_list
    pair !(spaces ",") {
      def resolve
        { pair.resolve[0] => pair.resolve[1] }
      end
    }
    /
    pair spaces "," spaces pair_list {
      def resolve
        hash = pair_list.resolve
        hash[pair.resolve[0]] = pair.resolve[1]
        hash
      end
    }
  end

  rule pair
    string spaces ":" spaces value {
      def resolve
        [ string.resolve, value.resolve ]
      end
    }
  end
end

On Feb 3, 2008 9:45 PM, steve [email protected] wrote:

You need to call .value on the object returned by the parser to get the
actual parsed JSON value.

t = b.report("#{depth} #{l}") { tree2 = parser.parse(s) }
-->
t = b.report("#{depth} #{l}") { tree2 = parser.parse(s).value }

- steve

Thanks Steve,

That was it. I didn't make a wrapper class like I did for the other
Treetop parsers. Your parser works without problems now.

Also, I grabbed Eric I's and Clifford H.'s Treetop parsers and
benchmarked them. Here's what I found:

   ch/s  author/gem
      -  Pawel R. (RE, mismatch)
   3214  Justin E. (RE lexer + parser, fixed number parsing)
   4054  Eric M. (Grammar0, no lexer, no parser generation)
   4078  Eric I (Treetop, unicode broken)
   6534  oksteev (Treetop, mismatches in benchmark)
   8313  Clifford H. (Treetop, had to remove handling of "/")
  54586  Eric M. (Grammar, no lexer, v0.5)
 166041  Thomas Link (RE, ruby 1.9 results)
 220289  json
 223486  Eric M. (Grammar, no lexer, unreleased)
 224823  fjson (uses C extensions)
 553081  Eric M. (Grammar, no lexer, unreleased, w/ ruby2cext)
1522250  json (w/ C extensions)

Here are some extra unit tests I wrote. Let me know if you think any of
them are incorrect (per the spec):

def test_more_numbers
  assert_equal(5, @parser.parse("5"))
  assert_equal(-5, @parser.parse("-5"))
  assert_equal 45.33, @parser.parse("45.33")
  assert_equal 0.33, @parser.parse("0.33")
  assert_equal 0.0, @parser.parse("0.0")
  assert_equal 0, @parser.parse("0")
  assert_raises(RuntimeError) { @parser.parse("-5.-4") }
  assert_raises(RuntimeError) { @parser.parse("01234") }
  assert_equal(0.2e1, @parser.parse("0.2E1"))
  assert_equal(42e10, @parser.parse("42E10"))
end

def test_more_string
  assert_equal("abc\befg", @parser.parse(%Q{"abc\\befg"}))
  assert_equal("abc\nefg", @parser.parse(%Q{"abc\\nefg"}))
  assert_equal("abc\refg", @parser.parse(%Q{"abc\\refg"}))
  assert_equal("abc\fefg", @parser.parse(%Q{"abc\\fefg"}))
  assert_equal("abc\tefg", @parser.parse(%Q{"abc\\tefg"}))
  assert_equal("abc\\efg", @parser.parse(%Q{"abc\\\\efg"}))
  assert_equal("abc/efg",  @parser.parse(%Q{"abc\\/efg"}))
end

def test_more_object_parsing
  assert_equal({'a'=>2,'b'=>4}, @parser.parse(%Q{{ "a" : 2 , "b":4 }}))
  assert_raises(RuntimeError) { @parser.parse(%Q{{ "a" : 2, }}) }
  assert_raises(RuntimeError) { @parser.parse(%Q{[ "a" , 2, ]}) }
end

Eric M. wrote:

I’m biased,

First up, it sounds like your package is a significant achievement,
and I don’t want anything here to imply I’m not commending you on
it. But different approaches suit different goals…

but I don’t think Treetop is super-clean and pure. Take a look
at my rubyforge grammar package if you want something really clean and pure:

:-) To me, using a separate grammar is more pure than using a Ruby DSL.
I can see how you would argue the opposite, however…
Your example parsers are much harder to read than Treetop, though; compare:

space = Grammar::Element[Set[?\ , ?\t].duck!(:==, :include?)].discard

with

skip space
  [ \t]
end

(ok, so TT doesn't have a skip keyword yet, soon…)
That's what I mean by clean and pure anyhow. Not pure Ruby,
but pure grammar.

Nathan’s concept is that “grammar” should become a Ruby keyword,
and should introduce the grammar meta-grammar, rather than using
existing Ruby syntax. I think that polyglot approach is much better
than the DSL approach, since the Ruby grammar isn’t well-suited to
many important languages.

The point is, if/when Ruby uses a parser whose grammar is composable
with a DSL grammar you define, the result is truly seamless and
the syntax is unrestrained. Existing DSL syntax imposes too many
restraints. A language with a composable grammar would be truly
excellent!

  • specify language grammars in a BNF-like ruby DSL (no new grammar language
    to deal with!)

I think I’ve argued why that is a bad thing :-), though it definitely
has good things about it also - like your filters etc…

Ruby DSL is far too awkward to express CQL - which is my primary
project at present.

  • integration of lexer/parser with other code is seamless since everything
    is ruby

Nice, but it won’t help me when I need to target C# or Java.
Most important grammars need to be compilable to more than
one target language. Treetop’s not there yet, but it’s easy
to see how to do it, whereas it won’t ever happen with yours
AFAICS.

  • (sub)grammars are objects and you combine them using operators and
    methods

Can’t get much easier than just “include otherGrammar”, which
works in Treetop (mix the generated modules together).

  • complete API can be described in only a little more than a hundred lines

Similarly to Treetop, AFAICS.

  • infinite backtracking (not by default because of performance and
    functionality, you specify what needs it)

Does this memoize, or give exponential performance? Or is it optional?

  • can make any kind of pipeline of lexers/parsers and multi-thread them.
  • even with its pure ruby/no-regexp lexers/parsers, it can go head-to-head
    against Regexp-based stuff. The flexibility of Regexp is limited, so I
    don’t see the point since ‘grammar’ gets enough performance.
  • don’t need to read the file into a string to parse it
  • and more

Very cool stuff.

With ‘grammar’ you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything.

Yes, this is excellent, and something I’ve always wanted from a parser
generator. You should add your json parsers to the examples in the gem!

Clifford H…

On Feb 3, 2008 7:44 PM, Clifford H. [email protected] wrote:

Eric M. wrote:

It was only about 6800
chars/second which is near the bottom of the heap. Not sure how valid
that
is since the parser has problems though.

I’m not surprised. Treetop has been designed to be super-clean and pure,
not fast. It’ll get faster. Myself, I think it should have real Regex’s at
the leaves, and that’d make a heap of difference.

I’m biased, but I don’t think Treetop is super-clean and pure. Take a
look
at my rubyforge grammar package if you want something really clean and
pure:

  • specify language grammars in a BNF-like ruby DSL (no new grammar
    language to deal with!)
  • integration of lexer/parser with other code is seamless since
    everything is ruby
  • (sub)grammars are objects and you combine them using operators and
    methods
  • complete API can be described in only a little more than a hundred
    lines of code (but the implementation is a lot more since it has lots
    of optimizations)
  • heavily duck-typed such that a lexer or parser (or preprocessor, or
    parser w/o lexer) can be specified by the same DSL
  • infinite backtracking (not by default because of performance and
    functionality, you specify what needs it)
  • can make any kind of pipeline of lexers/parsers and multi-thread them.
  • even with its pure ruby/no-regexp lexers/parsers, it can go
    head-to-head against Regexp-based stuff. The flexibility of Regexp is
    limited, so I don't see the point since 'grammar' gets enough
    performance.
  • don't need to read the file into a string to parse it
  • and more

Also, MRI is glacially slow at creating objects, which Treetop does a
lot

of (in this case, for every char in a string for example). If it used
Structs,
or if it used Rubinius, things would be a lot quicker.

One other addition we’re considering is to add skip rules, which don’t
build
nodes, for things like white-space and punctuation, which would basically
double the speed.

With 'grammar' you have complete control of the parsing result. Without
actions, the output stream is a copy of the input stream. You can group
things and discard things efficiently. One extreme would be to interpret
the input on the fly and not build anything. I have an example tcl
interpreter done in a couple hundred lines of code using 'grammar' that
produces no AST - it just interprets while parsing.