Benchmarking the pure-ruby parser

jasongarber · June 11, 2009, 4:57pm

Benchmarking the pure-ruby version of the parser, which is a new
option in 4.2…

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.2.0 compiled in Ruby…
Finished in 30.703538 seconds

30 seconds is a long time compared to…

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 3.0.4 compiled in ruby-regex…
Finished in 0.346966 seconds

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.0.0 compiled in C…
Finished in 0.055147 seconds

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.1.1 compiled in C…
Finished in 0.05756 seconds

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.1.9 compiled in C…
Finished in 0.140867 seconds

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.2.0 compiled in C…
Finished in 0.13385 seconds

Uh, yeah. I knew it was slow, but until now I had no idea it was
that slow compared to 3.0.4. The tests in 3.0.4 took a long time to
run, so I just figured the two all-ruby versions would be about the
same. Now that I think about it, it does make sense that even lots of
Regexes would be faster than doing all the looping, character
comparison, and concatenation in Ruby.

Also note that RedCloth has been getting slower with each version,
except this last one where I tried hard to reduce complexity in some
areas. Fixing bugs usually means making the machine more specific,
which means more complexity. The parser binary has more than doubled
in size since 4.0.0.

Let’s see about Ruby 1.8 vs 1.9…

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 3.0.4 compiled in ruby-regex…
Finished in 0.320067 seconds

~/Documents/redcloth(master) $ spec19 spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 3.0.4 compiled in ruby-regex…
Finished in 0.566467 seconds

So 3.0.4 is slower in Ruby 1.9. Interesting. Is 4.2.0?

~/Documents/redcloth(master) $ spec spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.2.0 compiled in C…
Finished in 0.148924 seconds

~/Documents/redcloth(master) $ spec19 spec/benchmark_spec.rb -O spec/
spec.opts
Benchmarking version 4.2.0 compiled in C…
Finished in 0.107696 seconds

Thankfully, no.

My conclusion is, I need to do something different with RedCloth in
the long term if I want it available as a pure-Ruby library and if I
don’t want the size and parse time to keep ballooning.

Jason

jasongarber · June 11, 2009, 6:44pm

Thanks for the very interesting comparaisons Jason. Pure ruby could
definitely be an option.

As I said, I really need a good, ruby extendable textile parser for
zena in the long run so I’d be glad to help. I have implemented
several pure ruby (regex based) parsers (parser for
http://zenadmin.org/zafu is one) and other ragel based (json/scripts
for http://rubyk.org for example).

Regarding 3.0.4: I see most regexp are not left anchored (starting
with /\A…/). I think we could get a nice speedup (and a more robust
parser) if we used some kind of state machine instead of raw split and
gsub. The zafu parser for example uses contextual regexps and eats
input data as it matches. This means that input data is only parsed a
minimal amount of times.

Gaspard

jasongarber · June 11, 2009, 7:57pm

On Jun 11, 2009, at 1:24 PM, Hugh S. wrote:

Is the ruby code to do this in the RedCloth-4.2.0 distro? I can’t see
any parsers in there (grep -lir parser .) except the C code.

You have to get the source code yourself and build it with rake
pureruby compile. I didn’t include it in any package/distro because I
want to discourage its use (it is 100x slower, after all!).

jasongarber · June 11, 2009, 8:00pm

Gaspard, thanks for offering to help! I’ll surely take you up on it—
seems like you have some very helpful experience. Want to do a little
more research on it myself first, though. Then maybe you and I can
get together and work out some prototypes.

Jason

jasongarber · June 11, 2009, 7:25pm

Is the ruby code to do this in the RedCloth-4.2.0 distro? I can’t see
any parsers in there (grep -lir parser .) except the C code.

I’m wondering, much of the work of a parser will be looking for
fixed tokens, won’t it? I know for numbers and arbitrary strings
this isn’t true… I ask because one of the few weaknesses in Ruby
is that there is an over-emphasis on Regexps for processing strings
in the language, and the only way I know to test whether some string
(str) begins with a substring (sought), other than regexps, is
str.index(sought).zero? . Have I missed something obvious? It seems
to me that constructing something as complex as a regexp for such an
operation would be expensive. Even str.scan which will accept a string
rather than a regexp seems to treat such a string as a regexp, from a
quick look at the c code.

    Hugh

jasongarber · June 11, 2009, 8:26pm

For benchmarking, I think it could make sense to have some
representative data of a typical application. This might mean large
amounts of pure text without any textile tags (except paragraphs), or
it might be a mixture of very short texts with a few very long ones. I
could give you the data from http://zenadmin.org, but the links and
images are not very typical:

“”:45 (internal link to node 45)

Want to do a little more research on it myself first, though.

Of course ! Please continue to give us feedback. This kind of
experimenting can provide very useful information on the different
parsing solutions.

G.

jasongarber · June 11, 2009, 8:27pm

On Thu, 11 Jun 2009, Jason G. wrote:

On Jun 11, 2009, at 1:24 PM, Hugh S. wrote:

Is the ruby code to do this in the RedCloth-4.2.0 distro? I can’t see
any parsers in there (grep -lir parser .) except the C code.

You have to get the source code yourself and build it with rake pureruby
compile. I didn’t include it in any package/distro because I want to
discourage its use (it is 100x slower, after all!).

Ok, just wondered if I could see any speedups in a few mins. I’ll leave
this for now. I will have a bit more time in a few weeks I think.

   Hugh

jasongarber · June 12, 2009, 12:15am

Hugh S. wrote:

to me that constructing something as complex as a regexp for such an
operation would be expensive. Even str.scan which will accept a string
rather than a regexp seems to treat such a string as a regexp, from a
quick look at the c code.
    Hugh

Using index() to check the beginning of the string will be very slow,
since will search the entire string for the substring, although you are
only interested in the beginning of the string. I would suggest
str[0,sought.length] == sought, which is considerably faster, especially
for short substrings. For longer ones, it seems (from some quick
benchmarking) that each approach is about equal.

-Justin

require ‘benchmark’
include Benchmark

str = “adiaodibasd” * 1000

sought = “hello”

bmbm do |t|
t.report “Index” do
100000.times do
str.index(sought) == 0
end
end

t.report "Regexp" do
    reg = /^#{sought}/
    100000.times do
        str =~ reg
    end
end

t.report "Slice" do
    100000.times do
        str[0,sought.length] == sought
    end
end

end

jasongarber · June 16, 2009, 10:03pm

Gaspard (and others who are interested),

I’ve been researching some tools in this domain to figure out what
they’re good at and what they’re not. I like the advantages of
parsing expression grammars (PEGs) over regular expressions so I
started a test rewrite with Treetop to find out what the gotchas are
(there are always gotchas). I’ve hit surprisingly few and mostly they
were indicative that I was thinking about the problem wrongly.
Overall it’s going great, it’s fast (so far) and very elegant. I can
get down and spec individual parts of the grammar or even individual
rules!

I’m interested to see what you would come up with using contextual
regexps like you mentioned for the zafu parser. I’d welcome you to
build a little prototype like I’m building with Treetop
(GitHub - jgarber/redcloth-treetop: Just playing around with treetop; seeing how well/fast it would work for RedCloth.
). If yours implements the same subset of Textile, then we can do
some benchmarking, compare the grammars and so forth. Don’t feel
obligatedâ€”just know that I’m open-minded about it.

I think I’d really like to stay away from C/Java unless someone
demonstrates that it’s the only way to parse with any speed. My
experience so far with Treetop indicates a Ruby parser can be fast
enough (and if speed were really more important to us than some other
factors, we wouldn’t have picked Ruby in the first place!).

Either way, I also agree that it needs to be durable over the long-
term. I’m making websites in Textile right and left and I can’t be
updating them every time the parser changes. I think lots of unit
tests are the only sure way to prevent regression (like happened in
4.2.0) and I’ve been very pleased with how I can spec the behavior of
individual rules and grammars in my treetop prototype.

So, if you want to start a prototype regexp parser, that’s cool, or if
you like what you see on my prototype project, you’re welcome to join
that as well.

Regards,
Jason

jasongarber · June 16, 2009, 10:46pm

Yes, you can use Treetop for the interpreter pattern out of the box,
but I’m doing following the visitor pattern like Cucumber did. Just
started working on the visitor, actually. And my posts to the treetop
list are a little old—I’ve moved past the problems I described there—
so don’t read them as the current state of things.

Hope I’m using all the terminology right. I’m new to all this (no
CompSci background).

jasongarber · June 16, 2009, 10:31pm

I followed some of the discussions on the treetop mailing list and it
sure looks great. I’ll have a look at the subset you are parsing and
if that’s not too much, I’ll try an Oniguruma based parser.

From what I understand, Treetop builds an abstract syntax tree with
rendering methods attached. This means that the parser knows about
everything there is to know about parsing and rendering (Interpreter
pattern).

I’ll try to build something that could be totally agnostic of what we
actually do with the AST (or S-expression), may it be pdf, html, latex
or other SM gears (Visitor pattern).

Gaspard

jasongarber · June 17, 2009, 10:14am

Hi Hugh !

I went through a description of the top down parser using Python
(http://effbot.org/zone/simple-top-down-parsing.htm) and it’s funny
how we keep reinventing the wheel: I just implemented something
similar for querybuilder, using a ruby array as stack and poping the
stack depending on operator precedence (http://tinyurl.com/mgj6et).

Using operator precedence could be an interesting solution to solve
inherent textile ambiguities but it does not remove the curse of
tokenizing.

I’ll keep this operator precedence idea in mind when writing a regexp
based prototype.

G.

jasongarber · June 17, 2009, 1:06am

It sounds like you are not looking for alternative techniques any
more. That’s good, but I’ve found Top Down Operator Precedence
parsers, as explained in Chapter 9 of “Beautiful Code” [Oram and
Wilson, O’Reilly] particularly easy to understand, and relatively OK
to debug as well. The chapter is an implementation and explanation
of a paper by Vaughan Pratt in 1973, but it’s an ACM paper, so access
to it may be a problem.

    HTH
    Hugh