– Matma R.
2011/11/16 Intransition [email protected]:
You know what though. I did some benchmarking and discovered that manually
parsing the text line by line is much faster than using a regular expression
(I used a close approx re). I was kind of surprised by this, since the
regular expression engine is written in C, whereas my line-by-line parser
is in Ruby.
We’re talking about this regex, right?
s.scan(/(.*?(\s+)\s+[^\n]+?\n(?=\2\S|\z))/m)
Well, I wouldn’t be surprised at all that it’s slower. Regex engines
are crazy complicated beasts, and regex matching itself can be, in the
worst case, exponential in complexity (due to backtracking). This one
regex is kind of complicated too; it has multiple nested matching
groups, it has backreferences to them, it has lazy quantifiers, it has
lookahead… this can make it expensive to match, even more so on a
long text.
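To make the backtracking cost concrete, here’s a toy benchmark (my own example, not your regex above): /\A(a+)+\z/ against a near-miss string forces the engine to try every way of splitting the run of a’s between the inner and outer +. One caveat: Ruby 3.2+ added memoization to the regex engine that tames simple pathological patterns like this one, so the blowup may not reproduce on newer Rubies.

```ruby
require "benchmark"

# Classic pathological pattern: the inner and outer + can divide a run
# of a's between them in exponentially many ways.
PATTERN = /\A(a+)+\z/

# On a matching input the answer is found quickly.
p PATTERN.match?("a" * 10)   # => true

# On a near-miss input every split must be rejected before failing.
# On a plain backtracking engine the time roughly doubles with each
# extra "a"; Ruby 3.2+ memoizes this particular shape of pattern.
[14, 18, 22].each do |n|
  input = "a" * n + "!"
  seconds = Benchmark.realtime { PATTERN.match?(input) }
  printf("n=%2d  %.4fs\n", n, seconds)
end
```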
A naive line-parsing algorithm (as far as my understanding of what
you’re trying to achieve goes) just has to split the text on newlines,
look for lines starting with whitespace, and group the array items we
got when splitting - the entire ordeal has just linear complexity, a
piece of cake.
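For concreteness, that linear pass might look something like this (the method name and the grouping shape are my own guesses at what you’re after - indented lines attach to the preceding unindented one):

```ruby
# Split on newlines; start a new group at every line that does not
# begin with whitespace, and append indented lines to the current group.
def group_indented(text)
  groups = []
  text.split("\n").each do |line|
    if line =~ /\A\s/ && !groups.empty?
      groups.last << line   # continuation of the previous block
    else
      groups << [line]      # a new top-level block
    end
  end
  groups
end

p group_indented("foo\n  bar\nbaz")  # => [["foo", "  bar"], ["baz"]]
```

One pass, no backtracking - each line is looked at exactly once.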
Despite that, I still find it curious that there isn’t a more obvious
regular expression for parsing a document in this way. It makes me wonder if
a C.S. PhD could go back to the drawing board, and come up with a better
alternative to REs.
Regexps are hardly ever good for any kind of “parsing”; they were
created for, and are better suited for, pattern matching and
replacing. Here you might be better off with some kind of automated
grammar parser (possibly Perl’s grammars - a Ruby port, anyone? [1]).
Perl guys are also trying to completely reinvent regular expression
syntax for Perl 6, and most of the ideas are really good stuff. [2]
[1] http://en.wikipedia.org/wiki/Perl_6_rules#Grammars
[2] http://dev.perl.org/perl6/doc/design/apo/A05.html (a long read,
but worth it)
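To show what I mean by a grammar-based parser without pulling in any gem (and without pretending this is Perl 6 grammars - it’s just the underlying recursive-descent technique, on a toy arithmetic grammar of my own choosing):

```ruby
# Recursive-descent parser for the grammar:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
# Each rule becomes one method; no regex backtracking anywhere.
class Calc
  def parse(src)
    @tokens = src.scan(%r{\d+|[-+*/()]})
    value = expr
    raise "trailing input: #{@tokens.first}" unless @tokens.empty?
    value
  end

  private

  def expr
    value = term
    while @tokens.first == "+" || @tokens.first == "-"
      op = @tokens.shift
      value = op == "+" ? value + term : value - term
    end
    value
  end

  def term
    value = factor
    while @tokens.first == "*" || @tokens.first == "/"
      op = @tokens.shift
      value = op == "*" ? value * factor : value / factor
    end
    value
  end

  def factor
    tok = @tokens.shift
    if tok == "("
      value = expr
      raise "expected )" unless @tokens.shift == ")"
      value
    else
      Integer(tok)
    end
  end
end

p Calc.new.parse("2 + 3 * (4 - 1)")  # => 11
```

The point is that the grammar is explicit and readable, and the parser’s structure mirrors it one rule per method - which is exactly what regexes lose once they grow nested groups and lookaheads.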