Emmanuel [email protected] writes:
> Regular Expression Matching Can Be Simple And Fast
>
> I became a little worried since I'm making heavy use of regexes in a
> program of mine that shortly I'll have to run on each of thousands of
> text files.
I don’t know about any proposed plans for the regexp engine for Ruby, but I would say not to be overly concerned at this stage.
From reading that article, one thing I noted was that even the author acknowledged that the performance of regular expressions is significantly affected by how the regexps are defined. If you keep the basic concepts in mind and create your regexps accordingly, you will likely see performance differences that outweigh the differences between the two approaches outlined in that article - or, putting it another way, a poorly specified RE will perform badly regardless of the algorithm used. What is important is to do things like anchoring, using as precise a specification as possible and taking advantage of any knowledge you have about the data you are processing.
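A quick way to see this for yourself is to benchmark a vague pattern against a precise, bounded one over the same input. A minimal sketch (the sample line and both patterns are made up purely for illustration):

```ruby
require 'benchmark'

# Hypothetical log-style line, repeated to give the engine some work.
line = "user=alice id=12345 status=ok " * 50

# Vague: the leading/trailing .* force extra scanning and backtracking.
vague   = /.*id=(\d+).*/
# Precise: match only the token we want, bounded by word boundaries.
precise = /\bid=(\d+)\b/

n = 20_000
Benchmark.bm(8) do |bm|
  bm.report("vague")   { n.times { line =~ vague } }
  bm.report("precise") { n.times { line =~ precise } }
end
```

Both patterns extract the same id; only the amount of work the engine does differs.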
I’ve not done large (i.e. gigabytes of data) processing with REs under Ruby, but I have done so with Perl and the performance was quite acceptable.
There is no point worrying about optimisation until you know there is a performance issue. For all you know, using the Ruby RE engine for your task may fall well within your performance requirements.
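Before rewriting anything, it is worth timing the actual task. Something along these lines (a sketch - `process` and the sample text are hypothetical stand-ins for your real per-file work) gives you a number to hold up against your requirements:

```ruby
require 'benchmark'

# Hypothetical stand-in for the real per-file regex work.
def process(text)
  text.scan(/\berror\b/).size
end

sample  = "error: disk full\nall ok\nerror: timeout\n" * 1_000
elapsed = Benchmark.realtime { 100.times { process(sample) } }

# Multiply the average per-run time by the thousands of files you expect.
puts format("avg per run: %.5fs", elapsed / 100)
```

If the projected total is acceptable, there is nothing to optimise.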
The other way to look at this is to consider what you would do or use as an alternative. I’ve also used a lot of Tcl and, according to that article, Tcl uses the faster algorithm, yet I’ve never really noticed any great performance difference between using Perl or Tcl. So your choice is either to continue, see if there is a problem and deal with it if/when it occurs, or to jump now and start writing your program in Tcl or awk, or to use grep (maybe even call grep from within Ruby, though I suspect any performance gains would be lost in passing the data between Ruby and the grep process).
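For what it’s worth, shelling out to grep from Ruby is easy enough to try and time against a pure-Ruby loop over the same files. A sketch (the helper names are mine):

```ruby
# Count matching lines with an external grep process. The process spawn
# and the pipe traffic are exactly where the gains can be lost.
def grep_count(pattern, path)
  IO.popen(["grep", "-c", pattern, path]) { |io| io.read.to_i }
end

# Pure-Ruby equivalent, reading the file line by line.
def ruby_count(regexp, path)
  File.foreach(path).count { |line| line =~ regexp }
end
```

Timing both over a representative set of your files would settle the question for your workload.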
I’ve seen many people make a decision regarding the choice of technology because they have read somewhere that x is faster than y. Often, I’ve then seen something created which is flaky, takes 10x as long to develop or simply doesn’t work, when in reality the aspect they were concerned about wasn’t even relevant to the situation they were working in.

Recently, I had an argument with one of our sys admins who wanted to use ReiserFS rather than Ext3 as the file system on a new server. His argument was that ReiserFS had better performance characteristics and the system would perform better. My argument was that file system performance was not a significant bottleneck for the server and we would be better off sticking with a file system that had a better track record, more robust tools and represented a technology more sys admins were familiar with. I lost the argument, initially. The server was configured with ReiserFS and, after a few months, we came in one morning to find massive disk corruption problems. The server was changed to Ext3 and has not missed a beat since. More to the point, the performance using Ext3 is still well within acceptable performance metrics. My point isn’t that ReiserFS may not be a good file system - it probably is, and possibly is even “better” than Ext3. My point is that speed is not the only issue to consider.
Something which the article doesn’t address is complexity and correctness. The algorithm used by Perl et al. may not be fast compared to the alternative, but it is relatively simple. Just as important as performance is correctness: an extremely fast RE engine is not a better solution if it is only correct 95% of the time, or if it is so complex that it is difficult to maintain without bugs creeping in after version updates etc.
Tim