Fast text parsing

snacktime · November 7, 2006, 10:48am

Say I’m parsing stuff like HTTP headers. What is going to give better
performance: strings with regular expressions, StringIO with
readline, or splitting strings into arrays on a delimiter? Or is it
going to be so close that it’s not really an issue?
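
For anyone who wants to measure rather than guess, here is a minimal sketch of the three approaches named above. The sample header text, the method names, and the regular expression are all assumptions made for illustration; none of them come from a real server or library.

```ruby
require 'stringio'
require 'benchmark'

# Hypothetical sample input; real headers would arrive from a socket.
SAMPLE_HEADERS = "Host: example.com\r\nContent-Type: text/html\r\nContent-Length: 42\r\n"

# Illustrative pattern: header name, colon, optional blanks, value.
HEADER_LINE_RE = /^([^:\r\n]+):[ \t]*([^\r\n]*)\r?$/

# 1. One regular expression scanned over the whole string.
def parse_with_regex(raw)
  raw.scan(HEADER_LINE_RE).to_h
end

# 2. StringIO, read one line at a time.
def parse_with_stringio(raw)
  headers = {}
  StringIO.new(raw).each_line do |line|
    name, value = line.chomp.split(/:\s*/, 2)
    headers[name] = value if value
  end
  headers
end

# 3. Split into an array of lines, then split each line on the delimiter.
def parse_with_split(raw)
  raw.split("\r\n").each_with_object({}) do |line, headers|
    name, value = line.split(": ", 2)
    headers[name] = value if value
  end
end

# Time each approach; the iteration count is arbitrary.
n = 20_000
Benchmark.bm(10) do |bm|
  bm.report("regex:")    { n.times { parse_with_regex(SAMPLE_HEADERS) } }
  bm.report("stringio:") { n.times { parse_with_stringio(SAMPLE_HEADERS) } }
  bm.report("split:")    { n.times { parse_with_split(SAMPLE_HEADERS) } }
end
```

The timings will depend on the Ruby version, the machine, and the shape of the input, so treat any result as specific to your own setup.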

Chris

snacktime · November 7, 2006, 10:48am

On Tue, 31 Oct 2006, snacktime wrote:

Say I’m parsing stuff like HTTP headers. What is going to give better
performance: strings with regular expressions, StringIO with readline, or
splitting strings into arrays on a delimiter? Or is it going to be so close
that it’s not really an issue?

Chris

If you write your regular expressions badly enough, they can surely use
the most CPU.

-a

snacktime · November 7, 2006, 10:48am

On Tue, Oct 31, 2006 at 02:14:42PM +0900, [email protected] wrote:

If you write your regular expressions badly enough, they can surely use
the most CPU.

It all depends. If you’re doing HTTP header parsing, why not just
use the header parsing in Mongrel? It’s already available as a C
extension, and you’re probably not going to get much faster than that.

But if you want to stick with pure Ruby parsing, experiment and see
what works. I was parsing all the Netflix[1] data with Ruby for fun, and
I found out some interesting things about text parsing, at least on my
laptop:

- If you only need the data between two delimiters, it was
  faster to call String#index twice and slice the data out of the
  middle than to split and index into the array.

- But if you wanted three items out, it was faster to do the
  split.

- For simple parsing, regexes were overkill, but if you want to use
  them, make sure to compile them once and use them MANY times.
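
The three observations above can be sketched roughly as follows. The sample line, its comma-separated layout, and every method name here are assumptions chosen for illustration; this is not the code that produced the measurements described above.

```ruby
require 'benchmark'

# Hypothetical record: three fields separated by commas.
SAMPLE_LINE = "1488844,3,2005-09-06"

# Built once and held in a constant so every call reuses the same Regexp.
FIELDS_RE = /\A(\d+),(\d+),(\d{4}-\d{2}-\d{2})\z/

# One field between two delimiters: two String#index calls and a slice,
# with no intermediate array. Assumes the delimiter occurs at least twice.
def middle_by_index(line, delim = ",")
  first  = line.index(delim)
  second = line.index(delim, first + 1)
  line[(first + 1)...second]
end

# The same field via split, which builds an array of every field.
def middle_by_split(line, delim = ",")
  line.split(delim)[1]
end

# All three fields at once: a single split.
def fields_by_split(line, delim = ",")
  line.split(delim)
end

# All three fields via the precompiled regular expression.
def fields_by_regex(line)
  m = FIELDS_RE.match(line)
  m && m.captures
end

# Time each variant; the iteration count is arbitrary.
n = 100_000
Benchmark.bm(16) do |bm|
  bm.report("index twice:")  { n.times { middle_by_index(SAMPLE_LINE) } }
  bm.report("split, one:")   { n.times { middle_by_split(SAMPLE_LINE) } }
  bm.report("split, three:") { n.times { fields_by_split(SAMPLE_LINE) } }
  bm.report("regex, three:") { n.times { fields_by_regex(SAMPLE_LINE) } }
end
```

A plain regexp literal is compiled when the file is parsed, so the "compile once" advice matters mostly for patterns built with Regexp.new or with interpolation inside a loop. Which variant wins will depend on the Ruby version and the data, so it is worth running this against your own input.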

enjoy,

-jeremy

[1] - http://www.netflixprize.com/index

snacktime · November 7, 2006, 10:48am

On 10/30/06, Jeremy H. [email protected] wrote:

If you write your regular expressions badly enough, they can surely use
the most CPU.

It all depends. If you’re doing HTTP header parsing, why not just
use the header parsing in Mongrel? It’s already available as a C
extension, and you’re probably not going to get much faster than that.

I am using it, actually, but I’m writing a proxy and I also need to parse
the headers the server returns. I was thinking about adding a parser
class to the Mongrel parser, based on the existing one, to do this. I
haven’t decided yet, though.