Fast text parsing

Say I’m parsing stuff like HTTP headers: what is going to give better
performance? Strings with regular expressions? StringIO with
readline? Splitting strings into arrays on a delimiter? Or is it
going to be so close that it’s not really an issue?

Chris

On Tue, 31 Oct 2006, snacktime wrote:

Say I’m parsing stuff like HTTP headers: what is going to give better
performance? Strings with regular expressions? StringIO with readline?
Splitting strings into arrays on a delimiter? Or is it going to be so close
that it’s not really an issue?

Chris

if you try to write your regular expressions badly enough they can
surely use the most cpu ;)

-a

On Tue, Oct 31, 2006 at 02:14:42PM +0900, [email protected] wrote:

if you try to write your regular expressions badly enough they can surely
use the most cpu ;)

It all ‘depends’ :) If you’re doing HTTP header parsing, why not just
use the header parsing in Mongrel? It’s already available as a C
extension, and it’s probably not going to get much faster than that.

But if you want to stick with pure Ruby parsing, experiment and see
what works. I was parsing all the Netflix[1] data with Ruby for fun and
I found out some interesting things about text parsing, at least on my
laptop:

- if you only need the data between two delimiters, it was
  faster to call String#index twice and slice the data out of the
  middle than to split and index into the array

- but if you wanted 3 items out, it was faster to do the
  split

- for simple parsing, regexes were overkill, but if you want to use
  them, make sure to compile them once and use them MANY times
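
A minimal sketch of the comparison described in the list above, assuming
comma-delimited records with three fields. The sample record, the method
names, and the regex are all made up for illustration; they are not the
code that was actually run, and timings will differ by Ruby version and
machine.

```ruby
require 'benchmark'

# Hypothetical comma-delimited record (id, rating, date); not real data.
RECORD = "42,5,2006-10-31"

# Built once and reused. A literal without interpolation is already
# compiled a single time; the constant mainly matters if the pattern
# would otherwise come from Regexp.new or interpolation inside a loop.
RECORD_RE = /\A([^,]*),([^,]*),(.*)\z/

# One field between two delimiters: two String#index calls and a slice.
def middle_by_index(line, delim = ",")
  first = line.index(delim)
  return nil unless first
  second = line.index(delim, first + 1)
  return nil unless second
  line[(first + 1)...second]
end

# Same field via split, indexing into the resulting array.
# Note: with only one delimiter in the line this returns the last
# field, whereas middle_by_index returns nil.
def middle_by_split(line, delim = ",")
  line.split(delim)[1]
end

# All three fields via split.
def fields_by_split(line, delim = ",")
  line.split(delim, 3)
end

# All three fields via the prebuilt regex; nil if the line does not match.
def fields_by_regex(line)
  m = RECORD_RE.match(line)
  m && m.captures
end

# Times each approach n times on the sample record.
def compare(n = 200_000)
  Benchmark.bm(16) do |x|
    x.report("index + slice") { n.times { middle_by_index(RECORD) } }
    x.report("split[1]")      { n.times { middle_by_split(RECORD) } }
    x.report("split (3)")     { n.times { fields_by_split(RECORD) } }
    x.report("regex (3)")     { n.times { fields_by_regex(RECORD) } }
  end
end
```

Calling `compare` prints one timing row per approach; running it against
your own data is the only way to see which one wins for your line shapes.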

enjoy,

-jeremy

[1] - http://www.netflixprize.com/index

On 10/30/06, Jeremy H. [email protected] wrote:

if you try to write your regular expressions badly enough they can surely
use the most cpu ;)

It all ‘depends’ :) If you’re doing HTTP header parsing, why not just
use the header parsing in Mongrel? It’s already available as a C
extension, and it’s probably not going to get much faster than that.

I am using it actually, but I’m writing a proxy and I also need to parse
the headers the server returns. I was thinking about just adding
a parser class to the Mongrel parser to do this, based on the existing
one. Still undecided, though.