Fastest CSV parsing?

This is the best I’ve come up with so far. It should handle any CSV
record (i.e., fields may contain commas, double quotes, and newlines).

class String
  def csv
    if include? '"'
      ary =
        "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map{|a| a[1] || a[0].gsub(/""/,'"') }
    else
      ary = chomp.split( /,/, -1)
      ## "".csv ought to be [""], not [], just as
      ## ",".csv is ["",""].
      if [] == ary
        [""]
      else
        ary
      end
    end
  end
end
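For concreteness, here is how the method behaves on a few records (the parser is reproduced below so the snippet runs standalone):

```ruby
# The String#csv parser from above, reproduced so this snippet runs on
# its own, followed by a few sample records.
class String
  def csv
    if include? '"'
      ary = "#{chomp},".scan(/\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/)
      raise "Bad csv record:\n#{self}" if $' != ""
      ary.map { |a| a[1] || a[0].gsub(/""/, '"') }
    else
      ary = chomp.split(/,/, -1)
      ary.empty? ? [""] : ary  # "".csv ought to be [""], like ",".csv is ["", ""]
    end
  end
end

p 'a,"b,""c"",d",e'.csv   # => ["a", "b,\"c\",d", "e"]
p ",".csv                 # => ["", ""]
```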

On Aug 16, 2007, at 2:35 PM, William J. wrote:

  ary.map{|a| a[1] || a[0].gsub(/""/,'"') }

end
end

You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

James Edward G. II

James Edward G. II wrote:

    "#{chomp},".scan( /\G"([^"]*(?:""[^"]*)*)",|\G([^,"]*),/ )
  end
end


You are pretty much rewriting FasterCSV here. Why do that when we
could just use it instead?

That is a dishonest comment.

What if someone had said to you when you released “FasterCSV”:
“You are pretty much rewriting CSV here. Why do that when we
could just use it instead?”

Parsing CSV isn’t very difficult.
“FasterCSV” is too slow and far too large. People don’t need
to be installing it on their systems when a few lines of code
will do the job.

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won’t be paid any more money.

William J. wrote:

What if someone had said to you when you released “FasterCSV”:
“You are pretty much rewriting CSV here. Why do that when we
could just use it instead?”

Point made. However…

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won’t be paid any more money.

From JEG2’s own blog post:

“If your number one concern when working with CSV data in Ruby is raw
speed, you might want to know that FasterCSV is no longer the fastest
option.”

http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

Your code may or may not be faster – you’ve offered no comparison.
Regardless, I doubt that JEG2 was trying to stifle your efforts; just
suggesting that you may want to avoid reinventing the wheel.

David

On Aug 16, 2007, at 7:44 PM, William J. wrote:

if include? '"'
  else

That is a dishonest comment.

Not honest? I guess I’m not sure how you meant that.

FasterCSV’s parser uses a very similar regular expression. Quoting
from the source:

 # prebuild Regexps for faster parsing
 @parsers = {
   :leading_fields =>
     /\A(?:#{Regexp.escape(@col_sep)})+/,       # for empty leading fields
   :csv_row        =>
     ### The Primary Parser ###
     / \G(?:^|#{Regexp.escape(@col_sep)})       # anchor the match
       (?: "((?>[^"]*)(?>""[^"]*)*)"            # find quoted fields
           |                                    # ... or ...
           ([^"#{Regexp.escape(@col_sep)}]*)    # unquoted fields
           )/x,
     ### End Primary Parser ###
   :line_end       =>
     /#{Regexp.escape(@row_sep)}\z/             # safer than chomp!()
 }

I felt they were similar enough to say you were recreating it. I can
live with it if you don’t agree though.

What if someone had said to you when you released “FasterCSV”:
“You are pretty much rewriting CSV here. Why do that when we
could just use it instead?”

They did. I said it was too slow and I didn’t care for the
interface, though some do prefer it. Pretty much what you just said
to me, so I look forward to using your EvenFasterCSV library on my
next project.

Parsing CSV isn’t very difficult.

Yeah, it’s not too tough.

I’m a little bothered by how your solution makes me slurp the data
into a String though. Today I was working with a CSV file with over
35,000 records in it, so I’m not too comfortable with that. You
might consider adding a little code to ease that.
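A rough sketch of one way to ease that (mine, not anything from FasterCSV): stream the file line by line, and only treat a line as ending a record once the double quotes balance out, since a quoted field may contain embedded newlines.

```ruby
require 'stringio'

# Sketch: yield one logical CSV record at a time without slurping the
# whole file. A record is complete when its count of double quotes is
# even, i.e. every quoted field has been closed. Assumes valid CSV.
def each_csv_record(io)
  record = ""
  io.each_line do |line|
    record << line
    next if record.count('"').odd?  # a quoted field is still open
    yield record
    record = ""
  end
end

sample = StringIO.new(%Q{a,b\n"x\ny",z\n})
each_csv_record(sample) { |rec| p rec }
```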

Also, I really prefer to work with CSV by headers, instead of column
indices. That’s easier and more robust, in my opinion. You might
want to add some code for that too.

Of course, then we’re just getting closer and closer to FasterCSV, so
maybe not…
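Header-based access, at its simplest, is just zipping the first record against each of the rest; a hypothetical helper (not FasterCSV's interface) might look like:

```ruby
# Sketch: turn an array of records (first record = the headers) into
# an array of header => value hashes. Hypothetical helper, not the
# FasterCSV API.
def rows_by_header(records)
  headers, *rows = records
  rows.map { |row| Hash[headers.zip(row)] }
end

table = rows_by_header([%w[name age], %w[alice 30], %w[bob 25]])
p table.first["name"]   # => "alice"
```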

“FasterCSV” is too slow and far too large.

FasterCSV is mostly interface code to make the user experience as
nice as possible. There’s also a lot of documentation in there. The
core parser is still way smaller than the standard library’s parser.

James Edward G. II

On Behalf Of David M.:

http://blog.grayproductions.net/articles/2007/04/16/no-longer-the-fastest-game-in-town

and JEG2 gave a valuable hint on producing a fast scanner (be it scanning
CSV or whatever): use the humble and underestimated StringScanner…

kind regards -botp
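For the curious, that StringScanner approach might look roughly like this (my sketch, not code from the linked post):

```ruby
require 'strscan'

# Sketch: split one CSV record with StringScanner, pulling off either
# a quoted field (un-doubling "" back to ") or an unquoted field at
# each step. A trailing comma is appended so every field ends in one.
def scan_csv(line)
  s = StringScanner.new(line.chomp + ",")
  fields = []
  until s.eos?
    if s.scan(/"((?:[^"]|"")*)",/)
      fields << s[1].gsub('""', '"')
    elsif s.scan(/([^,"]*),/)
      fields << s[1]
    else
      raise "Bad CSV record: #{line.inspect}"
    end
  end
  fields
end

p scan_csv('a,"b,""c""",d')   # => ["a", "b,\"c\"", "d"]
```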

William J. wrote:

Why do you want people to be dependent on your slow, bloated
code? You perhaps think that if there is an alternative,
you won’t be paid any more money.

Hmmm…and here I was thinking that FasterCSV was free software. Have
you identified some way that James is making money from it? Have you
identified some way that using FasterCSV is hurtful?

William, there’s no need to be so angry. We’re all here to help each
other.

Hi,

On Friday, 17 Aug 2007, at 09:44:58 +0900, William J. wrote:

That is a dishonest comment.

Coding is a kind of sport to me. Besides, it is not my
decision what you do with your time.

I often recode things that are already written and
well-proofed. Sometimes my code is better, sometimes I learn
from comparing it. Sometimes it brings just new ideas to me.

So go on and do what you like; I think that’s still the main
purpose of open source. Though it might not have been an
actually valuable contribution to this list I had fun
reading your solution.

Bertram

Bertram S.
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de

Just a pointer to yet another CSV parsing regex:
http://snippets.dzone.com/posts/show/4430

Cheers,

b.