Grep a CSV?

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word “Blah1” and throw them in a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

Sadly, this only returned => [].

Any ideas?

Thanks!

On Aug 15, 2007, at 22:30 , Michael L. wrote:

If I had a huge CSV file and I wanted to pull out, say, all the lines
that contain the word “Blah1” and throw them in a temporary file, what
would be the best approach?

Any ideas?

My first thought was to use FasterCSV and go from there. Any reason not
to use a dedicated CSV lib?
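
A minimal sketch of that route (assuming the fastercsv gem; the
filename and pattern here are just placeholders):

require 'rubygems'
require 'fastercsv'

# FasterCSV parses each record properly, so quoted commas and
# newlines inside fields won't split a row apart.
matching_rows = []
FasterCSV.foreach('filename.csv') do |row|
  matching_rows << row if row.any? { |field| field.to_s =~ /Blah1/ }
end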

Michael G.
grzm seespotcode net


I’m planning to use Ruport to graph the data, but if you can integrate
the output from FasterCSV into a Ruport graph, I’m all ears :)

On 8/15/07, Michael L. [email protected] wrote:

Sadly, this only returned => [].

Any ideas?

Thanks!


File.readlines('filename.csv').grep(/Blah1/)

On 8/15/07, Michael L. [email protected] wrote:

I’m planning to use Ruport to graph the data, but if you can integrate
the output from FasterCSV into a Ruport graph, I’m all ears :)


Ruport uses FasterCSV for its CSV parsing.
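
(That also explains the => [] above: Enumerable#grep tests the pattern
with Regexp#===, which only matches Strings, and the table yields
Record objects rather than lines of text. A hedged sketch of matching
on the rows’ contents instead, assuming the table is Enumerable over
records and Record#to_a is available:)

t = Ruport::Data::Table.load("filename.csv")
# Compare against the joined row text rather than the Record itself.
hits = t.select { |record| record.to_a.join(',') =~ /Blah1/ }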

M. Edward (Ed) Borasky wrote:

Seriously, though, if this is a recurring use case rather than a
one-shot “somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!” use case, I’d load it into a database (assuming your database
doesn’t have a column count limitation smaller than the column count in
your file, that is) and then hook up to it with DBI. But if it’s a
one-shot deal and you’ve got a command line handy (Linux, MacOS, BSD
or Cygwin)

Windoze has a command-line.

just do “grep blah1 huge-file.csv > temp-file.csv”. Bonus points for
being able to write that in Ruby and get it debugged before someone
who’s been doing command-line for years types that one-liner in. :)

Chris C. has already done it. Have you ever posted Ruby code here?
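
(For reference, the Ruby spelling of that one-liner is nearly as
short; a sketch using ruby’s -n autoloop flag, under which a bare
regexp condition matches the current line $_:)

ruby -ne 'print if /blah1/' huge-file.csv > temp-file.csv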

M. Edward (Ed) Borasky wrote:

Michael L. wrote:

Sadly, this only returned => [].

Any ideas?

Thanks!

OK … first of all, define “huge” and what are your restrictions? Let
me assume the worst case just to get started: more than 256 columns
and more than 65536 rows, and you’re on Windows. :)

Seriously, though, if this is a recurring use case rather than a
one-shot “somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!” use case, I’d load it into a database (assuming your database
doesn’t have a column count limitation smaller than the column count in
your file, that is) and then hook up to it with DBI. But if it’s a
one-shot deal and you’ve got a command line handy (Linux, MacOS, BSD or
Cygwin) just do “grep blah1 huge-file.csv > temp-file.csv”. Bonus points
for being able to write that in Ruby and get it debugged before someone
who’s been doing command-line for years types that one-liner in. :)
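
(A sketch of that database route, assuming the dbi and dbd-sqlite3
gems; the database name and one-column schema are made up purely for
illustration:)

require 'rubygems'
require 'dbi'

# One TEXT column holding each raw CSV line; a real schema would
# split the columns out properly.
dbh = DBI.connect('DBI:SQLite3:csvdata.db')
dbh.do('CREATE TABLE IF NOT EXISTS rows (line TEXT)')
File.foreach('huge-file.csv') do |l|
  dbh.do('INSERT INTO rows (line) VALUES (?)', l.chomp)
end
matches = dbh.select_all("SELECT line FROM rows WHERE line LIKE '%Blah1%'")
dbh.disconnect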

lol, alright, let’s say the scenario will be in the range of 20k-70k
lines of data, no more than 20 columns. And I want to avoid using the
command line to do this, because yes, in fact, this will be used to
process more than one data file, which I hope to set up in optparse
with a command-line arg that directs the program to the file. Also, for
the meantime I don’t want to throw it on any database … avoiding DBI
for the meanwhile. But an idea flew through my head a few minutes ago …
what if I did this:

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

Ran into a small demeaning problem: this shoves all that grep’d data
into one element, lol … res[1] => nil … it’s all shoved into res[0]. I’d
hope to fix this with a simple do statement, but a little confusion hit
me while doing that with a readline command. And by shoving this into an
array, will I still be able to single out columns of data? If not, how
would I shove the grep data into a second CSV file, doing this all
inside Ruby of course, no command-line program > output.csv :)
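
(A sketch of that last part with plain File calls, filenames
hypothetical: readlines already returns one line per array element, so
writing the matches back out takes only a couple of lines:)

matches = File.readlines('filename.csv').grep(/Blah1/)
# Each element still ends in a newline, so join/write preserves the rows.
File.open('output.csv', 'w') { |f| f.write(matches.join) }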

On 16 Aug 2007, at 13:08, Michael L. wrote:

But an idea flew through my head a few minutes ago … what if I did this:

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

Array#<< appends the object onto your Array; you want to combine the
two arrays using Array#+:

irb(main):001:0> a = []
=> []
irb(main):002:0> a << [1,2,3]
=> [[1, 2, 3]]
irb(main):003:0> a = []
=> []
irb(main):004:0> a += [1,2,3]
=> [1, 2, 3]
irb(main):005:0>

Though why don’t you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex G.

Bioinformatics Center
Kyoto University

Alex G. wrote:


Though why don’t you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex G.

Can I push that into a file to temporarily use to pull all the Blah1
data from, then at the end of the program delete Blah1.csv?


From: Michael L. [mailto:[email protected]]

inside Ruby of course, no command-line program > output.csv :)

This is just a simple example:

irb(main):018:0> puts File.readlines('filename.csv')
this, is , a , test, foo
this, is , a , test, bar
this, is , a , test, Blah1
this, is , a , test, bar
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):019:0> puts File.readlines('filename.csv').grep(/Blah/)
this, is , a , test, Blah1
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):020:0> puts Ruport::Data::Table.new(:data =>
File.readlines('filename.csv').grep(/Blah/).map{|r| r.split(/,/)})
+---------------------------------------+
| this | is | a | test | Blah1 |
| this | Blah | is | a | test |
| this | is | a | Blah | test |
+---------------------------------------+
=> nil
irb(main):021:0>

kind regards -botp

On 16 Aug 2007, at 13:29, Michael L. wrote:

then at the end of the program delete Blah1.csv?

Sure, use tempfile, but I think botp has shown why you don’t really
need the temporary file (unless there’s part of this problem I’m not
understanding):

irb(main):001:0> puts File.readlines('filename.csv')
this, is , a , test, foo
this, is , a , test, bar
this, is , a , test, Blah1
this, is , a , test, bar
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):002:0> puts File.readlines('filename.csv').grep(/Blah1/)
this, is , a , test, Blah1
=> nil
irb(main):003:0> require 'tempfile'
=> true
irb(main):004:0> tf = Tempfile.new('csv')
=> #<File:/tmp/csv.1339.0>
irb(main):005:0> tf.puts File.readlines('filename.csv').grep(/Blah1/)
=> nil
irb(main):006:0> tf.close
=> nil
irb(main):007:0> tf.open
=> #<File:/tmp/csv.1339.0>
irb(main):008:0> puts tf.gets
this, is , a , test, Blah1
=> nil

Alex G.

Bioinformatics Center
Kyoto University
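
(On the delete-at-the-end part: a Tempfile is removed automatically
when the object is garbage collected, and a sketch of forcing that
sooner is close!, which closes the handle and unlinks the file in one
step:)

require 'tempfile'

tf = Tempfile.new('csv')
tf.puts 'this, is , a , test, Blah1'
tf.close!  # close and delete immediately rather than waiting for GC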

Alex G. wrote:

Sure, use tempfile, but I think botp has shown why you don’t really
need the temporary file (unless there’s part of this problem I’m not
understanding).

The reason I’m going to use a tempfile is because there’s going to be,
per se, a Blah2, Blah3, etc. So I’m storing the grep from Blah1, Blah2,
and Blah3 in different files so I can later pull the data from those
files to graph each line with Ruport; each line will represent Blah1,
Blah2, Blah3, etc. So unless I can shove the grep output into an array
with each line being its own element, I’ve gotta use a tempfile. Maybe
there’s a better way?

If there’s a better approach, I’m all ears :D

Thanks a ton, btw; you all have been a tremendous help.
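
(A sketch of keeping it all in memory instead, with a hypothetical
/Blah\d+/ pattern: one array per token, each matching line its own
element, which may make the temp files unnecessary:)

# groups['Blah1'], groups['Blah2'], ... each hold their own lines.
groups = Hash.new { |h, k| h[k] = [] }
File.foreach('filename.csv') do |line|
  groups[$1] << line if line =~ /(Blah\d+)/
end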

Michael L. wrote:

lol, alright, let’s say the scenario will be in the range of 20k-70k
lines of data, no more than 20 columns. And I want to avoid using the
command line to do this, because yes, in fact, this will be used to
process more than one data file … But an idea flew through my head a
few minutes ago … what if I did this:

res = []
res << File.readlines('filename.csv').grep(/Blah1/) # thanks chris

There’s a problem with using File.readlines that I don’t think anyone’s
mentioned yet. I don’t know if it’s relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
know that’s not going to be a problem, I’d still suggest FasterCSV…
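
(A sketch of that suggestion, filenames hypothetical: FasterCSV both
parses and re-quotes fields, so embedded commas and newlines survive
the round trip out to the temporary file:)

require 'rubygems'
require 'fastercsv'

# Parse properly, filter on field contents, write the matches back out.
rows = FasterCSV.read('filename.csv').select do |row|
  row.any? { |f| f.to_s.include?('Blah1') }
end
FasterCSV.open('Blah1.csv', 'w') do |csv|
  rows.each { |row| csv << row }
end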

On Aug 15, 10:18 pm, Michael L. [email protected] wrote:

So I’m storing the grep from Blah1, Blah2, and Blah3 in different
files so I can later pull the data from those files to graph each line
with Ruport … So unless I can shove the grep output into an array with
each line being its own element, I’ve gotta use a tempfile. Maybe
there’s a better way?

If there’s a better approach, I’m all ears :D

Thanks a ton, btw; you all have been a tremendous help.


Not sure if this is what you are looking for, but if you are trying
to search a file placing the lines that contain “Blah1” into a file
called “Blah1.csv”, the lines containing “Blah2” into a file called
“Blah2.csv”, etc., Rio might help:

require 'rio'
rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line,m|
  rio(m) + '.csv' << line + $/
end

Then again it might not.

On Aug 16, 1:15 am, Peña, Botp [email protected] wrote:

From: rio4ruby [mailto:[email protected]]

require 'rio'
rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line,m|
  rio(m) + '.csv' << line + $/
end

Simply amazing. BTW, how does Rio handle big files; does it load them whole into memory?

thanks for rio.
kind regards -botp

Never. Examples that assume I have a file small enough to load into
memory irritate me. :)
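
(In plain Ruby, the streaming equivalent of the readlines approaches
above is a sketch like this, filenames hypothetical: File.foreach
yields one line at a time, so memory use stays flat no matter how huge
the file is:)

File.open('Blah1.csv', 'w') do |out|
  File.foreach('filename.csv') do |line|
    out.print line if line =~ /Blah1/
  end
end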


On Aug 16, 3:04 am, Alex Y. [email protected] wrote:

There’s a problem with using File.readlines that I don’t think anyone’s
mentioned yet. I don’t know if it’s relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
know that’s not going to be a problem, I’d still suggest FasterCSV…

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it’s a CSV-relevant question:

Another danger with not using CSV packages (as I’ve learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I’m a fairly inexperienced programmer, and not just in
Ruby, and I haven’t yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo", "bar", "\"foo", "bar\""],
an array of size four. What’s an efficient, elegant way of
gathering quoted columns? I’m sure FasterCSV is capable of dealing
with such things, but it’s something I’d like to actually learn,
rather than finding a way around it.

Kaldrenon wrote:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo", "bar", "\"foo", "bar\""],
an array of size four. What’s an efficient, elegant way of
gathering quoted columns?

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields
with '"' in them.

I’m sure FasterCSV is capable of dealing with such things, but it’s
something I’d like to actually learn, rather than finding a way
around it.

That’s just it: the cleanest way to handle this problem is to just take
advantage of someone else’s work. CSV looks just trivial enough to the
human eye that you can convince yourself “Oh, it can’t be that
hard” … but it’s still complicated enough that you get tired of coding
around the fiddly little details very, very quickly. Besides, when a
library is as easy to use as either CSV or FasterCSV (my preference is
for the latter, but it’s not in stdlib), what’s the point in stressing?
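
(For comparison, a minimal sketch with the stdlib csv library;
FasterCSV’s parse_line behaves the same way:)

require 'csv'

# The parser understands the quoting rules, so the quoted comma survives:
CSV.parse_line('foo,bar,"foo,bar"')  # => ["foo", "bar", "foo,bar"]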

On Aug 16, 11:56 am, Alex Y. [email protected] wrote:

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That’s very cool, granting that there are still a number of conditions
for which it’ll go bust. But I’m a little confused about how a pattern
like [^,]+ gets an element, given that (unless I’m mistaken) in a
standard regexp, it would only match on a string that contained a
series of commas at the beginning of a line, like “,” or “abc\n,”.
What’s my mistake/confusion here?
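
(The likely mix-up: ^ only anchors the start of a line outside
brackets; inside [...] it negates the character class, so [^,]+ means
“one or more characters that are not commas”. A quick irb check:)

irb(main):001:0> 'foo,bar'.scan(/[^,]+/)
=> ["foo", "bar"]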

Besides, when a
library is as easy to use as either CSV or FasterCSV (my preference is
for the latter, but it’s not in stdlib), what’s the point in stressing?

The only point is in learning. I like to be able to use packages
without the effort that went into making them, but I don’t really like
jumping from DIY to “Here’s a way, and don’t worry about how it
works … it’s magic!”

Thanks for the info,
Andrew