Parsing CSV

Hi guys, I'm a newbie in Ruby. I have to parse two CSV files and
compare two columns from them. I've tried a lot of different ways to
handle this. For example, I put one entire column in an array and the
other column in a second array, then checked which array was bigger so
I could loop through it and compare both files that way. It did not
work. I was thinking of using CSV, but it's limited, and then I came
across FasterCSV, which is the module I'm stuck on right now. If
somebody can make a suggestion, I'd really appreciate it.

Thanks in advance.

PS: I was told to make this tool in Java, but AFAIK Ruby is better for
handling text files.

On Mon, Feb 26, 2007 at 10:50:22PM +0900, Rafael G. wrote:

Hi guys, I'm a newbie in Ruby. I have to parse two CSV files and
compare two columns from them. I've tried a lot of different ways to
handle this. For example, I put one entire column in an array and the
other column in a second array, then checked which array was bigger so
I could loop through it and compare both files that way. It did not
work

Well, posting your code might allow someone to help you spot what’s
wrong.

I’d suggest you first check that the two arrays are being read in
properly. If they are called a1 and a2, then “puts a1.inspect” and
“puts a2.inspect” will print them to the screen. Then you know whether
the problem is in reading them, or in comparing them.
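
A minimal sketch of that check, assuming Ruby 1.9 or later, where the
standard csv library is the former FasterCSV. The sample data, the
column index, and the helper name read_column are made up for
illustration and are not from the original post.

```ruby
require "csv"

# Hypothetical sample data standing in for the two files.
csv1 = "id,name\n1,alice\n2,bob\n"
csv2 = "id,name\n2,bob\n3,carol\n"

# Collect one column (by zero-based index) from CSV text into an Array.
def read_column(text, index)
  CSV.parse(text).map { |row| row[index] }
end

a1 = read_column(csv1, 1)
a2 = read_column(csv2, 1)

# inspect shows the exact contents, including header cells, nil values
# and stray whitespace that a plain puts would hide.
puts a1.inspect  # prints ["name", "alice", "bob"]
puts a2.inspect  # prints ["name", "bob", "carol"]
```

For real files, CSV.read(filename) returns the same kind of array of
rows that CSV.parse returns for a string.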

Posting a more precise description of what you’re trying to do, along
with some sample data and what output you expect, would also make it
easier for someone to help you.

PS: I was told to make this tool in Java, but AFAIK Ruby is better for
handling text files.

The better language is the one which you can actually use to get the job
done :)

How you do this in Ruby depends on what exactly you mean by ‘compare’,
since you didn’t define exactly what you’re trying to do. I’m guessing
you mean checking for values which are in the first file but not in the
second, or vice versa. For a simple solution, have a look at
Array#include?
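
As a rough illustration of the Array#include? idea, here is a small
sketch. The two arrays are invented sample values standing in for the
columns read from each file.

```ruby
# Hypothetical column values from the first and second file.
a1 = %w[alice bob carol]
a2 = %w[bob carol dave]

# Keep the values that the other array does not contain.
only_in_first  = a1.reject { |value| a2.include?(value) }
only_in_second = a2.reject { |value| a1.include?(value) }

puts "only in first file:  #{only_in_first.inspect}"
puts "only in second file: #{only_in_second.inspect}"
```

Each include? call scans the other array from the start, so this is
fine for small files but gets slow as both columns grow. The built-in
array difference (a1 - a2) gives the same result more compactly.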

For a more efficient solution, you could first sort the two arrays and
then walk down them with two pointers i and j. When a1[i] == a2[j] then
you increment both i and j. When a1[i] < a2[j] then you know an item is
missing in a2, and just increment i. When a1[i] > a2[j] then you know
an item is missing in a1, and just increment j.
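
The walk described above could be sketched like this. The method name
sorted_diff and the sample values are hypothetical. The sketch assumes
the columns contain no nil cells (sort would raise on them) and treats
repeated values as separate items.

```ruby
# Walk two sorted arrays in step and report values missing from either side.
def sorted_diff(a1, a2)
  a1 = a1.sort
  a2 = a2.sort
  missing_in_a2 = []
  missing_in_a1 = []
  i = 0
  j = 0

  while i < a1.size && j < a2.size
    case a1[i] <=> a2[j]
    when 0   # same value in both: advance both pointers
      i += 1
      j += 1
    when -1  # a1[i] is smaller, so a2 does not contain it
      missing_in_a2 << a1[i]
      i += 1
    else     # a2[j] is smaller, so a1 does not contain it
      missing_in_a1 << a2[j]
      j += 1
    end
  end

  # Whatever is left over on one side has no partner on the other.
  missing_in_a2.concat(a1[i..-1])
  missing_in_a1.concat(a2[j..-1])

  [missing_in_a2, missing_in_a1]
end

missing_in_a2, missing_in_a1 =
  sorted_diff(%w[carol alice bob], %w[dave bob carol])
puts "missing in a2: #{missing_in_a2.inspect}"
puts "missing in a1: #{missing_in_a1.inspect}"
```

After the two sorts, the walk itself touches each element once, which
is where the gain over repeated include? scans comes from.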

Incidentally, you don’t even need Ruby to do this; the shell command
‘join’ can do this for you (as long as you use ‘sort’ to pre-sort your
input).

HTH,

Brian.

This code might get you started:

require "faster_csv"

# Read a CSV file into a table that is indexed by column.
def read_csv(filename)
  FasterCSV::Table.new(FasterCSV.read(filename)).by_col
end

data1 = read_csv("data1.csv")
data2 = read_csv("data2.csv")

# Compare the column at this zero-based index in both files.
compare_column_idx = 1
unless data1[compare_column_idx] == data2[compare_column_idx]
  puts "column #{compare_column_idx} is different"
end

Regards,
Stephane

require "csv"

passvalues = []
i = 0
# For every line of the source file, rescan the whole target file.
IO.foreach(fsource) do |line|
  cols = CSV.parse_line(line.chomp)
  sourceval = cols[scomp_args[0]] + " " + cols[scomp_args[1]]
  IO.foreach(tdest) do |tline|
    tcols = CSV.parse_line(tline.chomp)
    testval = tcols[tcomp_args[0]] + " " + tcols[tcomp_args[1]]
    if sourceval == testval
      passvalues[i] = sourceval
      i += 1
    end
  end
end

Here is what I got

On Feb 26, 2007, at 11:48 AM, James Edward G. II wrote:

passvalues = FCSV.open(fsource) do |source|
  source.select do |row|
    allowed.include? row[scomp_args[0]..scomp_args[1]].join(" ")
  end
end

The above destroys the field order. If you need to keep the order,
use an Array instead:

allowed = Array.new
FCSV.foreach(tdest) do |row|
  allowed << row[tcomp_args[0]..tcomp_args[1]].join(" ")
end

James Edward G. II

On Feb 26, 2007, at 12:54 PM, James Edward G. II wrote:

end

passvalues = FCSV.open(fsource) do |source|
  source.select do |row|
    allowed.include? row[scomp_args[0]..scomp_args[1]].join(" ")
  end
end

The above destroys the field order.

Sorry, I meant row order.

James Edward G. II

On Feb 26, 2007, at 11:48 AM, James Edward G. II wrote:

tcols=CSV::parse_line line.chomp
passvalues = Array.new
FCSV.foreach(fsource) do |s_row|
  source = s_row[scomp_args[0]..scomp_args[1]].join(" ")
  FCSV.foreach(tdest) do |t_row|
    if source == t_row[tcomp_args[0]..tcomp_args[1]].join(" ")
      passvalues << source
      break  # performance enhancement: stop at the first match
    end
  end
end

James Edward G. II

Thanks, James and everyone else. I think I've found the solution to my
problem :)

On Feb 26, 2007, at 8:45 AM, Rafael G. wrote:

    if sourceval == testval
      passvalues[i] = sourceval
      i += 1
    end
  end
end

The direct translation of this code to FasterCSV is:

passvalues = Array.new
FCSV.foreach(fsource) do |s_row|
  source = s_row[scomp_args[0]..scomp_args[1]].join(" ")
  FCSV.foreach(tdest) do |t_row|
    if source == t_row[tcomp_args[0]..tcomp_args[1]].join(" ")
      passvalues << source
    end
  end
end

If you can afford to read one of the files into memory because it’s
not too large, you can probably speed that up quite a bit:

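require "set"

# Load the comparison keys of the target file into memory once.
allowed = Set.new
FCSV.foreach(tdest) do |row|
  allowed.add(row[tcomp_args[0]..tcomp_args[1]].join(" "))
end

# Keep only the source rows whose key also appears in the target file.
passvalues = FCSV.open(fsource) do |source|
  source.select do |row|
    allowed.include? row[scomp_args[0]..scomp_args[1]].join(" ")
  end
end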
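
For anyone who wants to try the Set idea end to end, here is a
self-contained sketch. It assumes Ruby 1.9 or later, where the standard
csv library is the former FasterCSV, so CSV stands in for FCSV. The
temporary files, their contents, and the column positions are all made
up for illustration. It also uses values_at to pick exactly two columns,
as the nested-loop code in the question does, and collects the joined
keys rather than whole rows.

```ruby
require "csv"
require "set"
require "tempfile"

# Hypothetical sample files standing in for fsource and tdest.
source_file = Tempfile.new(["source", ".csv"])
source_file.write("1,alice,smith\n2,bob,jones\n3,carol,white\n")
source_file.close

target_file = Tempfile.new(["target", ".csv"])
target_file.write("bob,jones,x\ncarol,black,y\nalice,smith,z\n")
target_file.close

scomp_args = [1, 2]  # assumed columns to join in the source file
tcomp_args = [0, 1]  # assumed columns to join in the target file

# Build the lookup set from the target file in a single pass.
allowed = Set.new
CSV.foreach(target_file.path) do |row|
  allowed << row.values_at(*tcomp_args).join(" ")
end

# Keep the source keys that also appear in the target, in source order.
passvalues = []
CSV.foreach(source_file.path) do |row|
  key = row.values_at(*scomp_args).join(" ")
  passvalues << key if allowed.include?(key)
end

puts passvalues.inspect  # prints ["alice smith", "bob jones"]
```

Each file is read only once here, instead of rereading the target file
for every source row, at the cost of holding one set of keys in memory.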

Hope that gives you some fresh ideas.

James Edward G. II