Merging partial, incomplete data files

bterkuile · October 21, 2009, 11:20pm

Hello all,

I write scripts for fun and am still a noob.

I’d like to use ruby to solve a problem for me but find it difficult
this time so I thought I’d ask for help.

It’s very basic in fact.

I have 2 csv files I’d like to merge such as :

File1 File2

ref;qty ref;price
A;10 A;100
B;20 D;150
C;30 C;200
E;5 B;75

outputs to File3 =>

File3

ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;“missing_data”;150;“missing_data”
E;5;“missing_data”;“missing_data”

Here is the code I have so far:

###########################################################################
def process_file(file)
processed = []
opened = File.open(file, “r”).readlines

opened.each {|line| processed << line.strip.split(";")}
return processed
end

arr1 = process_file(“file1”)
arr2 = process_file(“file2”)

p arr1, arr2
###########################################################################

==> [[“ref”, “qty”], [“A”, “10”], [“B”, “20”], [“C”, “30”], [“E”, “5”]]
==> [[“ref”, “price”], [“A”, “100”], [“D”, “150”], [“C”, “200”], [“B”,
“75”]]

So I have 2 arrays which is great but I’m not sure on how to go about
merging all this since data is sometimes missing from file1, some other
times from file2. Of course those 2 files are more complex (and
bigger!), and I’d like to setup a mechanism that would allow me to
select which columns to filter and merge, but that’s the basic idea
anyway.

Thanks for your input and direction.

bterkuile · October 21, 2009, 11:55pm

On 10/21/2009 11:20 PM, Benjamin T. wrote:

File3
###########################################################################

times from file2. Of course those 2 files are more complex (and
bigger!), and I’d like to setup a mechanism that would allow me to
select which columns to filter and merge, but that’s the basic idea
anyway.

Thanks for your input and direction.

I’d say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do

keys = f1.keys.to_set | f2.keys

Then you can do

keys.each do |key|

fetch from f1 and f2

end

Kind regards

robert

bterkuile · October 22, 2009, 12:09am

Hi Benjamin,

This can be done in ruby, and my inline responses will hopefully point
you in
the right direction. However, depending on the size of the data or the
amount
of available memory, doing this in ruby may not be the best choice.
If I were doing this, I would probably import both files into a database
like sqlite3, using a script, of course, and generate the output with
sql.
The front end script would accept parameters for the input files, the
columns
to output (or a list of columns to suppress, if any) and produce the
desired
results.

Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, ‘ref’, then read the second file
and match up its record, if any, in in other file, and output the
combined
set. You’d want to keep track of which lines from file1 have a match so
you
can output all the non-matching values. However, I would strongly
suggest
that rather than doing the work yourself, generate code to script sqlite
or other database system.

HTH

-Gyepi

bterkuile · October 21, 2009, 11:57pm

On Oct 21, 2009, at 4:20 PM, Benjamin T. wrote:

Hello all,

Hello.

I write scripts for fun and am still a noob.

I don’t think I wrote code as good as you when I was new.

outputs to File3 =>

File3

ref;qty;price;total
A;10;100;1000
B;20;75;1500
C;30;200;6000
D;“missing_data”;150;“missing_data”
E;5;“missing_data”;“missing_data”

So I have 2 arrays which is great but I’m not sure on how to go about
merging all this since data is sometimes missing from file1, some
other
times from file2.

It’s a tricky question.

There are a few ways I can’t of to do things.

First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

James Edward G. II

bterkuile · October 23, 2009, 12:42am

James Edward G. II wrote:

First, you could load one giant Hash (indexed by ref) of Arrays or
Hashes with the data as you read it, then just dump the Hash to disk.
This is the easiest, but it requires you to have all data in memory at
once. Are the files small enough that you can do that?

Yes, I don’t think that would be a problem. This is what I have so far :

#############################################################################
def process_file(file)
header = []
processed = []
opened = File.open(file, “r”).readlines # store in
array – index=line

opened.each {|line| processed << line.strip.split(";")} # remove
separator and “\n”
header = processed.shift
return header, processed # the array
is clean
end

def build_hash(header, arr, col_pos)
hash = {}
arr.size.times {|counter|
hash["#{arr[counter][0]}_#{header[col_pos]}"]=arr[counter][col_pos] }
return hash
end

hdr1, arr1 = process_file(“file1”)
hdr2, arr2 = process_file(“file2”)

p hdr1, arr1
puts
p hdr2, arr2

hash1 = build_hash(hdr1, arr1, 0)
hash2 = build_hash(hdr1, arr1, 1)

hash3 = build_hash(hdr2, arr2, 0)
hash4 = build_hash(hdr2, arr2, 1)
puts

p hash1, hash2
puts
p hash3, hash4

puts
hash = {}
merge1 = hash1.merge(hash2)
merge2 = hash3.merge(hash4)
hash = merge1.merge(merge2)

p hdr1 + hdr2
p hash
#########################################################################
Outputs to :
[“ref”, “qty”]
[[“A”, “10”], [“B”, “20”], [“C”, “30”], [“E”, “5”]]

[“ref”, “price”]
[[“A”, “100”], [“D”, “150”], [“C”, “200”], [“B”, “75”]]

{“A_ref”=>“A”, “C_ref”=>“C”, “B_ref”=>“B”, “E_ref”=>“E”}
{“A_qty”=>“10”, “C_qty”=>“30”, “B_qty”=>“20”, “E_qty”=>“5”}

{“A_ref”=>“A”, “C_ref”=>“C”, “B_ref”=>“B”, “D_ref”=>“D”}
{“B_price”=>“75”, “A_price”=>“100”, “D_price”=>“150”, “C_price”=>“200”}

[“ref”, “qty”, “ref”, “price”]
{“B_price”=>“75”, “A_ref”=>“A”, “A_price”=>“100”, “C_ref”=>“C”,
“D_ref”=>“D”, “B_ref”=>“B”, “A_qty”=>“10”, “D_price”=>“150”,
“C_price”=>“200”, “C_qty”=>“30”, “E_ref”=>“E”, “B_qty”=>“20”,
“E_qty”=>“5”}
############################################################################
Robert K. wrote:

I’d say, you should look into classes Hash and Set. If you have both
files in a Hash each you can do…

Thanks, I will look into “Set” which I don’t know about.

Gyepi SAM wrote:

However, I would strongly suggest that rather than doing the work yourself, > generate code to script sqlite or other database system.

Yes, I have thought about that. I would guess it would be the most
efficient solution in terms of ressources and speed. However I don’t
mind taking the time and making an effort to understand data
manipulation.

Rather than using generating two arrays, I would read one file
into a hash, keyed by the common key, ‘ref’, then read the second file
and match up its record, if any, in in other file, and output the
combined

Is what I’ve done similar to what you had in mind ?

Thanks all for the input.