Remove and merge duplicates in a CSV file

The data set has duplicate records. There can be more than one
duplicate for the same contact.

How do I remove the duplicate records and build a single record for each name, keeping as many fields as possible?

Sample input file
FN, LN, phone1, phone2, email, city
Matt, x, 9800000000, , , NYC
Matt, , 9800000001, 8822334490, ,
Matt, x, 9845012345, 9800000000, ,
Matt, , 9800000000, , [email protected], NYC
Matt, x, , 9845012345, [email protected], NYC
Matt, x, 9845012345, 9800000000, , NYC
Matt, y, 9800000001, , , NYC

Sample Output
FN, LN, phone1, phone2, email, city
Matt, x, 9800000000, 9845012345, [email protected], NYC
Matt, y, 9800000001, 8822334490, , NYC

Since you can’t match on first name alone, you’ll actually end up with three entries: Matt x, Matt y, and Matt (with a blank last name).

I quickly put this together:

Data in:

require 'csv'
filename = 'C:\Users\Joel\Desktop\in.csv'
data = CSV.parse( File.read(filename) )

Tidy up:

data.map! { |row| row.map { |val| val.to_s.strip } }
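This tidy-up step matters because `CSV.parse` keeps the whitespace that follows each comma in the sample input, and empty trailing cells come back as `nil`. A quick sketch of what the pass does:

```ruby
require 'csv'

# CSV.parse preserves the space after each comma in the sample input
data = CSV.parse("FN, LN\nMatt, x")
# data == [["FN", " LN"], ["Matt", " x"]]

# to_s turns nil cells into "", and strip removes surrounding whitespace
data.map! { |row| row.map { |val| val.to_s.strip } }
# data == [["FN", "LN"], ["Matt", "x"]]
```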

Create a Hash to hold the data

hash = Hash.new{ |h,k| h[k]={} }
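The block passed to `Hash.new` is the hash's default proc: it runs on first access to a missing key, storing (and returning) a fresh inner Hash for that key, so we never have to create the per-person Hash by hand. A small illustration using one of the sample contacts:

```ruby
# Default proc: on a missing key, store and return a new empty Hash
hash = Hash.new { |h, k| h[k] = {} }

# First access to ["Matt", "x"] auto-creates the inner Hash
hash[["Matt", "x"]]["city"]  = "NYC"
hash[["Matt", "x"]]["email"] = "[email protected]"

# hash == { ["Matt", "x"] => { "city" => "NYC", "email" => "[email protected]" } }
```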

Iterate and write any found values into the first & last name groups

data[1..-1].each do |row|
  row[2..-1].each.with_index do |value, index|
    unless value.empty?
      hash[ row[0..1] ][ data[0][ index+2 ] ] = value
    end # unless value empty
  end # row values
end # data rows

Output the results as a CSV

outfile = 'C:\Users\Joel\Desktop\out.csv'
CSV.open( outfile, 'w' ) do |csv|

Headers

csv << data[0]

Data

hash.each do |person, details|

  # First we write the person's names
  csv << [ *person,
    # Then we go through each of the headers
    *data[0][2..-1].map do |header|
      # And find the associated data
      details[ header.to_s.strip ]
    end # map data
  ] # csv row

end # hash loop
end # CSV
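Putting the pieces together, here is a self-contained version you can run anywhere (it swaps the file paths for an in-memory string and `CSV.generate`, which are my substitutions, not part of the snippets above). Note that because later values overwrite earlier ones, the merged phone numbers won't necessarily match the hand-written sample output, but each name group collapses to one row:

```ruby
require 'csv'

# Sample input from the question, inline instead of read from disk
input = <<~CSV
  FN, LN, phone1, phone2, email, city
  Matt, x, 9800000000, , , NYC
  Matt, , 9800000001, 8822334490, ,
  Matt, x, 9845012345, 9800000000, ,
  Matt, , 9800000000, , [email protected], NYC
  Matt, x, , 9845012345, [email protected], NYC
  Matt, x, 9845012345, 9800000000, , NYC
  Matt, y, 9800000001, , , NYC
CSV

data = CSV.parse(input)
data.map! { |row| row.map { |val| val.to_s.strip } }

# [first, last] => { header => value }
hash = Hash.new { |h, k| h[k] = {} }

data[1..-1].each do |row|
  row[2..-1].each.with_index do |value, index|
    hash[row[0..1]][data[0][index + 2]] = value unless value.empty?
  end
end

output = CSV.generate do |csv|
  csv << data[0]
  hash.each do |person, details|
    csv << [*person, *data[0][2..-1].map { |header| details[header] }]
  end
end

puts output
```

This produces one row each for Matt x, Matt (blank last name), and Matt y, as predicted above.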
