Alphanumeric data clean up

Charles_LSSnyder · October 25, 2008, 6:00pm

Hi

I have a .csv file with terrible mixed text and numeric data:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
…hundreds more lines…

I need to extract ->
date as: mo/day/Year
Code1 as: numeric part only
Code2 as: numeric part only
eg.,

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

I have tried
f = File.open(“datadump.csv”, “r”)
v = File.open(“results.txt”, “w”)

f.each_line do |z|
myarray=[]
myarray=z.chomp.split(",")
myarray.each do |g|
g = g.gsub(/\D/,’’)
puts g
end
end

… no good
(and it doesn’t address the date conversion)…

Any help appreciated!

clsnyder

Charles_LSSnyder · October 25, 2008, 6:22pm

FasterCSV [ http://fastercsv.rubyforge.org/ ] should help you tokenize
data
from the file. Once you have that, you should be able to munge the
values as
you see fit.

Regards,
Craig

Charles_LSSnyder · October 25, 2008, 6:31pm

On Oct 25, 2008, at 10:59 AM, Charles L. Snyder wrote:

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

How’s this:

$ cat in.csv
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
Galactica:~/Desktop$ cat filter_csv.rb
#!/usr/bin/env ruby -wKU

require “date”

require “rubygems”
require “faster_csv”

FCSV.filter(:headers => true, :return_headers => true) do |row|

remove unwanted columns

row.headers.each do |h|
row.delete(h) unless %w[DATE SERIAL Code1 Code2].include? h
end

unless row.header_row?
# convert the DATE field
row[“DATE”] = Date.strptime(row[“DATE”]).strftime("%m/%d/%Y")
# convert CodeN fields
1.upto(2) do |n|
f = “Code#{n}”
row[f] = row[f].to_s[/\d+(?:.\d+)?/]
end
end
end
Galactica:~/Desktop$ ruby filter_csv.rb in.csv > out.csv
Galactica:~/Desktop$ cat out.csv
DATE,SERIAL,Code1,Code2
09/05/2008,1013726,550.92,49500
09/04/2008,1035331,752.51,54692

Hope that helps.

James Edward G. II

Charles_LSSnyder · October 26, 2008, 1:40am

Charles L. Snyder wrote:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

DATA.each_line{|line|
ary = line.split(’,’).values_at(1,2,5,6).
map{|s| s[ /^\S*/ ] }
ary[0] = ary[0].split("-").map{|s| s.to_i}.
values_at(1,2,0).join("/")
puts ary.join(",")
}

END
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

— output —
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

Charles_LSSnyder · October 28, 2008, 1:35am

Those work perfectly

Thanks again for the help!

cls