Alphanumeric data clean up

Hi

I have a .csv file with terrible mixed text and numeric data:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
…hundreds more lines…

I need to extract ->
date as: mo/day/Year
Code1 as: numeric part only
Code2 as: numeric part only
eg.,

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

I have tried
f = File.open(“datadump.csv”, “r”)
v = File.open(“results.txt”, “w”)

f.each_line do |z|
myarray=[]
myarray=z.chomp.split(",")
myarray.each do |g|
g = g.gsub(/\D/,’’)
puts g
end
end

… no good
(and it doesn’t address the date conversion)…

Any help appreciated!

clsnyder

FasterCSV [ http://fastercsv.rubyforge.org/ ] should help you tokenize
data
from the file. Once you have that, you should be able to munge the
values as
you see fit.

Regards,
Craig

On Oct 25, 2008, at 10:59 AM, Charles L. Snyder wrote:

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

How’s this:

$ cat in.csv
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
Galactica:~/Desktop$ cat filter_csv.rb
#!/usr/bin/env ruby -wKU

require “date”

require “rubygems”
require “faster_csv”

FCSV.filter(:headers => true, :return_headers => true) do |row|

remove unwanted columns

row.headers.each do |h|
row.delete(h) unless %w[DATE SERIAL Code1 Code2].include? h
end

unless row.header_row?
# convert the DATE field
row[“DATE”] = Date.strptime(row[“DATE”]).strftime("%m/%d/%Y")
# convert CodeN fields
1.upto(2) do |n|
f = “Code#{n}”
row[f] = row[f].to_s[/\d+(?:.\d+)?/]
end
end
end
Galactica:~/Desktop$ ruby filter_csv.rb in.csv > out.csv
Galactica:~/Desktop$ cat out.csv
DATE,SERIAL,Code1,Code2
09/05/2008,1013726,550.92,49500
09/04/2008,1035331,752.51,54692

Hope that helps.

James Edward G. II

Charles L. Snyder wrote:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

DATA.each_line{|line|
ary = line.split(’,’).values_at(1,2,5,6).
map{|s| s[ /^\S*/ ] }
ary[0] = ary[0].split("-").map{|s| s.to_i}.
values_at(1,2,0).join("/")
puts ary.join(",")
}

END
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS

— output —
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692

Those work perfectly

Thanks again for the help!

cls