Forum: Ruby alphanumeric data clean up

Charles L. Snyder (Guest)
on 2008-10-25 20:00
(Received via mailing list)
Hi

I have a .csv file with a terrible mix of text and numeric data:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
....hundreds more lines...

I need to extract:
DATE as: month/day/year
Code1 as: numeric part only
Code2 as: numeric part only
e.g.,

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692


I have tried
f = File.open("datadump.csv", "r")
v = File.open("results.txt", "w")

f.each_line do |z|
  myarray = z.chomp.split(",")
  myarray.each do |g|
    g = g.gsub(/\D/, '')
    puts g
  end
end

...no good: it strips the decimal points along with the letters
(and it doesn't address the date conversion).

Any help appreciated!

clsnyder
Craig D. (Guest)
on 2008-10-25 20:22
(Received via mailing list)
FasterCSV [ http://fastercsv.rubyforge.org/ ] should help you tokenize the
data from the file. Once you have that, you should be able to munge the
values as you see fit.
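
Untested, but something along these lines should get you started:

require "rubygems"
require "faster_csv"   # gem install fastercsv

# :headers => true lets you pull fields out by column name
FasterCSV.foreach("datadump.csv", :headers => true) do |row|
  p [row["DATE"], row["SERIAL"], row["Code1"], row["Code2"]]
  # ...munge the date and code values here...
end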

Regards,
Craig
James G. (Guest)
on 2008-10-25 20:31
(Received via mailing list)
On Oct 25, 2008, at 10:59 AM, Charles L. Snyder wrote:

> DATE,SERIAL,Code1,Code2
> 9/5/2008,1013726,550.92,49500
> 9/4/2008,1035331,752.51,54692

How's this:

$ cat in.csv
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
Galactica:~/Desktop$ cat filter_csv.rb
#!/usr/bin/env ruby -wKU

require "date"

require "rubygems"
require "faster_csv"

FCSV.filter(:headers => true, :return_headers => true) do |row|
   # remove unwanted columns
   row.headers.each do |h|
     row.delete(h) unless %w[DATE SERIAL Code1 Code2].include? h
   end

   unless row.header_row?
     # convert the DATE field
     row["DATE"] = Date.strptime(row["DATE"]).strftime("%m/%d/%Y")
     # convert CodeN fields
     1.upto(2) do |n|
       f = "Code#{n}"
       row[f] = row[f].to_s[/\d+(?:\.\d+)?/]
     end
   end
end
Galactica:~/Desktop$ ruby filter_csv.rb in.csv > out.csv
Galactica:~/Desktop$ cat out.csv
DATE,SERIAL,Code1,Code2
09/05/2008,1013726,550.92,49500
09/04/2008,1035331,752.51,54692

Hope that helps.

James Edward G. II
William J. (Guest)
on 2008-10-26 02:40
(Received via mailing list)
Charles L. Snyder wrote:

> Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
> A822DEA04BA7814B011074A096C687B7,2008-09-05
> 10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
> Inguinal Hernia Reducible >6 Months < 5 Years
> BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
> 10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
> ORCHIOPEXY F/INTRA-ABDL TSTIS

DATA.each_line{|line|
  ary = line.split(',').values_at(1,2,5,6).
    map{|s| s[ /^\S*/ ] }
  ary[0] = ary[0].split("-").map{|s| s.to_i}.
    values_at(1,2,0).join("/")
  puts ary.join(",")
}

__END__
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS


--- output ---
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692
Charles L. Snyder (Guest)
on 2008-10-28 02:35
(Received via mailing list)
Those work perfectly.

Thanks again for the help!

cls