Forum: Ruby alphanumeric data clean up

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Ddf11961eb0273eb31914c1232706ffe?d=identicon&s=25 Charles L. Snyder (Guest)
on 2008-10-25 18:00
(Received via mailing list)
Hi

I have a .csv file with terrible mixed text and numeric data:

Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
....hundreds more lines...

I need to extract ->
date as: mo/day/Year
Code1 as: numeric part only
Code2 as: numeric part only
eg.,

DATE,SERIAL,Code1,Code2
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692


I have tried
f = File.open("datadump.csv", "r")
v = File.open("results.txt", "w")

f.each_line do |z|
myarray=[]
myarray=z.chomp.split(",")
myarray.each do |g|
g = g.gsub(/\D/,'')
puts g
end
end

... no good
(and it doesn't address the date conversion)..

Any help appreciated!

clsnyder
B14575f0ca69f10938fdd67e7156e0e1?d=identicon&s=25 Craig Demyanovich (Guest)
on 2008-10-25 18:22
(Received via mailing list)
FasterCSV [ http://fastercsv.rubyforge.org/ ] should help you tokenize
data
from the file. Once you have that, you should be able to munge the
values as
you see fit.

Regards,
Craig
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2008-10-25 18:31
(Received via mailing list)
On Oct 25, 2008, at 10:59 AM, Charles L. Snyder wrote:

> DATE,SERIAL,Code1,Code2
> 9/5/2008,1013726,550.92,49500
> 9/4/2008,1035331,752.51,54692

How's this:

$ cat in.csv
Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS
Galactica:~/Desktop$ cat filter_csv.rb
#!/usr/bin/env ruby -wKU

require "date"

require "rubygems"
require "faster_csv"

FCSV.filter(:headers => true, :return_headers => true) do |row|
   # remove unwanted columns
   row.headers.each do |h|
     row.delete(h) unless %w[DATE SERIAL Code1 Code2].include? h
   end

   unless row.header_row?
     # convert the DATE field
     row["DATE"] = Date.strptime(row["DATE"]).strftime("%m/%d/%Y")
     # convert CodeN fields
     1.upto(2) do |n|
       f = "Code#{n}"
       row[f] = row[f].to_s[/\d+(?:\.\d+)?/]
     end
   end
end
Galactica:~/Desktop$ ruby filter_csv.rb in.csv > out.csv
Galactica:~/Desktop$ cat out.csv
DATE,SERIAL,Code1,Code2
09/05/2008,1013726,550.92,49500
09/04/2008,1035331,752.51,54692

Hope that helps.

James Edward Gray II
2ee1a7960cc761a6e92efb5000c0f2c9?d=identicon&s=25 William James (Guest)
on 2008-10-26 01:40
(Received via mailing list)
Charles L. Snyder wrote:

> Case ID,DATE,SERIAL,NAME,Gender,Code1,Code2
> A822DEA04BA7814B011074A096C687B7,2008-09-05
> 10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
> Inguinal Hernia Reducible >6 Months < 5 Years
> BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
> 10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
> ORCHIOPEXY F/INTRA-ABDL TSTIS

DATA.each_line{|line|
  ary = line.split(',').values_at(1,2,5,6).
    map{|s| s[ /^\S*/ ] }
  ary[0] = ary[0].split("-").map{|s| s.to_i}.
    values_at(1,2,0).join("/")
  puts ary.join(",")
}

__END__
A822DEA04BA7814B011074A096C687B7,2008-09-05
10:28:38.47,1013726,Smith,Male,550.92 RED INGUINAL HERNIA,49500
Inguinal Hernia Reducible >6 Months < 5 Years
BEC7E6B8B4F0C53544BA9D2524A3746B,2008-09-04
10:29:47.81,1035331,Bailey,Male,752.51 UNDESCENDED TESTIS,54692 LAPS
ORCHIOPEXY F/INTRA-ABDL TSTIS


--- output ---
9/5/2008,1013726,550.92,49500
9/4/2008,1035331,752.51,54692
Ddf11961eb0273eb31914c1232706ffe?d=identicon&s=25 Charles L. Snyder (Guest)
on 2008-10-28 01:35
(Received via mailing list)
Those work perfectly

Thanks again for the help!

cls
This topic is locked and can not be replied to.