Hi all,
So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.
I figure a script is the best way, but I’m stumped since the fields
aren’t
delimited.
Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.
I am very new to Ruby, but I am impressed so far.
And I know there’s a way to do this, I just don’t know how it’s done.
Best,
Steve (ruby noob)
On 3/30/07, Stephen S. [email protected] wrote:
This is sloppy but you get the idea.
Try something like this. You can improve it.
arr = [“abcdefgh”,“12345678”,“867-5309”]
mydata = “”
arr.each do |x|
x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”
end
puts mydata
Harry
–
http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English
On Mar 30, 2007, at 1:25 AM, Harry wrote:
x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”
mydata << /(…)(.)(…)/.match(x).captures.join(’,’) << “\n”
Gary W.
Nice. You just made my month.
Thank you Harry. Thank you Gary.
Freekin’ elegant.
Steve
On 3/30/07, Gary W. [email protected] wrote:
On Mar 30, 2007, at 1:25 AM, Harry wrote:
x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”
mydata << /(…)(.)(…)/.match(x).captures.join(‘,’) << “\n”
Gary W.
‘captures’ is new to me.
Thanks.
Harry
–
http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English
On 30 Mar 2007, at 06:49, Gary W. wrote:
On Mar 30, 2007, at 1:25 AM, Harry wrote:
x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”
mydata << /(…)(.)(…)/.match(x).captures.join(’,’) << “\n”
If your fixed-width fields are more than, say, five characters wide,
your regexp would be clearer by specifying the number of occurrences
of each group rather than writing a . for each occurrence. I.e.:
mydata << /(.{3})(.)(.{4})/.match(x).captures.join(’,’) << “\n”
Regards,
Andy S.
From: Stephen S. [mailto:[email protected]] :
Nice. You just made my month.
Thank you Harry. Thank you Gary.
Freekin’ elegant.
Steve
> > On Mar 30, 2007, at 1:25 AM, Harry wrote:
> > > x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
> > > mydata << $1 + “,” + $2 + “,” + $3 + “\n”
> >
> > mydata << /(…)(.)(…)/.match(x).captures.join(‘,’) << “\n”
ruby is fun, elegant, and powerful.
some more samples.
irb(main):030:0> x
=> “01234567890”
irb(main):031:0> x[/(…)…(…)/]
=> “012345678”
irb(main):032:0> $1
=> “01”
irb(main):033:0> $2
=> “5678”
irb(main):034:0> x[0…1]
=> “01”
irb(main):035:0> x[5…8]
=> “5678”
irb(main):036:0> field1=0…1
=> 0…1
irb(main):037:0> field2=5…8
=> 5…8
irb(main):038:0> x[field1]
=> “01”
irb(main):039:0> x[field2]
=> “5678”
irb(main):040:0>
kind regards -botp
On Mar 29, 2007, at 10:37 PM, Stephen S. wrote:
So I have about a half million records in a non-standard, non-
delimited
format. The fields are fixed widths, each record is separated by a
new line.
Here’s another option:
xs, ys, zs = “XXXYYYYYZZ”.unpack(“A3A5A2”)
=> [“XXX”, “YYYYY”, “ZZ”]
xs
=> “XXX”
ys
=> “YYYYY”
zs
=> “ZZ”
Hope that helps.
James Edward G. II
On 30 Mar 2007, at 12:59, James Edward G. II wrote:
Here’s another option:
xs, ys, zs = “XXXYYYYYZZ”.unpack(“A3A5A2”)
=> [“XXX”, “YYYYY”, “ZZ”]
xs
=> “XXX”
ys
=> “YYYYY”
zs
=> “ZZ”
I like the way your mind works. Nice!
Regards,
Andy S.
Thanks James. Thanks Andy.
So I’ve cleaned up the regular expression, and I like the simplicity of
the
upack message.
But I think that the format Harry and Gary suggested clearly represents
the
pattern I’m matching.
Since the pattern may change, and/or we may get other data dumps from
this
supplier in the future, I think keeping it represented clearly in one
place
will help with maintenance.
Here’s what I wound up using:
line = “”
record = “”
oldLog = File.open(“filename.txt”)
newLog = File.new(“filename_formatted.csv”, “w”)
hdr = “field1, field2, field3,…”
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end
LOAD DATA INFILE ‘/filename.csv’
INTO TABLE data_table
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
(field1, field2, field3,…);
And now I can happily munge my monthly data dump to my boss’s hearts
content.
Thanks again guys.
I officially love Ruby.
Steve
Sweet!
Thanks James.
This will probably help quite a bit as our user base (fingers crossed)
grows. 
Steve
Gary,
Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can’t think of
a
good reason not to use Ruby around the house from now on. I’m going to
sit
down with a book tonight.
And I killed the internal array, btw. I didn’t know Ruby iterates
through
files as easily as arrays.
This is like going back to school.
Grinning,
Steve
On Mar 30, 2007, at 3:31 PM, Stephen S. wrote:
{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end
How about making it so you don’t have to type that pattern next time:
def columns(*widths)
Regexp.new(widths.map {|count| “(.{#{count}})” }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)
You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):
oldLog.each { |line|
newLog.print pattern.match(line).captures.join(’,’)
}
On Mar 30, 3:47 pm, James Edward G. II [email protected]
wrote:
TESTS.times do
>> regex: 2.040000 0.000000 2.040000 ( 2.046938)
>> unpack: 0.770000 0.000000 0.770000 ( 0.763620)
END
James Edward G. II
your LINE is 300 characters but you are only unpacking 103 ???
On Mar 30, 2007, at 2:31 PM, Stephen S. wrote:
But I think that the format Harry and Gary suggested clearly
represents the pattern I’m matching.
Just FYI, when speed matters it may be worth using unpack:
#!/usr/bin/env ruby -w
require “benchmark”
TESTS = 100_000
LINE = “XYZ” * 100
Benchmark.bmbm do |results|
results.report(“regex:”) do
TESTS.times do
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
end
end
results.report(“unpack:”) do
TESTS.times do
LINE.unpack(“A8A6A15A3A30A4A15A1A12A9”).join(",")
end
end
end
>> Rehearsal -------------------------------------------
>> regex: 2.060000 0.000000 2.060000 ( 2.067150)
>> unpack: 0.760000 0.000000 0.760000 ( 0.762482)
>> ---------------------------------- total: 2.820000sec
>>
>> user system total real
>> regex: 2.040000 0.000000 2.040000 ( 2.046938)
>> unpack: 0.770000 0.000000 0.770000 ( 0.763620)
END
James Edward G. II
On Mar 30, 2007, at 4:10 PM, bbiker wrote:
your LINE is 300 characters but you are only unpacking 103 ???
Yeah, I was too lazy to count it, so I just picked something big
enough. 
James Edward G. II
Stephen,
You could iterate on each line using the each method then use the string
as
an array as the fields are fixed length.
File.open(filename) |f|
f.each |line|
field1 = line[0…2]
field2 = line[3…4]
…
end
end
Kiran K.
Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.
any ideas?
Cheers!
Gary W. wrote:
On Mar 30, 2007, at 3:31 PM, Stephen S. wrote:
{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end
How about making it so you don’t have to type that pattern next time:
def columns(*widths)
Regexp.new(widths.map {|count| “(.{#{count}})” }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)
You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):
oldLog.each { |line|
newLog.print pattern.match(line).captures.join(’,’)
}
On Mar 6, 9:38 pm, Ruby Kk [email protected] wrote:
Gary W. wrote:
Regexp.new(widths.map {|count| “(.{#{count}})” }.join)
–
Posted viahttp://www.ruby-forum.com/.
sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf(“%08b ‘%4s’”, 123, 123) #=> “01111011 ’ 123’”
Note how the %4s puts 3 characters in a space 4 characters wide.
cheers
On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen S. wrote:
So I’ve cleaned up the regular expression, and I like the simplicity of the
upack message.
But I think that the format Harry and Gary suggested clearly represents the
pattern I’m matching.
Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.
Which could be a constant at the top of the source:
LINE_PATTERN = /^(…)(…)(…)/
Then later on in your code you can say:
record << LINE_PATTERN.match(line).captures.join(’,’) << “\n”
Regards,
Brian.