Text processing

sdsykes · March 30, 2007, 5:38am

Hi all,

So I have about a half million records in a non-standard, non-delimited
format. The fields are fixed widths, each record is separated by a new
line.
I need to get this data into csv format so I can import it into a mysql
database for analysis.

I figure a script is the best way, but I’m stumped since the fields
aren’t
delimited.

Any tips, suggestions, pointers or a script that does something similar
would be GREATLY appreciated.

I am very new to Ruby, but I am impressed so far.

And I know there’s a way to do this, I just don’t know how it’s done.

Best,
Steve (ruby noob)

sdsykes · March 30, 2007, 7:27am

On 3/30/07, Stephen S. [email protected] wrote:

This is sloppy but you get the idea.
Try something like this. You can improve it.

arr = [“abcdefgh”,“12345678”,“867-5309”]
mydata = “”
arr.each do |x|
x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”
end

puts mydata

Harry

–

http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English

sdsykes · March 30, 2007, 7:50am

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”

mydata << /(…)(.)(…)/.match(x).captures.join(’,’) << “\n”

Gary W.

sdsykes · March 30, 2007, 9:43am

Nice. You just made my month.

Thank you Harry. Thank you Gary.

Freekin’ elegant.

Steve

sdsykes · March 30, 2007, 8:25am

On 3/30/07, Gary W. [email protected] wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”

mydata << /(…)(.)(…)/.match(x).captures.join(‘,’) << “\n”

Gary W.

‘captures’ is new to me.
Thanks.

Harry

–

http://www.kakueki.com/ruby/list.html
Japanese Ruby List Subjects in English

sdsykes · March 30, 2007, 10:29am

On 30 Mar 2007, at 06:49, Gary W. wrote:

On Mar 30, 2007, at 1:25 AM, Harry wrote:

x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4
mydata << $1 + “,” + $2 + “,” + $3 + “\n”

mydata << /(…)(.)(…)/.match(x).captures.join(’,’) << “\n”

If your fixed-width fields are more than, say, five characters wide,
your regexp would be clearer by specifying the number of occurrences
of each group rather than writing a . for each occurrence. I.e.:

mydata << /(.{3})(.)(.{4})/.match(x).captures.join(’,’) << “\n”

Regards,
Andy S.

sdsykes · March 30, 2007, 12:53pm

From: Stephen S. [mailto:[email protected]] :

Nice. You just made my month.

Thank you Harry. Thank you Gary.

Freekin’ elegant.

Steve

On 3/29/07, Harry [email protected] wrote:

> On 3/30/07, Gary W. [email protected] wrote:

> > On Mar 30, 2007, at 1:25 AM, Harry wrote:

> > > x =~ /(…)(.)(…)/ #grabs 3 characters, then 1, then 4

> > > mydata << $1 + “,” + $2 + “,” + $3 + “\n”

> >

> > mydata << /(…)(.)(…)/.match(x).captures.join(‘,’) << “\n”

ruby is fun, elegant, and powerful.
some more samples.

irb(main):030:0> x
=> “01234567890”
irb(main):031:0> x[/(…)…(…)/]
=> “012345678”
irb(main):032:0> $1
=> “01”
irb(main):033:0> $2
=> “5678”
irb(main):034:0> x[0…1]
=> “01”
irb(main):035:0> x[5…8]
=> “5678”
irb(main):036:0> field1=0…1
=> 0…1
irb(main):037:0> field2=5…8
=> 5…8
irb(main):038:0> x[field1]
=> “01”
irb(main):039:0> x[field2]
=> “5678”
irb(main):040:0>

kind regards -botp

sdsykes · March 30, 2007, 2:00pm

On Mar 29, 2007, at 10:37 PM, Stephen S. wrote:

So I have about a half million records in a non-standard, non-
delimited
format. The fields are fixed widths, each record is separated by a
new line.

Here’s another option:

xs, ys, zs = “XXXYYYYYZZ”.unpack(“A3A5A2”)
=> [“XXX”, “YYYYY”, “ZZ”]

xs
=> “XXX”

ys
=> “YYYYY”

zs
=> “ZZ”

Hope that helps.

James Edward G. II

sdsykes · March 30, 2007, 4:15pm

On 30 Mar 2007, at 12:59, James Edward G. II wrote:

Here’s another option:

xs, ys, zs = “XXXYYYYYZZ”.unpack(“A3A5A2”)
=> [“XXX”, “YYYYY”, “ZZ”]

xs
=> “XXX”

ys
=> “YYYYY”

zs
=> “ZZ”

I like the way your mind works. Nice!

Regards,
Andy S.

sdsykes · March 30, 2007, 9:32pm

Thanks James. Thanks Andy.

So I’ve cleaned up the regular expression, and I like the simplicity of
the
upack message.

But I think that the format Harry and Gary suggested clearly represents
the
pattern I’m matching.

Since the pattern may change, and/or we may get other data dumps from
this
supplier in the future, I think keeping it represented clearly in one
place
will help with maintenance.

Here’s what I wound up using:

line = “”
record = “”
oldLog = File.open(“filename.txt”)
newLog = File.new(“filename_formatted.csv”, “w”)
hdr = “field1, field2, field3,…”
newLog << hdr
arr = oldLog.readlines
arr.each do |line|
record <<
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end

LOAD DATA INFILE ‘/filename.csv’
INTO TABLE data_table
FIELDS TERMINATED BY ‘,’
LINES TERMINATED BY ‘\n’
(field1, field2, field3,…);

And now I can happily munge my monthly data dump to my boss’s hearts
content.

Thanks again guys.

I officially love Ruby.

Steve

sdsykes · March 30, 2007, 10:10pm

Sweet!

Thanks James.

This will probably help quite a bit as our user base (fingers crossed)
grows.

Steve

sdsykes · March 30, 2007, 10:40pm

Gary,

Thanks for the tip. I love the native pattern matching. I had no idea
regular expressions could be such powerful objects. And I can’t think of
a
good reason not to use Ruby around the house from now on. I’m going to
sit
down with a book tonight.

And I killed the internal array, btw. I didn’t know Ruby iterates
through
files as easily as arrays.

This is like going back to school.

Grinning,
Steve

sdsykes · March 30, 2007, 10:16pm

On Mar 30, 2007, at 3:31 PM, Stephen S. wrote:

{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end

How about making it so you don’t have to type that pattern next time:

def columns(*widths)
Regexp.new(widths.map {|count| “(.{#{count}})” }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
newLog.print pattern.match(line).captures.join(’,’)
}

sdsykes · March 30, 2007, 11:10pm

On Mar 30, 3:47 pm, James Edward G. II [email protected]
wrote:

 TESTS.times do
>> regex: 2.040000 0.000000 2.040000 ( 2.046938)

>> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

END

James Edward G. II

your LINE is 300 characters but you are only unpacking 103 ???

sdsykes · March 30, 2007, 9:48pm

On Mar 30, 2007, at 2:31 PM, Stephen S. wrote:

But I think that the format Harry and Gary suggested clearly
represents the pattern I’m matching.

Just FYI, when speed matters it may be worth using unpack:

#!/usr/bin/env ruby -w

require “benchmark”

TESTS = 100_000
LINE = “XYZ” * 100
Benchmark.bmbm do |results|
results.report(“regex:”) do
TESTS.times do
/(.{8})(.{6})(.{15})(.{3})(.{30})(.{4})(.{15})(.{1})(.{12})(.
{9})/.match(LINE).captures.join(",")
end
end
results.report(“unpack:”) do
TESTS.times do
LINE.unpack(“A8A6A15A3A30A4A15A1A12A9”).join(",")
end
end
end

>> Rehearsal -------------------------------------------

>> regex: 2.060000 0.000000 2.060000 ( 2.067150)

>> unpack: 0.760000 0.000000 0.760000 ( 0.762482)

>> ---------------------------------- total: 2.820000sec

>>

>> user system total real

>> regex: 2.040000 0.000000 2.040000 ( 2.046938)

>> unpack: 0.770000 0.000000 0.770000 ( 0.763620)

END

James Edward G. II

sdsykes · March 30, 2007, 11:33pm

On Mar 30, 2007, at 4:10 PM, bbiker wrote:

your LINE is 300 characters but you are only unpacking 103 ???

Yeah, I was too lazy to count it, so I just picked something big
enough.

James Edward G. II

sdsykes · March 30, 2007, 11:15pm

Stephen,

You could iterate on each line using the each method then use the string
as
an array as the fields are fixed length.

File.open(filename) |f|
f.each |line|
field1 = line[0…2]
field2 = line[3…4]
…
end
end

Kiran K.

sdsykes · March 7, 2009, 3:40am

Hi there,
I am new to ruby and trying to create a flat file with fixed column
length. i have a header,footer and records that i want in the file.

any ideas?

Cheers!

Gary W. wrote:

On Mar 30, 2007, at 3:31 PM, Stephen S. wrote:

{9})/.match(line).captures.join(’,’)
<< “\n”
newLogg << record
record = “”
end

How about making it so you don’t have to type that pattern next time:

def columns(*widths)
Regexp.new(widths.map {|count| “(.{#{count}})” }.join)
end
pattern = columns(6,15,3,30,4,15,1,12,9)

You also can get rid of the intermediate strings and the potentially
huge
internal array (from slurping up your oldLog into an array):

oldLog.each { |line|
newLog.print pattern.match(line).captures.join(’,’)
}

sdsykes · March 7, 2009, 5:41am

On Mar 6, 9:38 pm, Ruby Kk [email protected] wrote:

Gary W. wrote:

Regexp.new(widths.map {|count| “(.{#{count}})” }.join)

–
Posted viahttp://www.ruby-forum.com/.

sprintf allows you specify formatting that can be used to print fixed
width.
An example from the docs (ruby-doc.org) is:
sprintf(“%08b ‘%4s’”, 123, 123) #=> “01111011 ’ 123’”
Note how the %4s puts 3 characters in a space 4 characters wide.

cheers

sdsykes · March 31, 2007, 10:20am

On Sat, Mar 31, 2007 at 04:31:56AM +0900, Stephen S. wrote:

So I’ve cleaned up the regular expression, and I like the simplicity of the
upack message.

But I think that the format Harry and Gary suggested clearly represents the
pattern I’m matching.

Since the pattern may change, and/or we may get other data dumps from this
supplier in the future, I think keeping it represented clearly in one place
will help with maintenance.

Which could be a constant at the top of the source:

LINE_PATTERN = /^(…)(…)(…)/

Then later on in your code you can say:

record << LINE_PATTERN.match(line).captures.join(’,’) << “\n”

Regards,

Brian.