Text processing library


#1

Hi,

So I’ve been digging into Ruby for the past week, and I’ve come across
an interesting problem that I want to solve with my new friend. Only I
don’t want to reinvent the wheel.

So here’s the problem: I have a CSV file that I need to munge into a
batch file for a mainframe to process. This file has many (say, 30 or
more) fixed-width fields per record with distinct rules attached to each
field (e.g., field1 is an eight-position date following the pattern
YYYYMMDD, field2 is a five-position enumerated customer type, field3 may
contain either a 70 or a 71 depending on the customer type, etc.). Say,
~10K records per batch file (plus header, subheaders, footers, and
possibly addendums), and the batch file has to get rebuilt nightly, plus
a great big monthly summary file composed of all the nightly files
strung together. I want to go from CSV to batch file automagically, and
just writing a one-off for this one file format seems like a total waste
of time.

There are a bizillion of these formats out there, and I have to imagine
spooning CSV files into yet another format is pretty common. I’m
wondering if Ruby already has a batch file library, something like
text-format.rb, only more useful?

If not, then I’m going to write one… but right now I’m also likely to
write something that looks way more like C than Ruby.

Any tips, pointers, hints, and code that might exemplify the Ruby way to
go about it would be much appreciated. Thanks in particular to any Ruby
local out there who might take a sec to point a tourist in the right
direction.

TIA,
Steve


#2

From: Stephen S. [mailto:removed_email_address@domain.invalid]:

go from csv to batch file automagically, and just writing a one-off for
this one file format seems like a total waste of time.

There are a bizillion of these formats out there, and I have to imagine
spooning csv files into yet another format is pretty common. I’m
wondering if Ruby already has a batch file library, something like
text-format.rb only more useful?

Just in case you’d want to start anew, try getting some ideas from:

  1. http://www.devsource.com/article2/0,1895,1928561,00.asp
  2. http://fastercsv.rubyforge.org/

kind regards -botp


#3

On Apr 17, 2007, at 1:40 AM, Stephen S. wrote:

71 depending on the customer type, etc).

I’m a little confused by your description of the file. You call it a
CSV file and say it has fixed-width fields, but those are two
different things.

Either way though, Ruby has the tools you need.

For CSV data, see the standard “csv” library. For splitting up fixed
width fields, a call to String#unpack will do.
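
A minimal sketch of the String#unpack suggestion, for readers who have not used it. The 18-character record layout below (8-character date, 5-character customer type, 2-character code, 3 characters of filler) is made up for illustration and is not taken from the original poster's file.

```ruby
# Hypothetical fixed-width record: date (8), customer type (5),
# code (2), filler (3). The layout is an assumption for this example.
line = "20070417RETAL70   "

# "A8" reads 8 bytes as a string and strips trailing spaces and nulls;
# "a3" reads 3 bytes and keeps them exactly as they are.
date, customer_type, code, filler = line.unpack("A8A5A2a3")

puts date            # 20070417
puts customer_type   # RETAL
puts code            # 70
puts filler.inspect  # "   "
```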

Hope that helps.

James Edward G. II


#4

Thanks for asking for clarification.

I frequently have to turn CSV files into files that follow different
fixed-width formats. So it’s not that I need to unpack my data. Rather,
I need a way to take CSV’d data, and a separate file format or even
specification, and “pack” the CSV data into the format. For example,
right now I need to generate a nightly batch file with headers and
footers and arbitrarily complicated field specifications from a largish
CSV file.

And the formats that I have to satisfy are often complex, so doing
one-off conversion scripts can get to be a pain to maintain very
quickly.

Your FasterCSV and String#unpack seem like a great place to start. At a
minimum, I need to be able to attach rules to each field like the number
of positions (as well as packing character, e.g., whitespace or zeroes,
and maybe formulas based on other field values) in my format string. Is
there an equivalent of String#pack out there?

Steve


#5

@Botp, That’s right up my alley… Here’s where I’m headed. Since CSV file
structure is a bit like vanilla ice cream, I’d like to provide a file
structure with more complexity or “flavor” as an input, along with some
CSV’d data. That way I can read in the structure and the data separately
and get my data out in the format I want. My inspiration is deBabelizer,
but also now simply some kind of String#pack.


#6

On Apr 17, 2007, at 10:49 AM, Stephen S. wrote:

I frequently have to turn CSV files into files that follow different
fixed-width formats.

Ah, I understand now. Your project is to translate CSV to fixed-width.
Got it.

Obviously I’m biased, but FasterCSV should give you rich handling on
the reading end, I think. That part should be pretty well covered.

Where you are likely to spend the effort is in the fixed-width
writing. There is an Array#pack, as others have pointed out, but you
sound like you’re after something higher level than that. You want
it to catch the dates and pack them as YYYYMMDD strings for you and
the like.
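
One way the "higher level" layer might look, as a rough sketch only. It uses the standard csv library (FasterCSV became the standard csv library in Ruby 1.9) and a per-field converter that turns the CSV text into exactly the text the fixed-width file expects. The column names, widths, and the `fixed_width_line` helper are all hypothetical, invented for this example.

```ruby
require "csv"
require "date"

# Hypothetical layout: CSV column name, output width, and a converter.
# None of these names or widths come from a real file format.
fields = [
  { name: "date",   width: 8, convert: ->(v) { Date.parse(v).strftime("%Y%m%d") } },
  { name: "type",   width: 5, convert: ->(v) { v.upcase.ljust(5) } },
  { name: "amount", width: 7, convert: ->(v) { v.rjust(7, "0") } },
]

# Convert one row and refuse to emit a field of the wrong width.
def fixed_width_line(row, fields)
  fields.map { |field|
    text = field[:convert].call(row[field[:name]])
    unless text.length == field[:width]
      raise ArgumentError,
            "#{field[:name]} is #{text.length} wide, expected #{field[:width]}"
    end
    text
  }.join
end

csv_data = <<~ROWS
  date,type,amount
  2007-04-17,retal,1250
  2007-04-18,whole,87
ROWS

lines = CSV.parse(csv_data, headers: true).map { |row| fixed_width_line(row, fields) }
puts lines
# 20070417RETAL0001250
# 20070418WHOLE0000087
```

Keeping the width check separate from the converters means a value that is too long fails loudly instead of silently shifting every later field in the record.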

If you come up with a good solution it may be worth generalizing and
sharing, in my opinion.

James Edward G. II


#7

On 4/17/07, Stephen S. removed_email_address@domain.invalid wrote:
[snip]

Your FasterCSV and String#unpack seem like a great place to start. At a
minimum, I need to be able to attach rules to each field like the number of
positions (as well as packing character, eg, whitespace or zeroes, and maybe
formulas based on other field values) in my format string. Is there an
equivalent of String#pack out there?

Steve

Array#pack

So unpack goes from String to Array, and pack goes from Array to String.
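
A small illustration of that round trip, with made-up field values. Note that the "A" directive pads with spaces only, so zero-filled fields would still need something like String#rjust before packing.

```ruby
# Example values are invented for illustration.
fields = ["20070417", "RET", "70"]

# Each "A<n>" writes a string into n bytes, padding with spaces
# (and truncating anything longer than n).
record = fields.pack("A8A5A2")
puts record.inspect                   # "20070417RET  70"

# unpack with the same template reverses it; "A" strips trailing spaces.
puts record.unpack("A8A5A2").inspect  # ["20070417", "RET", "70"]
```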

-A


#8

Hi,

Since you are describing a lot of what I do (but not in Ruby), I thought
I might point you here for some ideas: http://www.unidex.com/overview.htm

Essentially, that link describes a flat-file parser that reads an XML
definition of the layout. You might want to use that and model your own
schema to accomplish something similar in Ruby.
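
A rough sketch of the general idea in Ruby, reading a layout from XML with REXML and using it to write one record. The XML vocabulary here (`layout`, `field`, `width`, `pad`, `align`) is invented for this example and is not the XFlat schema; `format_record` is likewise a hypothetical helper.

```ruby
require "rexml/document"

# Hypothetical layout definition; this is NOT the XFlat format.
layout_xml = <<~XML
  <layout>
    <field name="date" width="8"/>
    <field name="type" width="5"/>
    <field name="amount" width="7" pad="0" align="right"/>
  </layout>
XML

doc = REXML::Document.new(layout_xml)

# Turn each <field> element into a plain hash, defaulting to
# space padding and left alignment when the attributes are absent.
layout = REXML::XPath.match(doc, "//field").map { |el|
  {
    name:  el.attributes["name"],
    width: el.attributes["width"].to_i,
    pad:   el.attributes["pad"] || " ",
    align: el.attributes["align"] || "left",
  }
}

# Pad (or truncate) each value to its declared width and join the fields.
def format_record(row, layout)
  layout.map { |f|
    text = row.fetch(f[:name]).to_s[0, f[:width]]
    f[:align] == "right" ? text.rjust(f[:width], f[:pad]) : text.ljust(f[:width], f[:pad])
  }.join
end

row = { "date" => "20070417", "type" => "RET", "amount" => "1250" }
puts format_record(row, layout)
# 20070417RET  0001250
```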

Best Regards,
Dan

On 04/17/2007 11:12 AM, "Stephen S." wrote (Subject: Re: text
processing library):

@Botp, That’s right up my alley… Here’s where I’m headed. Since CSV file
structure is a bit like vanilla ice cream, I’d like to provide a file
structure with more complexity or “flavor” as an input, along with some
CSV’d data. That way I can read in the structure and the data separately
and get my data out in the format I want. My inspiration is deBabelizer,
but also now simply some kind of String#pack.

This message and any attachments contain information from Union Pacific
which may be confidential and/or privileged.
If you are not the intended recipient, be aware that any disclosure,
copying, distribution or use of the contents of this message is strictly
prohibited by law. If you receive this message in error, please contact
the sender immediately and delete the message and any attachments.


#9

@James, Right on. I think you’re correct - FasterCSV should handle the
CSV reading no problem. I think my first edition might handle writing to
a flat file based on an XML schema. I’m going to try to think about
generalization from the beginning, and see if that keeps my code easier
to maintain and use. When you wrote FasterCSV, how did you bake in the
formatting rules? Did you write an XML schema or something similar based
on the CSV RFC?

@Dan, Many thanks! XFlat seems like exactly what I need. Only I can’t
find the DTD anywhere… is it proprietary?

Now you’ve both got me thinking about a text conversion Swiss Army
knife: so long as a conversion library has access to input and output
schemas, there’s no reason why our well-formed data can’t enjoy total
freedom. Which I probably can’t convince anyone to pay me to write, but
I might anyway. Hm… In any event, my little csv-to-flat_file_format_X
tool is a good way to introduce myself to text processing in Ruby. (And
I’m hoping not that tough…) :)

This seems like a much stronger idea now. Thanks guys.

4fires


#10

On Apr 17, 2007, at 9:31 PM, Stephen S. wrote:

Did you write an XML schema or something similar based on the CSV RFC?

FasterCSV was born from a discussion on Ruby Core about how we might
speed up the CSV library. I provided some information out of the
book Mastering Regular Expressions, which claimed to have a single
expression for parsing the format.

Some edge cases that the expression didn’t handle were raised, I
fixed those, and that’s pretty much FasterCSV’s parser to this day.
It’s not too glamorous I guess, but I like how it shows what we can
do when we work together.

James Edward G. II