Forum: Ruby text processing library

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- and Ruby-related community platforms.
Stephen S. (Guest)
on 2007-04-17 10:41
(Received via mailing list)
Hi,

So I've been digging into Ruby for the past week, and I've come across an
interesting problem that I want to solve with my new friend. Only I don't
want to reinvent the wheel.

So here's the problem: I have a CSV file that I need to munge into a batch
file for a mainframe to process. This file has many (say, 30 or more)
fixed-width fields per record with distinct rules attached to each field
(e.g., field1 is an eight-position date following the pattern YYYYMMDD,
field2 is a five-position enumerated customer type, field3 may contain
either a 70 or a 71 depending on the customer type, etc.). Say, ~10K records
per batch file (plus header, subheaders, footers, and possibly addenda), and
the batch file has to get rebuilt nightly, plus a great big monthly summary
file composed of all the nightly files strung together. I want to go from
CSV to batch file automagically, and just writing a one-off for this one
file format seems like a total waste of time.

There are a bizillion of these formats out there, and I have to imagine
spooning CSV files into yet another format is pretty common. I'm wondering
if Ruby already has a batch file library, something like text-format.rb,
only more useful?

If not, then I'm going to write one... but right now I'm also likely to
write something that looks way more like C than Ruby.
Any tips, pointers, hints, and other code that might exemplify the Ruby way
to go about it would be much appreciated. Thanks in particular to any Ruby
local out there who might take a sec to point a tourist in the right
direction.

TIA,
Steve
Peña, Botp (Guest)
on 2007-04-17 11:06
(Received via mailing list)
From: Stephen S. [mailto:removed_email_address@domain.invalid] :
# go from csv to batch file automagically, and just writing a one-off
# for this one file format seems like a total waste of time.
# There are a bizillion of these formats out there, and I have to
# imagine spooning csv files into yet another format is pretty common.
# I'm wondering if Ruby already has a batch file library, something
# like text-format.rb only more useful?

just in case you'd want to start anew, try getting some ideas from
  1. http://www.devsource.com/article2/0,1895,1928561,00.asp
  2. http://fastercsv.rubyforge.org/

kind regards -botp
James G. (Guest)
on 2007-04-17 16:45
(Received via mailing list)
On Apr 17, 2007, at 1:40 AM, Stephen S. wrote:

> 71 depending on the customer type, etc).
I'm a little confused by your description of the file.  You call it a
CSV file and say it has fixed-width fields, but those are two
different things.

Either way though, Ruby has the tools you need.

For CSV data, see the standard "csv" library.  For splitting up fixed
width fields, a call to String#unpack will do.
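
As a minimal sketch of both suggestions: the three-field layout below (an
8-character date, a 5-character customer type, a 2-character code) and the
sample values are assumptions made up for illustration, not a real format.

```ruby
require "csv"

# Hypothetical sample data; the field layout is assumed for illustration.
csv_text = "20070417,RETAL,70\n20070418,WHOLE,71\n"

# CSV.parse returns an array of rows, each row an array of strings.
rows = CSV.parse(csv_text)

# String#unpack with "A8A5A2" splits a fixed-width record into an
# 8-, a 5-, and a 2-character field, stripping trailing spaces.
record = "20070417RETAL70"
fields = record.unpack("A8A5A2")
# => ["20070417", "RETAL", "70"]
```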

Hope that helps.

James Edward G. II
Stephen S. (Guest)
on 2007-04-17 19:50
(Received via mailing list)
Thanks for asking for clarification.

I frequently have to turn CSV files into files that follow different
fixed-width formats. So it's not that I need to unpack my data. Rather, I
need a way to take CSV'd data, and a separate file format or even
specification, and "pack" the CSV data into the format. For example, right
now I need to generate a nightly batch file with headers and footers and
arbitrarily complicated field specifications from a largish CSV file.
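
To make the shape of such a batch file concrete, here is a rough sketch. The
record layouts (the "H", "D", and "T" type codes, the field widths, and the
six-digit record count in the trailer) are hypothetical, as is the
build_batch helper; a real mainframe format will define its own.

```ruby
# Assemble a batch file as header + detail records + trailer.
# All layouts here are invented for illustration.
def build_batch(rows, run_date)
  lines = []
  lines << "H" + run_date                      # header: type code + YYYYMMDD
  rows.each do |date, type, code|
    # String#ljust pads to the width but does not truncate longer values.
    lines << "D" + date.ljust(8) + type.ljust(5) + code.ljust(2)
  end
  lines << "T" + rows.size.to_s.rjust(6, "0")  # trailer: zero-padded count
  lines.join("\n") + "\n"
end

rows = [["20070417", "RETAL", "70"], ["20070418", "WHOLE", "71"]]
batch = build_batch(rows, "20070417")
```

The rows here are plain arrays of strings, which is what a CSV reader hands
back for each line of input.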

And the formats that I have to satisfy are often complex, so doing one-off
conversion scripts can get to be a pain to maintain very quickly.

Your FasterCSV and String#unpack seem like a great place to start. At a
minimum, I need to be able to attach rules to each field like the number of
positions (as well as packing character, e.g., whitespace or zeroes, and
maybe formulas based on other field values) in my format string. Is there an
equivalent of String#pack out there?
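
One way the per-field rules could be written down as data is sketched below.
Everything in it is an assumption for illustration: the FIELD_RULES table,
the field names, the widths, and the formula that derives the code from the
customer type.

```ruby
# Each rule names a source column, a width, a pad character, an
# alignment, and optionally a formula computed from the whole row.
# All names and rules here are hypothetical.
FIELD_RULES = [
  { name: "date",   width: 8, pad: " ", align: :left },
  { name: "type",   width: 5, pad: " ", align: :left },
  { name: "amount", width: 7, pad: "0", align: :right },
  { name: "code",   width: 2, pad: " ", align: :left,
    formula: ->(row) { row["type"] == "RETAL" ? "70" : "71" } }
].freeze

# Format one field, refusing values that would overflow their positions.
def format_field(rule, row)
  value = (rule[:formula] ? rule[:formula].call(row) : row[rule[:name]]).to_s
  if value.length > rule[:width]
    raise ArgumentError, "#{rule[:name]} exceeds #{rule[:width]} positions"
  end
  if rule[:align] == :right
    value.rjust(rule[:width], rule[:pad])
  else
    value.ljust(rule[:width], rule[:pad])
  end
end

# Format a whole record by applying every rule in order.
def format_record(row)
  FIELD_RULES.map { |rule| format_field(rule, row) }.join
end

row = { "date" => "20070417", "type" => "RETAL", "amount" => "1250" }
line = format_record(row)
# => "20070417RETAL000125070"
```

Keeping the rules in a table rather than in code means a new output format
is a new table, not a new script.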

Steve
Stephen S. (Guest)
on 2007-04-17 19:59
(Received via mailing list)
@Botp, that's right up my alley... Here's where I'm headed. Since CSV file
structure is a bit like vanilla ice cream, I'd like to provide a file
structure with more complexity or "flavor" as an input, along with some
CSV'd data. That way I can read in the structure and the data separately and
get my data out in the format I want. My inspiration is deBabelizer, but
also now simply some kind of String#pack.
Alex LeDonne (Guest)
on 2007-04-17 20:05
(Received via mailing list)
On 4/17/07, Stephen S. <removed_email_address@domain.invalid> wrote:
[snip]
> Your FasterCSV and String#unpack seem like a great place to start.  At a
> minimum, I need to be able to attach rules to each field like the number of
> positions (as well as packing character, eg, whitespace or zeroes, and maybe
> formulas based on other field values) in my format string. Is there an
> equivalent of String#pack out there?
>
> Steve
>

Array#pack

So unpack goes from String to Array, and pack goes from Array to String.
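
A short round trip, as a sketch; the "A8A5A2" layout and the sample values
are assumptions chosen for illustration.

```ruby
# The "A" directive pads with spaces (and truncates) to the given width.
fields = ["20070417", "RET", "70"]

record = fields.pack("A8A5A2")    # Array -> String
# => "20070417RET  70"

back = record.unpack("A8A5A2")    # String -> Array, trailing spaces stripped
# => ["20070417", "RET", "70"]
```

Note that "A" silently truncates a value that is longer than its width, so
any length validation has to happen before packing.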

-A
James G. (Guest)
on 2007-04-17 20:55
(Received via mailing list)
On Apr 17, 2007, at 10:49 AM, Stephen S. wrote:

> I frequently have to turn CSV files into files that follow different
> fixed-width formats.

Ah, I understand now.  Your project is to translate CSV to fixed-
width.  Got it.

Obviously I'm biased, but FasterCSV should give you rich handling on
the reading end, I think.  That part should be pretty covered.

Where you are likely to spend the effort is in the fixed-width
writing.  There is an Array#pack, as others have pointed out, but you
sound like you're after something higher level than that.  You want
it to catch the dates and pack them as YYYYMMDD strings for you and
the like.
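
As a rough sketch of that higher-level step for dates: the pack_date helper
below is hypothetical, and the "%m/%d/%Y" input format is only an assumption
about how dates might appear in the CSV data.

```ruby
require "date"

# Accept a Date, or a date string in a known input format, and return
# the eight-position YYYYMMDD form. Date.strptime raises on input that
# does not match the format or is not a real calendar date.
def pack_date(value, input_format = "%m/%d/%Y")
  date = value.is_a?(Date) ? value : Date.strptime(value.to_s, input_format)
  date.strftime("%Y%m%d")
end

pack_date("04/17/2007")           # => "20070417"
pack_date(Date.new(2007, 4, 18))  # => "20070418"
```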

If you come up with a good solution it may be worth generalizing and
sharing, in my opinion.

James Edward G. II
unknown (Guest)
on 2007-04-17 21:24
(Received via mailing list)
Hi,

Since you are describing a lot of what I do (but not in Ruby), I thought I
might point you here for some ideas: http://www.unidex.com/overview.htm

Essentially, that link describes a flat-file parser that reads an XML
definition of the layout.  You might want to use that and model your own
schema to accomplish something similar in Ruby.

Best Regards,
Dan
Stephen S. (Guest)
on 2007-04-18 06:31
(Received via mailing list)
@James, right on. I think you're correct - FasterCSV should handle the CSV
reading no problem. I think my first edition might handle writing to a flat
file based on an XML schema. I'm going to try to think about generalization
from the beginning, and see if that keeps my code easier to maintain and
use. When you wrote FasterCSV, how did you bake in the formatting rules? Did
you write an XML schema or something similar based on the CSV RFC?

@Dan, many thanks! XFlat seems like exactly what I need. Only I can't find
the DTD anywhere... is it proprietary?

Now you've both got me thinking about a text conversion Swiss Army knife...
so long as a conversion library has access to input and output schemas,
there's no reason why our well-formed data can't enjoy total freedom. Which
I probably can't convince anyone to pay me to write, but I might anyway.
Hm... In any event, my little csv-to-flat_file_format_X tool is a good way
to introduce myself to text processing in Ruby. (And I'm hoping not that
tough...) :-)

This seems like a much stronger idea now. Thanks guys.

4fires
James G. (Guest)
on 2007-04-18 07:08
(Received via mailing list)
On Apr 17, 2007, at 9:31 PM, Stephen S. wrote:

> you write an XML schema or something similar based on the CSV RFC?
FasterCSV was born from a discussion on Ruby Core about how we might
speed up the CSV library.  I provided some information out of the
book Mastering Regular Expressions, which claimed to have a single
expression for parsing the format.

Some edge cases that the expression didn't handle were raised, I
fixed those, and that's pretty much FasterCSV's parser to this day.
It's not too glamorous I guess, but I like how it shows what we can
do when we work together.

James Edward G. II