Scraping off a Word document?

Here’s a conceptual question. I have a Word mail merge, with a few
dozen documents. There’s a certain field, let’s call it the Employee
field, on each page. These documents are sorted in order of this
field. What I’d like to do is save off each group of pages into its
own Word document under that field. So if it’s Employee: Joe Schmoe on
the first five pages I’d want to save off just those pages and name
the file “Joe Schmoe.Doc” and so on.

The mail merge itself is pretty much hard-coded into a big group of
documents, so that’s my basis. Any suggestions about what Ruby modules
and methods I’d start out delving into? I’m thinking win32ole of
course, but have a good-sized task ahead of me I have to deliver in
relatively short order :-/

“gregarican” wrote:

and methods I’d start out delving into? I’m thinking win32ole of
course, but have a good-sized task ahead of me I have to deliver in
relatively short order :-/

I’m not sure you can do what you want to do.

You open a Word doc, then want to save a portion based on the Employee
Name to its own file? Then, after that save is finished, keep the file
open, advance the database to the next Employee Name and repeat the
save, then repeat the entire process until you get through all of the
Employee Names?

It occurs to me that Word can’t do that task because the Employee Name
field in the document is unknown until the actual time of the merge.
There is only one DOC file for any given letter, and when I do these
kinds of merges all I get to see is where the variables are that get
filled in during the merge. You have to fill the variable from the
database, then save the result, advance the database to the next record,
and fill the variable again to save that result.

You are going to create a file for each employee for each letter, and
this seems to me to defeat the whole reason to merge data into a
document. The reason I merge is because I want one file for everybody,
I specifically do not want a separate file for each person.

gregarican wrote:

and methods I’d start out delving into? I’m thinking win32ole of
course, but have a good-sized task ahead of me I have to deliver in
relatively short order :-/

Write what you need using the VBA built into Word. Intellisense will
make that rather easy.

Then either replicate your VBA calls using Ruby’s win32ole…

…or just shell directly from Ruby to your VBA!
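
As a rough illustration of the win32ole route, here is a minimal sketch, not a tested solution. It assumes you already have the Employee value for each page as a Ruby array (reading it out of the document is not shown), and that copying page ranges into new documents is good enough for your layout. The method names page_groups and split_merged_document are made up for this example, and the numeric constants stand in for Word's wdGoToPage, wdGoToAbsolute, wdStatisticPages, wdFormatDocument and wdDoNotSaveChanges. The COM part only runs on Windows with Word installed.

```ruby
# Stand-ins for Word's named constants, which are not defined in Ruby.
WD_GO_TO_PAGE = 1          # wdGoToPage
WD_GO_TO_ABSOLUTE = 1      # wdGoToAbsolute
WD_STATISTIC_PAGES = 2     # wdStatisticPages
WD_FORMAT_DOCUMENT = 0     # wdFormatDocument (classic .doc)
WD_DO_NOT_SAVE_CHANGES = 0 # wdDoNotSaveChanges

# Plain Ruby: collapse a per-page list of employee names into page ranges.
# The input is assumed to be sorted by employee, as in the question.
def page_groups(names_by_page)
  names_by_page.each_with_index
               .chunk_while { |(a, _), (b, _)| a == b }
               .map do |run|
    { name: run.first[0], first_page: run.first[1] + 1, last_page: run.last[1] + 1 }
  end
end

# Windows only: copy each employee's pages into a new document and save it.
def split_merged_document(source_path, out_dir, names_by_page)
  require 'win32ole' # loaded here so the rest of the file works elsewhere
  word = WIN32OLE.new('Word.Application')
  doc = word.Documents.Open(source_path)
  total_pages = doc.ComputeStatistics(WD_STATISTIC_PAGES)

  page_groups(names_by_page).each do |group|
    # Character offset where the group's first page starts.
    start_pos = doc.GoTo(WD_GO_TO_PAGE, WD_GO_TO_ABSOLUTE, group[:first_page]).Start
    # The group ends where the following page starts, or at the end of the document.
    end_pos = if group[:last_page] < total_pages
                doc.GoTo(WD_GO_TO_PAGE, WD_GO_TO_ABSOLUTE, group[:last_page] + 1).Start
              else
                doc.Content.End
              end

    doc.Range(start_pos, end_pos).Copy
    part = word.Documents.Add
    part.Content.Paste
    target = File.join(out_dir, "#{group[:name]}.doc").tr('/', '\\')
    part.SaveAs(target, WD_FORMAT_DOCUMENT)
    part.Close(WD_DO_NOT_SAVE_CHANGES)
  end
ensure
  doc&.Close(WD_DO_NOT_SAVE_CHANGES)
  word&.Quit
end
```

page_groups is plain Ruby and can be tried anywhere; split_merged_document would be called with the merged file, an output folder, and the per-page name list.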

On 16.01.2009 16:51, gregarican wrote:

and methods I’d start out delving into? I’m thinking win32ole of
course, but have a good-sized task ahead of me I have to deliver in
relatively short order :-/

I’d do it with VB from inside Word. An alternative might be to use
OpenOffice: read the Word document, write it out in OO’s format (XML in
a ZIP archive) and then manipulate the XML. But this sounds pretty awkward.

Can’t you force the mail merge to produce multiple documents?

Cheers

robert

Robert K. wrote:

I’d do it with VB from inside Word. An alternative might be to use
OpenOffice: read the Word document, write it out in OO’s format (XML in
a ZIP archive) and then manipulate the XML. But this sounds pretty awkward.

I suspect Word can also barf out an XML representation.

It may be awkward (get ready for the horror when you open that file!),
but it’s probably the best way. All word processing is heading towards
XML for its interoperability.

On Jan 17, 11:18 am, Phlip wrote:

Robert K. wrote:

I’d do it with VB from inside Word. An alternative might be to use
OpenOffice: read the Word document, write it out in OO’s format (XML in
a ZIP archive) and then manipulate the XML. But this sounds pretty awkward.

I suspect Word can also barf out an XML representation.

It may be awkward (get ready for the horror when you open that file!), but it’s
probably the best way. All word processing is heading towards XML for its
interoperability.

I wound up writing a C# console program to do the work. I just
referred to the ugly underbelly of all of the Word COM stuff and was
able to grab what I needed. It took a while though, since my text was
contained within text frames. So I had to work with the
Document.Shapes property and whatnot.

In searching for a solution I did run across a VBA code snippet that
would save off each document separately after the mail merge
completed. At least now I have a totally automated solution, although
it’s cobbled together from various sources. First I pull my data from
a SQL DB using Ruby, dumping that to an Excel data source. Then I have
a C# program that takes that data source, uses a Word mail merge
template and delivers the final document set. Finally, I have a Ruby
program that looks in that save directory and e-mails the documents to
the individual employees. Eventually it’d be a lot cleaner and easier
to maintain if I had all of the work done in a single program written
in a single language. But that’s another fight for another day :-)
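
For the last step of that pipeline (the Ruby program that looks in the save directory), a sketch of the matching logic might look like the following. It assumes each file is named after the employee, as in “Joe Schmoe.doc”, and that you have a hash from employee name to e-mail address. The method names documents_by_employee and mail_jobs are invented here, and actually sending the mail is left out.

```ruby
# Map "Joe Schmoe" => "/path/to/Joe Schmoe.doc" for every document in save_dir.
def documents_by_employee(save_dir, extension: '.doc')
  Dir.glob(File.join(save_dir, "*#{extension}")).sort.to_h do |path|
    [File.basename(path, extension), path]
  end
end

# Pair each document with its recipient; documents without a known
# address are skipped rather than guessed at.
def mail_jobs(save_dir, addresses)
  documents_by_employee(save_dir).filter_map do |name, path|
    address = addresses[name]
    { to: address, attachment: path } if address
  end
end
```

Each entry returned by mail_jobs could then be handed to whatever mail library the script already uses.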

On 18.01.2009 03:18, gregarican wrote:

I wound up writing a C# console program to do the work. I just
referred to the ugly underbelly of all of the Word COM stuff and was
able to grab what I needed. It took a while though, since my text was
contained within text frames. So I had to work with the
Document.Shapes property and whatnot.

I’d say that’s pretty fast. Good job!

in a single language. But that’s another fight for another day :-)

May I suggest a different approach? Since your primary step is pulling
data from a relational DB using Ruby, you could as well do this: open
the mail merge Word template, replace mail merge fields with text with
special formatting (for example “<<>>” or whatever doesn’t
collide with RTF meta sequences). Then you save this as an RTF file
(plain ASCII, so readable as text). Now you only need to read in the
mail template file from Ruby, do all the replacements, and then write
it out from Ruby again, once for each record. Sounds pretty simple IMHO.
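
A minimal sketch of that idea, under some assumptions: placeholders look like <<EMPLOYEE>> (the marker syntax and field names are made up for this example), each record is a hash keyed by field name, and the template is plain ASCII RTF. The method names rtf_escape, fill_template and write_letters are hypothetical. Values containing non-ASCII characters would need extra RTF escaping that is not shown.

```ruby
# Escape the three characters that have special meaning in RTF.
def rtf_escape(text)
  text.gsub(/[\\{}]/) { |ch| "\\#{ch}" }
end

# Replace every <<FIELD>> marker that has a value in the record;
# unknown markers are left untouched so they are easy to spot.
def fill_template(template, record)
  template.gsub(/<<(\w+)>>/) do |marker|
    key = marker[2..-3] # strip the surrounding << and >>
    record.key?(key) ? rtf_escape(record[key].to_s) : marker
  end
end

# Write one RTF file per record, named after the employee field.
# Returns the list of paths written.
def write_letters(template_path, records, out_dir, name_field: 'EMPLOYEE')
  template = File.read(template_path)
  records.map do |record|
    path = File.join(out_dir, "#{record.fetch(name_field)}.rtf")
    File.write(path, fill_template(template, record))
    path
  end
end
```

One thing to watch for: Word may split a marker across formatting runs when it saves the RTF, in which case the plain-text search will not find it. Opening the saved template in a text editor and checking that each marker appears as one unbroken string is a quick way to verify.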

Kind regards

robert

“gregarican” wrote:

I wound up writing a C# console program to do the work. I just
referred to the ugly underbelly of all of the Word COM stuff and was
able to grab what I needed. It took a while though, since my text was
contained within text frames. So I had to work with the
Document.Shapes property and whatnot.

In searching for a solution I did run across a VBA code snippet that
would save off each document separately after the mail merge
completed. At least now I have a totally automated solution, although
it’s cobbled together from various sources. First I pull my data from
a SQL DB using Ruby, dumping that to an Excel data source. Then I have
a C# program that takes that data source, uses a Word mail merge
template and delivers the final document set. Finally, I have a Ruby
program that looks in that save directory and e-mails the documents to
the individual employees. Eventually it’d be a lot cleaner and easier
to maintain if I had all of the work done in a single program written
in a single language. But that’s another fight for another day :-)

Well, if anybody can figure this out, it's you.

Robert K. wrote:

On 18.01.2009 03:18, gregarican wrote:

I wound up writing a C# console program to do the work. I just
referred to the ugly underbelly of all of the Word COM stuff and was
able to grab what I needed. It took a while though, since my text was
contained within text frames. So I had to work with the
Document.Shapes property and whatnot.

I’d say that’s pretty fast. Good job!

in a single language. But that’s another fight for another day :-)

May I suggest a different approach? Since your primary step is pulling
data from a relational DB using Ruby, you could as well do this: open
the mail merge Word template, replace mail merge fields with text with
special formatting (for example “<<>>” or whatever doesn’t
collide with RTF meta sequences). Then you save this as an RTF file
(plain ASCII, so readable as text). Now you only need to read in the
mail template file from Ruby, do all the replacements, and then write
it out from Ruby again, once for each record. Sounds pretty simple IMHO.

Kind regards

robert

FYI, a similar (though not necessarily better) solution using Find &
Replace in Word is demonstrated here:

Ruby on Windows: Find & Replace with MS Word

Greg: If you’re willing to share your C# code for automating Word, I,
for one, would like to see it. Feel free to email me, if you like.

David