Merging two Word documents with Ruby?

I’ve got a bugger of a problem and I thought I’d toss it out there to
see if anyone can provide any guidance.

I’m working on an application that needs to merge two Microsoft Word
documents. However, the application will definitely run on a Linux
server, so Word won’t be installed.

My only thought would be to use the new XML format – maybe I can find a
way to merge two documents with those files.

Has anyone else had any experience merging Word documents in Ruby (and
Rails)? Any other experience in manipulating Word documents in other
ways?

Denver M.

Several points

  • What do you mean by “Merge”?.. Word documents have structure and the
    interleaving of lines or words would appear to make little sense.

  • Unless your application and user base are new, you will have many
    files NOT in the XML format, in which case you would need to convert
    them, and would need Word installed somewhere. Perhaps you could
    reconsider your platform choice (to make the problem simpler), or if
    you have no pre-existing documents, reconsider your approach to make
    Word unnecessary? Word can read a wide variety of document types
    (including HTML), so perhaps this is another way to simplify your
    problem.

More details required…
Graham

Denver M. wrote:

My only thought would be to use the new XML format – maybe I can find a
way to merge two documents with those files.

The only way I see is to use OpenOffice. There must be a script
somewhere to run OpenOffice in batch-convert mode. That way you can
convert the .doc format to ODF. ODF is XML-based, so it should be
mergeable. Microsoft's XML-based format is not in use yet. The first
Office version that will support it is Office 12, which has not been
released yet.
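A rough sketch of the batch-convert step from Ruby. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless --convert-to`; older OpenOffice builds may need a macro or a different invocation instead. The method names `convert_command` and `convert_to_odt` are made up for illustration.

```ruby
require "open3"

# Build the argument list for a headless conversion.
# Assumption: a LibreOffice-style "soffice" binary is on the PATH.
def convert_command(input_path, out_dir, format: "odt", binary: "soffice")
  [binary, "--headless", "--convert-to", format, "--outdir", out_dir, input_path]
end

# Run the conversion and return the path where the .odt is expected to land.
def convert_to_odt(input_path, out_dir)
  log, status = Open3.capture2e(*convert_command(input_path, out_dir))
  raise "conversion failed: #{log}" unless status.success?
  File.join(out_dir, File.basename(input_path, ".*") + ".odt")
end
```

Passing the command as an array (rather than one shell string) avoids quoting problems with file names that contain spaces.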

I’m working on an application that needs to merge two Microsoft Word
documents. However, the application will definitely run on a Linux
server, so Word won’t be installed.

There are the POI Ruby bindings, although I’ve never used them myself and
have no idea how good they are.

POI Ruby Bindings

If that doesn’t work, I’d try wv or catdoc.

  • What do you mean by “Merge”?.. Word documents have structure and the
    interleaving of lines or words would appear to make little sense.

Thanks for your thoughts on this Graham. By “merge”, I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

Denver M. wrote:

Thanks for your thoughts on this Graham. By “merge”, I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

Microsoft Word has something called a master document. Maybe you could
add a master document that includes both files plus the extra headings.
This master document might be simple enough that you can actually
reverse engineer it. (Create one in Word once and just edit the parts
you need to edit with Ruby.)

On 12/21/05, Denver M. wrote:

  • What do you mean by “Merge”?.. Word documents have structure and the
    interleaving of lines or words would appear to make little sense.

Thanks for your thoughts on this Graham. By “merge”, I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

This can actually be extremely complex, because a named style (such as
‘Body’, ‘Normal’, or ‘Heading 1’) can (and will) have different
properties (fonts, colors, sizes, margins, encoding, etc) in each of
the two documents. You will need to rename every style and style
reference in the second document in order to prevent the two from
colliding.
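To make the collision problem concrete, here is a minimal sketch on plain Ruby hashes. The data shapes (styles as name => properties, paragraphs as hashes with a `:style` key) and the method names are hypothetical stand-ins for whatever a real parser would give you. Unlike the advice above, it renames only the styles that actually clash, and it ignores styles that inherit from other styles.

```ruby
# Return { old_name => new_name } for every style in `incoming` whose name
# also exists in `base` but with different properties.
def colliding_style_renames(base, incoming, suffix: "_doc2")
  incoming.each_with_object({}) do |(name, props), renames|
    next unless base.key?(name) && base[name] != props
    new_name = name + suffix
    # Keep extending the name until it is unused in both documents.
    new_name += "_" while base.key?(new_name) || incoming.key?(new_name)
    renames[name] = new_name
  end
end

# Apply the rename map to the style table and to every style reference.
def apply_renames(styles, paragraphs, renames)
  new_styles = styles.transform_keys { |n| renames.fetch(n, n) }
  new_paragraphs = paragraphs.map do |para|
    para.merge(style: renames.fetch(para[:style], para[:style]))
  end
  [new_styles, new_paragraphs]
end
```

For example, if both documents define "Normal" but with different fonts, the second document's "Normal" becomes "Normal_doc2" and its paragraphs are repointed, while an identical "Heading 1" is left alone.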

If you have a choice, don’t use the Word document format. Use RTF
instead. RTF files can be opened the same way as Word documents, but are
a lot easier to process.

Lei

If your documents are properly structured using styles (which is rare)
and they share the same styles (and I mean the same styles), you can
try to use OpenOffice in remote command mode: convert the .doc into
.odt, parse the XML of both files, merge the XML, and rebuild an .odt
file, perhaps going through OOo again to get a .doc back. But you will
need to ensure that the styles are always converted into something
reliably identifiable.
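A sketch of the "merge the XML" step, assuming both files have already been unzipped (an .odt is a zip archive whose body lives in content.xml) and that the two documents really do share their styles. It uses REXML, copies only top-level headings and paragraphs, and can prefix every heading with extra text. The method name `merge_content_xml` and the `heading_prefix` option are invented for this example; re-zipping the result into an .odt is not shown.

```ruby
require "rexml/document"

# Namespace URIs defined by the OpenDocument specification.
ODF_NS = {
  "office" => "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
  "text"   => "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
}.freeze

# Append the headings and paragraphs of the second content.xml to the first,
# then optionally put `heading_prefix` in front of every heading's text.
def merge_content_xml(first_xml, second_xml, heading_prefix: nil)
  first  = REXML::Document.new(first_xml)
  second = REXML::Document.new(second_xml)
  target = REXML::XPath.first(first, "//office:text", ODF_NS)
  source = REXML::XPath.first(second, "//office:text", ODF_NS)
  raise ArgumentError, "no office:text element found" unless target && source

  source.elements.each do |el|
    # Only text:h and text:p are copied; tables, lists etc. are skipped.
    next unless el.namespace == ODF_NS["text"] && %w[h p].include?(el.name)
    target << el.deep_clone
  end

  if heading_prefix
    REXML::XPath.each(first, "//text:h", ODF_NS) do |heading|
      heading.unshift(REXML::Text.new(heading_prefix, true))
    end
  end
  first.to_s
end
```

Anything outside the body (styles.xml, automatic styles, embedded images) still has to be reconciled separately, which is where the "same styles" condition matters.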

FAO (the UN branch for food and agriculture) uses a template system
(thus forcing a set of styles) which is used to output RTF which is
converted into XML for storage. Are your documents existing legacy ones
or is this a new setup? If you’re building it all, then you might
seriously consider using openoffice all the way.

On Dec 21, 2005, at 7:06, Denver M. wrote:

Thanks for your thoughts on this Graham. By “merge”, I meant appending
one Word document to the end of another, but to make things more
complicated, I need to add text into the headings across the entire
document.

Does it still need to be a Word document when you’re done? An entirely
different approach would be to use some kind of Word file display
program and make PDFs of the files, then chain the PDFs together. Do
the headers by slapping a white block over the existing headers and
writing a new header over them.
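For the "chain the PDFs together" part, one option is to let Ghostscript do the concatenation and drive it from Ruby. This is only a sketch: it assumes a `gs` binary on the PATH, and the method names are made up for illustration.

```ruby
require "open3"

# Build a Ghostscript command that writes all input PDFs, in order,
# into a single output PDF. Assumption: "gs" is on the PATH.
def pdf_concat_command(inputs, output, binary: "gs")
  raise ArgumentError, "need at least two PDFs" if inputs.size < 2
  [binary, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output}", *inputs]
end

# Run the concatenation and return the output path on success.
def concat_pdfs(inputs, output)
  log, status = Open3.capture2e(*pdf_concat_command(inputs, output))
  raise "gs failed: #{log}" unless status.success?
  output
end
```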

Personally, my approach would be to abandon the project as just too
messy for words. :)

OpenOffice.org can do the .doc to pdf conversion. I like your idea very
much, Dave. Maybe PostScript would be easier to fiddle with ex-post.

AbiWord can be used from the command line. See
http://www.advogato.org/person/msevior/diary.html?start=65

That might make this possible.

Hi guys,

I have a question; I hope you can help.

I need to build a utility that, when run, merges two MS Word
documents. I should then be able to print the merged document, with
options to select “remove header” and “remove footer”, and the document
should print with the header/footer removed.

Help, please.

On Dec 23, 2005, at 11:37, Daniel Calvelo wrote:

OpenOffice.org can do the .doc to pdf conversion. I like your idea very
much, Dave. Maybe PostScript would be easier to fiddle with ex-post.

Probably. If you have a program that lets you overlay one PDF page on
another, then your best bet is to output a PDF page with your header in
it. (I’d probably use TeX, or maybe script OSX’s TextEdit program, and
my copy of full Acrobat 4 for the page overlay.) The other alternative
would be to create (or have somebody create for you) an .eps with the
white box and a line of text in a program like Freehand or Illustrator.
If you pop open the .eps file in a text editor, you’ll find it not too
difficult to programmatically replace the text, although you won’t
easily be able to duplicate the kerning and other textual adjustments.
Have OpenOffice print to a postscript file, then figure out what you
can use as a page marker in order to embed the .eps in that file on
each page so that it comes after (and thus covers) the original
headers, if any. Then feed the modified .ps file into a PDF distiller.
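A minimal sketch of the overlay idea, under a big assumption: that the PostScript file has a literal `showpage` on its own line at the end of every page. Many drivers wrap or redefine `showpage`, so the real page marker may be something else, and the coordinates below are illustrative guesses for a US Letter page (612 x 792 points). The method names are hypothetical.

```ruby
# PostScript fragment: paint a white box over the header area, then write
# the replacement text on top of it. Parentheses and backslashes in the
# text are escaped so the PostScript string stays valid.
def header_overlay(text)
  escaped = text.gsub(/([\\()])/) { "\\#{$1}" }
  <<~PS
    gsave
    1 setgray 0 742 612 50 rectfill
    0 setgray /Helvetica findfont 10 scalefont setfont
    72 760 moveto (#{escaped}) show
    grestore
  PS
end

# Insert the overlay just before every line that is exactly "showpage",
# so it is drawn after (and therefore on top of) the page content.
def overlay_each_page(postscript, text)
  overlay = header_overlay(text)
  postscript.lines.map { |line|
    line.strip == "showpage" ? overlay + line : line
  }.join
end
```

The modified text can then be written back out and fed to a distiller as described.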

That’s what I’d try, I think.
