How to Parse a Microsoft Word Document

Hi People,

I just joined the group and I have a question.
I’m still learning Ruby on Rails, and I now have a task to parse
Microsoft Word documents and store their content in a database.

Do you have any suggestions on how to do it?

FYI, I’m developing in a Unix environment, so I can’t use win32ole
there (correct me if I’m wrong).

I have also searched the internet about this, but all I found is that I
need to use either JRuby combined with Apache POI, or win32ole. As far
as I know, to use JRuby I would have to create the Rails project with
JRuby as well, but unfortunately I already created the project with
plain Ruby.

So, I don’t know what else to try. Does anybody have a clue?

Regards,

Hafiz Badrie Lubis

On Mar 16, 2011, at 2:51 PM, Hafiz Badrie Lubis wrote:

I have also searched the internet about this, but all I found is that I
need to use either JRuby combined with Apache POI, or win32ole. As far
as I know, to use JRuby I would have to create the Rails project with
JRuby as well, but unfortunately I already created the project with
plain Ruby.

So, I don’t know what else to try. Does anybody have a clue?

I did a project in PHP quite a few years ago, and I used some
venerable Unix command-line converters to do this. I stored the files
as is, then used these converters to rip out their text and stored that
in the database for searching. They aren’t perfect, but they do a good
enough job for search results.

$translators = array(
    'pdf' => '/usr/local/bin/pdftotext ./pdf/%s.pdf -',
    'ppt' => '/usr/local/bin/catppt -d ascii ./ppt/%s.ppt',
    'xls' => '/usr/local/bin/xls2csv -d ascii ./xls/%s.xls',
    'doc' => '/usr/local/bin/catdoc -d ascii ./doc/%s.doc'
);
// These translators all pipe to stdout, which means that shell_exec
// will return their text value.
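
For a Rails app, a rough Ruby equivalent of the table above might look like the sketch below. It assumes the same converter binaries and install paths as the PHP version, which may well differ on your system. The names converter_command and extract_text are made up for illustration, the :path placeholder plays the role of PHP's %s, and the function takes a full file path instead of the per-type directories used in the PHP table.

```ruby
require "open3"

# Converter commands keyed by file extension, copied from the PHP table.
# Each entry is an argv array; :path is replaced with the file to convert.
# The binary locations are assumptions and may differ on your machine.
TRANSLATORS = {
  ".pdf" => ["/usr/local/bin/pdftotext", :path, "-"],
  ".ppt" => ["/usr/local/bin/catppt", "-d", "ascii", :path],
  ".xls" => ["/usr/local/bin/xls2csv", "-d", "ascii", :path],
  ".doc" => ["/usr/local/bin/catdoc", "-d", "ascii", :path]
}.freeze

# Build the argv array for the converter that matches the file's extension.
def converter_command(path, translators = TRANSLATORS)
  template = translators.fetch(File.extname(path).downcase) do
    raise ArgumentError, "no converter configured for #{path}"
  end
  template.map { |arg| arg == :path ? path : arg }
end

# Run the converter and return everything it wrote to stdout.
def extract_text(path, translators = TRANSLATORS)
  stdout, status = Open3.capture2(*converter_command(path, translators))
  raise "converter failed for #{path}" unless status.success?
  stdout
end
```

Calling extract_text("./doc/report.doc") would then return the plain text, which can be saved to a text column for searching. Because the command is passed as an array, the file name never goes through a shell.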

Walter

On Mar 16, 2011, at 12:51 PM, Hafiz Badrie Lubis wrote:

But all I found is that I
need to use either JRuby combined with Apache POI, or win32ole.

You can run poi as a separate process and then grab its output.


Scott R.
http://www.elevated-dev.com/
(303) 722-0567 voice

Can you show me how to do it? Do you have a reference
for making a Rails project work together with JRuby code?

  1. Convert .doc to .pdf with PyODConverter
    http://www.artofsolving.com/opensource/pyodconverter

  2. Convert .pdf to .tiff with ImageMagick

  3. Process .tiff through Tesseract OCR and get .txt
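
Driven from Ruby, the three steps above could be chained roughly as in this sketch. It assumes PyODConverter's DocumentConverter.py script, ImageMagick's convert, and tesseract are all installed, and that an OpenOffice instance is already running for PyODConverter to talk to. The function names, the script location, and the 300 dpi density are illustrative choices, not part of any of those tools.

```ruby
require "open3"

# Build the three commands for one document. The path to
# DocumentConverter.py and the density value are assumptions.
def ocr_pipeline_commands(doc_path, work_dir, converter_script = "DocumentConverter.py")
  base = File.join(work_dir, File.basename(doc_path, ".*"))
  [
    ["python", converter_script, doc_path, "#{base}.pdf"],          # 1. .doc -> .pdf
    ["convert", "-density", "300", "#{base}.pdf", "#{base}.tiff"],  # 2. .pdf -> .tiff
    ["tesseract", "#{base}.tiff", base]                             # 3. OCR, writes base.txt
  ]
end

# Run each step in order, stop at the first failure, and return the OCR text.
def doc_to_text_via_ocr(doc_path, work_dir)
  ocr_pipeline_commands(doc_path, work_dir).each do |cmd|
    _out, err, status = Open3.capture3(*cmd)
    raise "#{cmd.first} failed: #{err}" unless status.success?
  end
  File.read(File.join(work_dir, File.basename(doc_path, ".*")) + ".txt")
end
```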


On Mar 16, 2011, at 5:10 PM, Vladimir R. wrote:

  1. Convert .doc to .pdf with PyODConverter
    http://www.artofsolving.com/opensource/pyodconverter

  2. Convert .pdf to .tiff with ImageMagick

  3. Process .tiff through Tesseract OCR and get .txt

Wow, talk about a long slow way to potentially lose text flow and
introduce errors…


Scott R.

Ok thank you, Scott.
I’ll try your advice.

I’m coming to this late and I’ve partially deleted the thread, so I may
be way off base…

An old plugin might be of help:

It makes use of existing command line converter utility programs.

Cheers,
Walter

On Mar 16, 2011, at 8:06 PM, Hafiz Badrie Lubis wrote:

Do you have a reference
for making a Rails project work together with JRuby code?

It has nothing whatsoever to do with JRuby. You can run Java apps from
Ruby exactly like any other command-line process. I don’t know if POI is
just a library, or has a full app utility as well. If it’s just a lib,
you’d have to write the program, probably a half-dozen lines of Java.
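
As a minimal sketch of the Ruby side, assuming you have written and compiled such a Java program yourself: the class name WordToText and the classpath in the usage note are hypothetical placeholders, and java_command and run_java are names invented for this example.

```ruby
require "open3"

# Build the argv array for launching a Java main class.
# Both classpath and main_class are supplied by the caller.
def java_command(classpath, main_class, *args)
  ["java", "-cp", classpath, main_class, *args]
end

# Run the Java program as a child process and return what it printed to
# stdout. Raises if the JVM exits with a non-zero status.
def run_java(classpath, main_class, *args)
  stdout, stderr, status = Open3.capture3(*java_command(classpath, main_class, *args))
  raise "java exited with #{status.exitstatus}: #{stderr}" unless status.success?
  stdout
end
```

With a hypothetical WordToText class that prints the document text, run_java("lib/poi.jar:.", "WordToText", "report.doc") would return that text as a string from plain Ruby, no JRuby involved.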


Scott R.