Forum: Ruby on Rails
How to Parse a Microsoft Word Document

Hafiz Badrie Lubis (Guest)
on 2011-03-16 20:54
(Received via mailing list)
Hi People,

I just joined the group and have a question. I'm still learning Ruby on
Rails, and I have been given a task to parse Microsoft Word documents and
store their content in a database.

Do you have any suggestions on how to do it?

FYI, I'm developing under a Unix environment, so I can't use win32ole,
CMIIW.

I have also searched the internet about this, but all I found is that I
need to either use JRuby combined with Apache POI or use win32ole. As far
as I know, to use JRuby I would need to create the Rails project with
JRuby as well, but unfortunately I already created the project with plain
Ruby.

So I don't know what else to try. Does anybody have a clue?

Regards,

Hafiz Badrie Lubis
Walter Davis (walterdavis)
on 2011-03-16 21:50
(Received via mailing list)
On Mar 16, 2011, at 2:51 PM, Hafiz Badrie Lubis wrote:

>
> I have also searched the internet about this, but all I found is that I
> need to either use JRuby combined with Apache POI or use win32ole. As far
> as I know, to use JRuby I would need to create the Rails project with
> JRuby as well, but unfortunately I already created the project with plain
> Ruby.
>
> So I don't know what else to try. Does anybody have a clue?

I did a project in PHP quite a few years ago, and I used some
venerable unix cli converters to do this. I stored the files as is,
and then used these converters to rip out their text and stored that
in the database for searching. They aren't perfect, but they do a good
enough job for search results.

// These translators all write to stdout, which means that shell_exec()
// will return their text value.
$translators = array(
  'pdf' => '/usr/local/bin/pdftotext ./pdf/%s.pdf -',
  'ppt' => '/usr/local/bin/catppt -d ascii ./ppt/%s.ppt',
  'xls' => '/usr/local/bin/xls2csv -d ascii ./xls/%s.xls',
  'doc' => '/usr/local/bin/catdoc -d ascii ./doc/%s.doc'
);
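
For a Rails app, the same idea carries over to Ruby. The following is a minimal sketch, not a tested implementation: the constant and method names are made up for illustration, and the tool paths and flags are simply copied from the PHP table above, so they assume the same converters are installed in /usr/local/bin.

```ruby
require "open3"

# Converter commands keyed by file extension. Paths and flags mirror the PHP
# table above (an assumption about your system); "%s" marks where the file goes.
TRANSLATORS = {
  ".pdf" => ["/usr/local/bin/pdftotext", "%s", "-"],
  ".ppt" => ["/usr/local/bin/catppt", "-d", "ascii", "%s"],
  ".xls" => ["/usr/local/bin/xls2csv", "-d", "ascii", "%s"],
  ".doc" => ["/usr/local/bin/catdoc", "-d", "ascii", "%s"]
}.freeze

# Build the argument list for a file, or return nil for an unknown extension.
def converter_command(path, translators = TRANSLATORS)
  template = translators[File.extname(path).downcase]
  return nil unless template
  template.map { |arg| arg == "%s" ? path : arg }
end

# Run the converter without a shell and return whatever it wrote to stdout.
def extract_text(path, translators = TRANSLATORS)
  command = converter_command(path, translators)
  raise ArgumentError, "no converter for #{path}" unless command
  output, status = Open3.capture2(*command)
  raise "#{command.first} exited with an error" unless status.success?
  output
end
```

Something like `extract_text("./doc/report.doc")` would then return the plain text, ready to be saved in a text column for searching. Passing the arguments as an array avoids the shell, so unusual characters in file names are not interpreted.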

Walter
Scott Ribe (Guest)
on 2011-03-16 21:51
(Received via mailing list)
On Mar 16, 2011, at 12:51 PM, Hafiz Badrie Lubis wrote:

> But all I found is that I
> need to either use JRuby combined with Apache POI or use
> win32ole.

You can run POI as a separate process and then grab its output.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
Vladimir Rybas (Guest)
on 2011-03-17 00:12
(Received via mailing list)
1. Convert .doc to .pdf with PyODConverter
http://www.artofsolving.com/opensource/pyodconverter

2. Convert .pdf to .tiff with ImageMagick

3. Process .tiff through Tesseract OCR and get .txt
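
The three steps above could be chained from Ruby roughly as follows. This is only a sketch under several assumptions: the method names are hypothetical, DocumentConverter.py is assumed to be the PyODConverter script in the current directory with an OpenOffice instance already listening for it, and `convert` and `tesseract` are assumed to be on the PATH.

```ruby
# Build the three commands of the .doc -> .pdf -> .tiff -> .txt pipeline.
# The converter script location is an assumption; adjust it for your setup.
def ocr_pipeline_commands(doc_path, work_dir, converter_script: "DocumentConverter.py")
  base = File.join(work_dir, File.basename(doc_path, ".*"))
  pdf  = "#{base}.pdf"
  tiff = "#{base}.tiff"
  [
    ["python", converter_script, doc_path, pdf],  # 1. .doc to .pdf
    ["convert", "-density", "300", pdf, tiff],    # 2. .pdf to .tiff
    ["tesseract", tiff, base]                     # 3. OCR, writes "#{base}.txt"
  ]
end

# Run each command in order, stopping at the first one that fails.
def run_pipeline(commands)
  commands.each do |command|
    system(*command) or raise "#{command.first} failed"
  end
  true
end
```

After `run_pipeline(ocr_pipeline_commands("report.doc", "/tmp/work"))`, the recognized text would be expected in /tmp/work/report.txt.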


Hafiz Badrie Lubis (Guest)
on 2011-03-17 03:08
(Received via mailing list)
Can you show me how to do it? Do you have a reference on how to make a
Rails project work together with JRuby code?
Scott Ribe (Guest)
on 2011-03-17 05:48
(Received via mailing list)
On Mar 16, 2011, at 5:10 PM, Vladimir Rybas wrote:

> 1. Convert .doc to .pdf with PyODConverter
> http://www.artofsolving.com/opensource/pyodconverter
>
> 2. Convert .pdf to .tiff with ImageMagick
>
> 3. Process .tiff through Tesseract OCR and get .txt

Wow, talk about a long slow way to potentially lose text flow and
introduce errors...

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
Scott Ribe (Guest)
on 2011-03-17 05:56
(Received via mailing list)
On Mar 16, 2011, at 8:06 PM, Hafiz Badrie Lubis wrote:

> Do you have a reference on how to make a Rails project work together
> with JRuby code?

It has nothing whatsoever to do with JRuby. You can run Java apps from
Ruby exactly like any other command-line process. I don't know if POI is
just a library, or has a full app utility as well. If it's just a lib,
you'd have to write the program, probably a half-dozen lines of Java.
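
Running a Java program from plain Ruby might look like the sketch below. It assumes you have written a small Java class around POI yourself that prints the document text to stdout; the class name ExtractDoc, the classpath, and the Ruby method names are all hypothetical and only for illustration.

```ruby
require "open3"

# Build the command line for a hypothetical Java wrapper class around POI.
# "ExtractDoc" and the classpath are placeholders, not something POI ships.
def poi_command(doc_path, classpath: "lib/*:.", main_class: "ExtractDoc")
  ["java", "-cp", classpath, main_class, doc_path]
end

# Run any command without a shell and return its stdout as a string.
# Raises with the captured stderr if the process exits with a non-zero status.
def run_and_capture(command)
  stdout, stderr, status = Open3.capture3(*command)
  raise "#{command.first} failed: #{stderr.strip}" unless status.success?
  stdout
end
```

With that in place, `text = run_and_capture(poi_command("resume.doc"))` works from a plain Ruby Rails app; no JRuby is involved because Java runs as an ordinary child process.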

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
Hafiz Badrie Lubis (Guest)
on 2011-03-17 06:00
(Received via mailing list)
Ok thank you, Scott.
I'll try your advice.
Walter McGinnis (Guest)
on 2011-03-17 10:59
(Received via mailing list)
I'm coming to this late and I've partially deleted the thread, so I may
be way off base...

An old plugin might be of help:

https://github.com/kete/convert_attachment_to

It makes use of existing command line converter utility programs.

Cheers,
Walter