MS Word files and PDFs


#1

Hello,

I’m fairly new to the Ruby scene.
Is there any library that can read MS Word (.doc) files and extract the pure
text…what about libs for PDF files?

Thanks folks,

M.


#2

You could use the win32ole library and read them yourself via OLE.
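
A minimal sketch of the OLE approach, assuming a Windows machine with Microsoft Word installed. The method names `extract_doc_text` and `normalize_word_text` are made up for illustration; `Documents.Open`, `Content.Text`, `Close` and `Quit` come from Word's automation object model.

```ruby
# Word's Range.Text marks paragraph ends with carriage returns,
# so convert them to newlines for ordinary Ruby string handling.
def normalize_word_text(raw)
  raw.to_s.gsub(/\r\n?/, "\n")
end

# Hypothetical helper: opens a .doc through Word and returns its text.
# Windows only; win32ole is required lazily so this file loads elsewhere.
def extract_doc_text(path)
  require 'win32ole'
  # Word expects an absolute path with backslashes.
  full_path = File.expand_path(path).tr('/', '\\')
  word = WIN32OLE.new('Word.Application')
  begin
    word.Visible = false
    doc = word.Documents.Open(full_path)
    normalize_word_text(doc.Content.Text)
  ensure
    doc.Close(0) if doc   # 0 = wdDoNotSaveChanges
    word.Quit
  end
end
```

On Windows you would call something like `puts extract_doc_text('report.doc')`, where the file name is only a placeholder.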


#3

On Sun, 23 Apr 2006, Mateo Barraza wrote:

I’m fairly new to the Ruby scene.
Is there any library that can read MS Word (.doc) files and extract the pure
text…what about libs for PDF files?

Hi,

There isn’t an MS Word library that I know of that will easily allow you
to extract the pure text, but the OLE suggestion is a good idea. Another
method would be to save the document as WordprocessingML (XML), if you
have Word 2003, and use either REXML or libxml-ruby (two Ruby XML
libraries) to parse it (or XSLT). Once you’ve got the XML, the interesting
nodes (if you really only want text) are the ‘w:t’ ones.
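
A rough sketch of the REXML route, assuming the file was saved from Word 2003 as XML and declares the `w:` prefix itself. The method name `wordml_text` is hypothetical, and joining paragraphs with newlines is my own choice, not something Word dictates.

```ruby
require 'rexml/document'

# Hypothetical helper: returns the plain text of a WordprocessingML string.
# Each w:p is a paragraph; its text is split across one or more w:t nodes.
def wordml_text(xml)
  doc = REXML::Document.new(xml)
  paragraphs = REXML::XPath.match(doc, '//w:p')
  paragraphs.map do |para|
    # Concatenate the text runs of this paragraph in document order.
    REXML::XPath.match(para, './/w:t').map(&:text).join
  end.join("\n")
end
```

Usage would be along the lines of `puts wordml_text(File.read('report.xml'))`, with the file name as a placeholder.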

HTH,
Keith


#4

Thanks for your responses. I also found that the POI Java project was
extended to support Ruby:
http://jakarta.apache.org/poi/poi-ruby.html
Still, I think the win32ole solution is the best for simply
reading the content of the docs…

M


#5

Keith S. wrote:

You could use the win32ole library and read them yourself via OLE.

Hi,

Could you provide a code snippet?


#6

On Feb 8, 2008 8:18 AM, Rajesh S. removed_email_address@domain.invalid wrote:

I have found this most useful:
http://rubyonwindows.blogspot.com/
What you want should be hidden in there:
http://rubyonwindows.blogspot.com/search/label/word

A most valuable read anyway.

Cheers
Robert


http://ruby-smalltalk.blogspot.com/


Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein