Text Extraction and Indexing

elliottcable · April 20, 2007, 7:40pm

Long story short, I am going to have to index and search uploaded files.
They will be in Word, Excel, PDF, and plain-text formats. What is the
best way to extract the text in RoR so that I can store it in the
database? There are command-line utilities that will convert Word to
text, but I would prefer an in-code solution if possible. Any
suggestions for Excel? The only thing I could find was a Perl module.

I’ve decided to use acts_as_ferret as my indexing agent. Does anyone
have any tips on using it other than
http://www.railsenvy.com/2007/2/19/acts-as-ferret-tutorial ?

–
Elliott C.

elliottcable · April 20, 2007, 7:55pm

Hi Elliott,

have a look at the ContentExtractor of RDig (http://rdig.rubyforge.org/).
It might get you a good way with PDF and Word, though as far as I know
it uses command-line utilities.
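
For what it's worth, here is a rough sketch of the command-line approach in plain Ruby. It is not RDig's actual ContentExtractor code. The helper names (EXTRACTORS, extractor_command, extract_text) are made up for illustration, and it assumes pdftotext and antiword are installed on the server. Excel is left out because I don't know of a converter I could recommend with confidence.

```ruby
require 'open3'

# Hypothetical mapping from file extension to an external converter.
# Assumption: pdftotext and antiword are installed and on the PATH.
# Neither tool is part of Rails or RDig.
EXTRACTORS = {
  '.pdf' => ->(path) { ['pdftotext', '-enc', 'UTF-8', path, '-'] }, # "-" writes to stdout
  '.doc' => ->(path) { ['antiword', path] }
}.freeze

# Returns the argv array for converting +path+, or nil if no converter
# is registered for its extension.
def extractor_command(path)
  builder = EXTRACTORS[File.extname(path).downcase]
  builder && builder.call(path)
end

# Returns the plain text of an uploaded file, ready to be stored in a
# database column and indexed.
def extract_text(path)
  # Plain-text uploads need no conversion.
  return File.read(path) if File.extname(path).downcase == '.txt'

  command = extractor_command(path)
  raise ArgumentError, "no extractor for #{path}" unless command

  # Passing argv as separate strings bypasses the shell, so an odd
  # uploaded file name cannot inject extra commands.
  output, status = Open3.capture2(*command)
  raise "#{command.first} failed for #{path}" unless status.success?
  output
end
```

You could call extract_text from the model when an upload is saved and write the result to a text column that acts_as_ferret indexes. If a converter is missing on the machine, Open3 raises Errno::ENOENT, so you may want to rescue that and log it.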

Cheers,
Jan


–
Jan P.
Rechtsanwalt

Grünebergstraße 38
22763 Hamburg
Tel +49 (0)40 41265809 Fax +49 (0)40 380178-73022
Mobil +49 (0)171 3516667
http://www.inviado.de