I was wondering whether it is possible to search Word documents using Ferret.
The actual text in a Word document isn’t stored in a binary format; only the
formatting is. Surely it would be possible to parse that?
You might be able to use one of the Ruby extensions for the M$ platform to
get at the data through COM. Or, if you don’t want to run on the M$ platform,
you could possibly use Java’s POI from Jakarta to parse out the text and
hand it to something that Ruby could then put into Ferret.
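
A minimal sketch of the COM route in Ruby, assuming a Windows machine with Word installed and Ruby's bundled win32ole library. The helper names (to_windows_path, extract_text_via_com) are made up for illustration, and I have not tried this against every Word version, so treat it as a starting point rather than a finished solution.

```ruby
# Hypothetical helper: Word generally expects backslash-separated paths.
def to_windows_path(path)
  path.tr('/', '\\')
end

# Hypothetical helper: opens the document in a hidden Word instance via COM
# and returns its plain text. Requires Windows with Word installed.
def extract_text_via_com(doc_path)
  require 'win32ole'                  # loaded lazily so this file still loads elsewhere
  word = WIN32OLE.new('Word.Application')
  word.Visible = false
  doc = word.Documents.Open(to_windows_path(File.expand_path(doc_path)))
  doc.Content.Text                    # text of the main document body
ensure
  doc.Close(0) if doc                 # 0 = wdDoNotSaveChanges
  word.Quit if word
end
```

The string returned by extract_text_via_com could then be added to a Ferret index like any other text field.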
Charlie
Or there’s AbiWord, which runs on all platforms and outputs nice text. If
you don’t want graphical dependencies, there’s wvWare, too. I’m using
it at the moment.
> I successfully used the wv-utilities (wvText or wvHtml, on debian do
> ‘apt-get install wv’) to index word documents with Ferret.
Thanks Jens,
Is there any way to do this on Windows, or will I just have to wait until
I deploy on Linux?
On Sat, Nov 18, 2006 at 04:33:26PM +0100, Charlie H. wrote:
> Alex MacCaw wrote:
> > I was wondering if it is possible to search word documents using ferret.
> > The actual text in a word document isn’t in a binary format - only the
> > formatting. Surely it would be possible to parse that?
>
> You might be able to use some of the extensions for M$ platform and ruby
> to use COM to get the data. Or if you don’t want to run on M$ platform
> you could possibly use Java’s POI from Jakarta to parse out the text and
> put it into something that Ruby could then put into ferret.
I successfully used the wv utilities (wvText or wvHtml; on Debian, run
‘apt-get install wv’) to index Word documents with Ferret.
You can have a look at RDig (http://rubyforge.org/projects/rdig) to see
an example of how this could be done.
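
As a rough illustration of the wvText approach (this is a sketch, not RDig's actual code), the following shells out to wvText and pushes the result into an index. It assumes wvText is on the PATH and is called as `wvText input.doc output.txt`; the method names and the :file and :content field names are my own choices.

```ruby
require 'open3'
require 'tmpdir'

# Command line for wvText: input document first, output text file second.
def wv_text_command(doc_path, out_path)
  ['wvText', doc_path, out_path]
end

# Runs wvText into a temporary file and returns the extracted text.
def extract_word_text(doc_path)
  Dir.mktmpdir do |dir|
    out_path = File.join(dir, 'out.txt')
    _out, err, status = Open3.capture3(*wv_text_command(doc_path, out_path))
    raise "wvText failed: #{err}" unless status.success?
    File.read(out_path)
  end
end

# Adds one document to anything that accepts hashes via <<.
# The extractor is injectable so another converter can be swapped in.
def index_word_document(index, doc_path, extractor: method(:extract_word_text))
  index << { file: doc_path, content: extractor.call(doc_path) }
end
```

With the ferret gem installed, something like `index = Ferret::Index::Index.new(:path => 'word_index')` followed by `index_word_document(index, 'report.doc')` should work, since a Ferret index accepts hashes via `<<`.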
Jens
--
webit! Gesellschaft für neue Medien mbH www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer [email protected]
Schnorrstraße 76 Tel +49 351 46766 0
D-01069 Dresden Fax +49 351 46766 66