Parsers for input to index?

The documents we want to index come in many formats; e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc. I’ve been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success. Any help will be appreciated.

Hi Dick,

You may need to turn to external tools.

Something similar to this was discussed before, and some tools were suggested.

See: Indexing mostly-binary documents (.ppt) - Ferret - Ruby-Forum

Assuming the text is stored as single-byte ASCII, you could fall back on
the “strings” command as a last resort. It should already be installed
on modern GNU/Linux distros; try Cygwin on Windows. It reads in any
data and outputs all “printable character sequences”.
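
For anyone who cannot install “strings”, the same last-resort idea can be sketched in a few lines of Python. This is only an approximation of what the command does, not its actual implementation: the function name printable_strings and the min_length parameter are made up for illustration, and it only handles single-byte ASCII text.

```python
import re


def printable_strings(data, min_length=4):
    """Return runs of printable ASCII found in a bytes object.

    Rough imitation of the "strings" idea: keep any run of at least
    min_length bytes in the range 0x20 (space) to 0x7e (~).
    The name and parameter are illustrative assumptions.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    # Binary junk with two readable runs and one run that is too short.
    sample = b"\x00\x01Hello, world\x00ab\x02indexable text\xff"
    for text in printable_strings(sample):
        print(text)
```

The output will contain a fair amount of noise on real binary files, so it is best kept as a fallback for formats with no proper converter.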

John.

On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:

The documents we want to index come in many formats; e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc. I’ve been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success. Any help will be appreciated.


http://johnleach.co.uk

Hello Dick, and all (first post),

Here are some more that I use:

HTML to text: Vilistextum

also lynx:

PDF to text: pdftotext, from Xpdf
http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd
http://libwpd.sourceforge.net/

Converting other text encodings: iconv
http://www.gnu.org/software/libiconv/
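
One way to tie tools like these together before indexing is a small dispatcher that picks a converter by file extension. The following is a rough Python sketch, assuming pdftotext, lynx and wpd2text are installed and on the PATH; the CONVERTERS table and the build_command and extract_text names are hypothetical, and the choice of options is just one reasonable setup.

```python
import subprocess
from pathlib import Path

# Extension -> command template. "{path}" is replaced by the input file.
# This table is an illustrative assumption; adjust it to the tools you have.
CONVERTERS = {
    # A trailing "-" makes pdftotext write the text to stdout.
    ".pdf": ["pdftotext", "-enc", "UTF-8", "{path}", "-"],
    # -dump prints the rendered page; -nolist omits the list of links.
    ".html": ["lynx", "-dump", "-nolist", "{path}"],
    ".htm": ["lynx", "-dump", "-nolist", "{path}"],
    # wpd2text writes the converted text to stdout.
    ".wpd": ["wpd2text", "{path}"],
}


def build_command(path):
    """Return the argument list for converting path, or None if unknown."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if part == "{path}" else part for part in template]


def extract_text(path):
    """Run the matching converter and return its output as text."""
    command = build_command(path)
    if command is None:
        raise ValueError(f"no converter configured for {path}")
    result = subprocess.run(command, capture_output=True, check=True)
    # Undecodable bytes are replaced rather than raising an error.
    return result.stdout.decode("utf-8", errors="replace")
```

The string returned by extract_text can then be handed to the indexer as a document field. Files with no matching converter could be routed to a fallback, or skipped and logged.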

-Stuart Sierra