Parsers for input to index?

dmonahan · April 25, 2007, 7:14pm

The documents we want to index come in many formats; e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc. I’ve been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success. Any help will be appreciated.

dmonahan · April 25, 2007, 7:40pm

Hi Dick,

You may need to turn to external tools.

Something similar was discussed before, and some tools were suggested.

See: Indexing mostly-binary documents (.ppt) - Ferret - Ruby-Forum

Assuming the text is stored as single-byte ASCII, you could fall back on
the “strings” command as a last resort. It reads in any data and outputs
all “printable character sequences”. It should already be installed on
modern GNU/Linux distros; on Windows, try Cygwin.
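
For anyone who cannot install “strings”, the same idea can be sketched in a few lines of Python. This is only an approximation of what the command does, assuming single-byte ASCII text and a minimum run length of 4 (the usual default for GNU strings); the function name printable_runs is made up for this example.

```python
import re

def printable_runs(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes found in `data`.

    Printable here means space through tilde, plus tab. This only
    approximates the behaviour of the strings command.
    """
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Binary junk with two readable runs; "abc" is shorter than 4 and is dropped.
sample = b"\x00\x01Hello, Ferret\x00\xff\x02abc\x00indexable text\x00"
runs = printable_runs(sample)
```

The resulting list can be joined with newlines and handed to the indexer as the document body. Expect noise: anything in the file that happens to look like text will come through too.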

John.

–
http://johnleach.co.uk

dmonahan · April 25, 2007, 8:24pm

Hello Dick, and all (first post),

Here are some more that I use:

HTML to text: Vilistextum

HTML to text (alternative): lynx

PDF to text: pdftotext, from Xpdf
http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd
http://libwpd.sourceforge.net/

Converting other text encodings: iconv
http://www.gnu.org/software/libiconv/

-Stuart Sierra
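
One way to tie the tools above together is a small dispatcher that picks a converter by file extension and captures its standard output. The following is a rough sketch, not a tested pipeline: the names CONVERTERS, build_command, extract_text and reencode are invented for this example, and the command-line flags are the commonly documented ones (pdftotext writes to stdout when the output file is "-", lynx -dump prints the rendered page), so check them against the versions you have installed. The reencode helper mirrors what iconv does for a known source encoding.

```python
import os
import shutil
import subprocess

# Extension -> argv template; "{path}" is replaced with the input file.
# Flags are assumptions based on the tools' usual documentation.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],            # "-" sends text to stdout
    ".html": ["lynx", "-dump", "-nolist", "{path}"],
    ".htm": ["lynx", "-dump", "-nolist", "{path}"],
    ".wpd": ["wpd2text", "{path}"],
}

def build_command(path):
    """Return the argv list for converting `path`, or None if unmapped."""
    ext = os.path.splitext(path)[1].lower()
    template = CONVERTERS.get(ext)
    if template is None:
        return None
    return [path if arg == "{path}" else arg for arg in template]

def extract_text(path, encoding="utf-8"):
    """Run the mapped converter and return its output as text."""
    cmd = build_command(path)
    if cmd is None:
        raise ValueError("no converter configured for %r" % path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("%s is not installed or not on PATH" % cmd[0])
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding, errors="replace")

def reencode(data, source_encoding, target_encoding="utf-8"):
    """Convert bytes between encodings, like iconv -f SOURCE -t TARGET."""
    return data.decode(source_encoding).encode(target_encoding)
```

Calling extract_text("manual.pdf") would then return the text to index, provided pdftotext is installed. Formats without an entry in the table raise an error, so they can be logged and handled separately.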