Parsers for input to index?

The documents we want to index come in many formats; e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc. I’ve been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success. Any help will be appreciated.

Hi Dick,

You may need to turn to some external tools.

Something similar was discussed before, and some tools were suggested.

See: http://www.ruby-forum.com/topic/103374

Assuming the text is stored as single-byte ASCII, you could fall back on
the “strings” command as a last resort. It should already be installed
on modern GNU/Linux distros; try Cygwin on Windows. It reads any
data and outputs all “printable character sequences”.
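
If “strings” is not available, the same idea can be roughly approximated in a few lines. This is only a sketch, not a reimplementation of the real tool: the function name printable_runs and the minimum run length of 4 are assumptions chosen for illustration, and it handles single-byte ASCII only.

```python
import re


def printable_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII bytes that are at least min_len long.

    Rough approximation of the "strings" idea: anything outside the
    range 0x20-0x7e (space through tilde) ends the current run.
    """
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}")
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00ab\x02indexable text\xff"
    for run in printable_runs(sample):
        print(run)
```

Short runs such as “ab” in the sample are dropped, which filters out most binary noise, but expect plenty of junk from formats that compress or encode their text.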

John.

On Wed, 2007-04-25 at 19:14 +0200, Dick Monahan wrote:

The documents we want to index come in many formats; e.g., HTML, PDF,
RTF, Word, Excel, etc., etc., etc. I’ve been searching to find parsers
that will translate each of these formats to indexable text, but have
had little success. Any help will be appreciated.


http://johnleach.co.uk

Hello Dick, and all (first post),

Here are some more that I use:

HTML to text: Vilistextum
http://bhaak.dyndns.org/vilistextum/
also lynx:
http://lynx.browser.org/

PDF to text: pdftotext, from Xpdf
http://www.foolabs.com/xpdf/

WordPerfect to text: wpd2text, from libwpd
http://libwpd.sourceforge.net/

Converting other text encodings: iconv
http://www.gnu.org/software/libiconv/
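
One way to tie these tools together is to pick a converter by file extension and capture its standard output as the text to index. The sketch below is an assumption about how that might look, not a tested pipeline: the names CONVERTERS, build_command and extract_text are made up for illustration, the tools must already be installed and on the PATH, and the extension mapping is a guess at what is wanted.

```python
import subprocess
from pathlib import Path

# Extension -> argument list; "{path}" is replaced by the input file.
# "pdftotext FILE -" writes to standard output, "lynx -dump" prints the
# rendered page as plain text, and wpd2text prints to standard output.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".html": ["lynx", "-dump", "-nolist", "{path}"],
    ".htm": ["lynx", "-dump", "-nolist", "{path}"],
    ".wpd": ["wpd2text", "{path}"],
}


def build_command(path):
    """Return the command line for converting path to text."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        raise ValueError(f"no converter configured for {path}")
    return [str(path) if part == "{path}" else part for part in template]


def extract_text(path):
    """Run the external converter and return its output as a string."""
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    # Output encoding depends on the tool and the document; undecodable
    # bytes are replaced here, or the output can be piped through iconv.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") would then run pdftotext and return whatever it prints; unknown extensions raise ValueError so they can be logged and skipped.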

-Stuart Sierra
