Index Word and PDF documents for full-text search

Hello list,

I’m about to start building a document catalog in Rails, and it is basically
a catalog for .doc and .pdf documents. First, I would like to know if there
is anything like Lucene for Rails (or Ruby) - maybe a Rails plugin?

Second: It would be nice if the user could search for a term and the search
engine could “look into” the available documents. Would it be possible
somehow to pre-index Word and PDF documents so that they would be searchable?

Thanks,

Marcelo.

Hi Marcelo,

Take a look at acts_as_solr

http://acts_as_solr.railsfreaks.com/

It provides a simple Rails integration with Solr (a search server
based on Lucene that includes XML and JSON APIs).
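
To give an idea of what talking to Solr's JSON API looks like without any plugin, here is a minimal sketch in plain Ruby. The server URL and the helper names (solr_query_uri, parse_solr_docs) are assumptions made up for illustration; they are not part of acts_as_solr. It only builds a query URL for the select handler with wt=json and pulls the matching documents out of a response body.

```ruby
require "json"
require "uri"

# Assumed Solr location; adjust to wherever your server is running.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Build a query URL that asks Solr for a JSON response (wt=json).
# `rows` limits how many matches come back.
def solr_query_uri(term, rows: 10)
  uri = URI(SOLR_SELECT_URL)
  uri.query = URI.encode_www_form(q: term, wt: "json", rows: rows)
  uri
end

# Extract the list of matching documents from a Solr JSON response body.
def parse_solr_docs(body)
  JSON.parse(body).fetch("response").fetch("docs")
end

# Usage against a running server (not executed here):
#   require "net/http"
#   docs = parse_solr_docs(Net::HTTP.get(solr_query_uri("invoice")))
```

A plugin like acts_as_solr hides these HTTP calls behind your models, so in a Rails app you would normally not write this by hand.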

Mike

Hi Mike, thanks for the tip!

On 8/27/07, Marcelo de Moraes S. wrote:

Would it be possible somehow to pre-index Word and PDF documents so that
they would be searchable?

As I’m finding out, this is extremely complicated. The search part is
pretty easy; we are using Solr for that right now. The indexing part
is another matter. We are looking at indexing terabytes of Word, PDF,
HTML, and whatever other formats we can support. We store MS Office
documents as PDF, but PDF is hard to index, so I’m looking at using
OpenOffice to convert each document to XML, index it at that point,
and then convert it to PDF. So far I haven’t found anything open source
that works as well as OpenOffice for document conversion.
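
For the "index it at that point" step, here is a rough sketch of turning already-extracted text into the XML that Solr's update handler accepts. It assumes the text has been pulled out of the document by some earlier conversion step, which is not shown. The helper names and the field names ("id", "title", "text") are made up for illustration and would have to match your own schema.

```ruby
# Collapse whitespace runs (newlines, tabs, form feeds left over from
# conversion) into single spaces so the indexed text is clean.
def normalize_text(text)
  text.gsub(/\s+/, " ").strip
end

# Build the XML body for a Solr <add> request.
# Field names are assumptions; they must exist in your Solr schema.
def solr_add_xml(id:, title:, text:)
  fields = { "id" => id, "title" => title, "text" => normalize_text(text) }
  body = fields.map do |name, value|
    # encode(xml: :text) escapes &, < and > so the content stays valid XML.
    %(<field name="#{name}">#{value.to_s.encode(xml: :text)}</field>)
  end.join
  "<add><doc>#{body}</doc></add>"
end

# Usage (not executed here): POST the result to the update handler,
# e.g. http://localhost:8983/solr/update with Content-Type text/xml,
# then send <commit/> to make the new document searchable.
```

Keeping extraction and indexing as separate steps means a format that fails to convert does not block the documents that do.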

Chris