On 8/27/07, Marcelo de Moraes S. wrote:
I’m about to start building a document catalog in Rails, and it is basically
a catalog for .doc and .pdf documents. First, I would like to know if there
is anything like Lucene for Rails (or Ruby) - maybe a Rails plugin?
Second: It would be nice if the user could search for a term and the search
engine could “look into” the available documents. Would it be possible
somehow to pre-index Word and PDF documents so that they would be searchable?

As I’m finding out, this is extremely complicated. The search part is
pretty easy; we are using Solr right now for that. The indexing part
is another matter. We are looking at indexing terabytes of Word, PDF,
HTML, and whatever other formats we can support. We are storing MS
documents as PDF, but PDF is hard to index, so I’m looking at using
OpenOffice to convert to XML, index it at that point, and then convert
to PDF. So far I haven’t found anything open source that works as
well as OpenOffice for document conversion.
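
To make the "convert to XML, then index" step a bit more concrete, here is a minimal Ruby sketch of what the indexing half might look like. It is not the setup described above, just an illustration under some assumptions: that the XML coming out of the conversion is an OpenDocument content.xml string, that the Solr schema has fields named "id" and "text", and that the update handler lives at whatever URL you pass in. The helper names (extract_paragraphs, solr_add_xml, post_to_solr) are made up for this example.

```ruby
require "rexml/document"
require "net/http"
require "uri"

# Namespace OpenDocument uses for text elements such as <text:p>.
ODF_TEXT_NS = { "text" => "urn:oasis:names:tc:opendocument:xmlns:text:1.0" }

# Collect the character data under a node, descending into nested
# elements such as <text:span>.
def node_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then node_text(child)
    else ""
    end
  end.join
end

# Return the text of every non-empty paragraph in a content.xml string.
# Assumption: the converted document is OpenDocument XML.
def extract_paragraphs(content_xml)
  doc = REXML::Document.new(content_xml)
  REXML::XPath.match(doc, "//text:p", ODF_TEXT_NS)
              .map { |para| node_text(para).strip }
              .reject(&:empty?)
end

# Build a Solr <add> update message. The field names "id" and "text"
# are assumptions and have to match your own schema.
def solr_add_xml(id, paragraphs)
  root = REXML::Element.new("add")
  doc = root.add_element("doc")
  doc.add_element("field", "name" => "id").text = id
  doc.add_element("field", "name" => "text").text = paragraphs.join("\n")
  root.to_s
end

# POST an update message to Solr. update_url is a placeholder for your
# own update handler; a <commit/> message still has to be sent afterwards.
def post_to_solr(xml, update_url)
  uri = URI.parse(update_url)
  request = Net::HTTP::Post.new(uri.request_uri,
                                "Content-Type" => "text/xml; charset=utf-8")
  request.body = xml
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end
```

Usage would be something like post_to_solr(solr_add_xml("doc-1", extract_paragraphs(xml)), your_update_url). Keeping the XML building separate from the HTTP call means the extraction can be checked against sample documents without a running Solr instance.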