PDF text search in Rails

ripan · June 3, 2008, 7:26am

Is there any plugin that can search inside PDF documents? For example,
a user should be able to search for keywords in the PDF contents.

ripan · June 3, 2008, 2:42pm

Good morning -

On 3-Jun-08, at 1:25 AM, ripan wrote:

Is there any plugin that can search inside PDF documents? For example,
a user should be able to search for keywords in the PDF contents.

Someone submitted a patch for acts_as_solr to index documents. Check
the Google group for that project.

J

ripan · June 3, 2008, 2:47pm

Is there any plugin that can search inside PDF documents?

Maybe you can try rpdf2txt (http://raa.ruby-lang.org/project/rpdf2txt/),
or JRuby on Rails (JRoR) with one of the many Java PDF libraries. I’m not
aware of a Rails plugin.
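
A minimal sketch of the extraction step, for illustration only. It does not use rpdf2txt (its API is not shown in this thread); instead it swaps in the pdftotext command-line tool from Xpdf/Poppler, which is assumed to be installed and on the PATH. The helper name extract_pdf_text and its runner parameter are hypothetical.

```ruby
require "open3"

# Hypothetical helper, not part of any plugin mentioned in this thread.
# Assumes the `pdftotext` command-line tool (Xpdf/Poppler) is installed.
# `runner` exists only so the external command can be stubbed out.
def extract_pdf_text(path, runner: Open3.method(:capture2))
  # Passing "-" as the output file makes pdftotext write to stdout.
  output, status = runner.call("pdftotext", path, "-")
  raise "pdftotext failed for #{path}" unless status.success?

  # Collapse runs of whitespace so the text is easier to index later.
  output.gsub(/\s+/, " ").strip
end
```

Calling extract_pdf_text("manual.pdf") (a made-up file name) would return the document's text as a single string, which can then be stored or handed to whatever search index is in use.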

ripan · June 3, 2008, 3:26pm

Someone submitted a patch for acts_as_solr to index documents. Check
the Google group for that project.

I didn’t think Solr would do this, since it provides indexing and
querying but not parsing of rich formats. However, there seems to be a
patch that extracts text (but not metadata) from rich documents into
Solr; see the UpdateRichDocuments page on the Solr wiki. The Solr
committers are reluctant to use that patch, though, and would rather
build a bridge from Apache Tika to Solr, even if that is further down
the road.

I did find the patch to acts_as_solr here:
http://www.nabble.com/Rich-Document-support-for-solr-ruby-and-acts_as_solr-p17161561.html
But since this patch relies on the uncommitted Solr patch, I wouldn’t
count on it being viable in the long term.

A less tenuous solution may be to extract the text from a PDF via some
other library (perhaps rpdf2txt or PDFBox), and then index it using the
standard acts_as_solr.
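
The extract-then-index idea can be sketched in plain Ruby. This is an illustration under stated assumptions, not acts_as_solr itself: PdfDocument and KeywordIndex are hypothetical names, the extractor is any callable that returns plain text (for example a wrapper around rpdf2txt or PDFBox), and a small in-memory inverted index stands in for Solr.

```ruby
# Hypothetical stand-in for a model. In a Rails app this would be an
# ActiveRecord model with a text column holding the extracted contents.
class PdfDocument
  attr_reader :path, :body

  # `extractor` is any callable mapping a file path to plain text.
  # It is an assumption of this sketch, not an API of rpdf2txt or PDFBox.
  def initialize(path, extractor)
    @path = path
    @body = extractor.call(path)
  end
end

# Tiny in-memory inverted index, standing in for the Solr index.
class KeywordIndex
  def initialize
    @postings = Hash.new { |hash, word| hash[word] = [] }
  end

  # Record the document under every distinct word of its extracted text.
  def add(document)
    document.body.downcase.scan(/\w+/).uniq.each do |word|
      @postings[word] << document
    end
  end

  # Return the documents that contain every keyword in the query.
  def search(query)
    words = query.downcase.scan(/\w+/)
    return [] if words.empty?

    words.map { |word| @postings.fetch(word, []) }.reduce(:&)
  end
end

# Usage with a fake extractor, so no PDF library is needed here.
fake_extractor = ->(path) { path == "a.pdf" ? "Rails search guide" : "Cooking recipes" }
index = KeywordIndex.new
index.add(PdfDocument.new("a.pdf", fake_extractor))
index.add(PdfDocument.new("b.pdf", fake_extractor))
matches = index.search("rails guide").map(&:path)
```

With acts_as_solr, the extracted text would instead be saved to a model attribute that the plugin indexes (as far as I recall, declared with something like acts_as_solr :fields => [:body] and queried with find_by_solr; check the plugin's README before relying on that).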

Mark.