Extract text from PDF file

dubstep · January 31, 2011, 6:12pm

Hi,
In my upcoming application we will be uploading PDF files.
After a PDF file is uploaded, I have to extract its text and
display it to the user.
Can anyone tell me how to extract text from a PDF file?
Is there a plugin or gem for this?
Thanks,
Tushar

gandhi-tush · January 31, 2011, 6:33pm

I did this using Paperclip and defining a processor for Paperclip as
follows:

# lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

# app/models/document.rb
has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]
after_post_process :extract_text

private

def extract_text
  file = File.open("#{pdf.queued_for_write[:text].path}", "r")
  plain_text = ""
  while (line = file.gets)
    plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
  end
  self.plain_text = plain_text # text column to hold the extracted text for searching
end

I had to find and install the creaky-old pdftotext utility on my
server (happily, there was an apt-get package for it) and configure the
path correctly. When Paperclip accepts a PDF upload, it creates a text
extraction of that file and saves it in
system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf
extension, the file itself
is actually just the plain text extracted from the original pdf. After
quite a lot of googling and begging my local Ruby group, I got the
recipe for ripping open that text file and reading it into a variable
to store on the record. The text you get out of pdftotext will vary
wildly in quality and comprehensiveness, but since all I needed was a
way to get a simple search system fed, it works fine for my needs. I
never show this text to anyone, just use it as the “keywords” for
search. You may want/need to present an editing field for the
administrator to clean up these extracted texts.
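
If you only need the text and not the Paperclip plumbing, the same idea can be sketched in plain Ruby. This is a rough sketch, not the processor above: it assumes pdftotext is on the PATH, the method names (pdftotext_command, extract_pdf_text, to_ascii) are made up for illustration, and it uses String#encode in place of Iconv to drop non-ASCII characters.

```ruby
require "open3"

# Hypothetical helper: builds the pdftotext argument list.
# Passing "-" as the output file makes pdftotext write to stdout.
def pdftotext_command(pdf_path, binary = "pdftotext")
  [binary, "-nopgbrk", "-enc", "UTF-8", File.expand_path(pdf_path), "-"]
end

# Hypothetical helper: runs pdftotext and returns the extracted text.
# Assumes the pdftotext binary is installed and reachable.
def extract_pdf_text(pdf_path, binary = "pdftotext")
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, binary))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end

# Drops invalid bytes and anything outside plain ASCII, which is roughly
# what Iconv.conv('ASCII//IGNORE', 'UTF8', line) was used for.
def to_ascii(text)
  text.dup.force_encoding("UTF-8").scrub("").encode("ASCII", undef: :replace, replace: "")
end
```

Something like to_ascii(extract_pdf_text("some.pdf")) would then give a string you could store in the plain_text column. Passing the arguments as an array keeps file names with spaces from being mangled by a shell.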

Walter

gandhi-tush · January 31, 2011, 6:37pm

pdftk, pdfbox (java), pdfkit

Garrett L.

gandhi-tush · January 31, 2011, 9:00pm

I don’t see how these relate to the question – they are apparently
designed to generate PDFs rather than to extract text from existing
PDF documents. Can you point to an example where these libraries can
be used in that fashion? I’d love to use something more professionally
developed than my own system.

Walter

gandhi-tush · January 31, 2011, 7:21pm

I wrote a plugin that requires attachment_fu and some unixy utilities
behind the scenes for this several years back:

It works reliably in Rails 2.x apps. I haven’t tried it with Rails 3
yet. You could fork it and update it (make it work with Paperclip or
Rails 3) if you like, or just have a gander for example code.

Cheers,
Walter

gandhi-tush · January 31, 2011, 9:23pm

PDFBox is the library I’m using on a current project:

There is a link to “Extract Text” under Command Line Utilities. There is
also a section called “Text Extraction” under Tutorials.
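
For what it's worth, calling that command line utility from Ruby could look roughly like the sketch below. The jar location is an assumption (use wherever your pdfbox-app jar lives), the method names are made up, and the ExtractText subcommand name is the one from the PDFBox releases of this era, so check the docs for the version you install.

```ruby
# Hypothetical helper: builds the argument list for PDFBox's
# ExtractText command line utility. jar_path is an assumption.
def pdfbox_extract_command(jar_path, pdf_path, txt_path)
  ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]
end

# Hypothetical helper: runs ExtractText and returns the extracted text.
def pdfbox_extract_text(jar_path, pdf_path, txt_path)
  # system returns false on a non-zero exit and nil if java cannot be run
  ok = system(*pdfbox_extract_command(jar_path, pdf_path, txt_path))
  raise "PDFBox ExtractText failed for #{pdf_path}" unless ok
  File.read(txt_path)
end
```

Starting a JVM per document is slow, so this is probably better suited to a background job than to the upload request itself.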

There is a Ruby command line utility called Docsplit that wraps PDFBox
and might be worth looking into:
http://documentcloud.github.com/docsplit/

For pdftk:
http://pdf-toolkit.rubyforge.org/classes/PDF/Toolkit.html#M000003

Hope this helps,
Garrett L.