I built a system in Rails 2.3.8 that accepted PDF uploads and needed
to extract their text content using the venerable (read: ancient)
pdftotext command-line utility. I had to jump through the following
hoops to make it work, and this might have some bearing on your
solution:
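#model
has_attached_file :pdf,
  :styles => { :text => { :fake => 'variable' } },
  :processors => [:text]
after_post_process :extract_text

private

# Reads the text file written by the :text processor and stores an
# ASCII-only copy of it in the plain_text attribute.
def extract_text
  plain_text = ""
  # The block form closes the file when reading is done.
  File.open(pdf.queued_for_write[:text].path, "r") do |file|
    while (line = file.gets)
      plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
    end
  end
  self.plain_text = plain_text
end
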
#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command
      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk",
                                command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} " +
                          "in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text " +
                              "for #{@basename}" if @whiny
      end
      dst
    end
  end
end

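A side note on the Iconv.conv call in extract_text: Iconv was deprecated in Ruby 1.9.3 and dropped from the standard library in 2.0, so that line only works as written on the older Rubies that Rails 2.3.8 targets. On a newer Ruby, String#encode can probably do roughly the same job. A minimal sketch, assuming the input is UTF-8; the helper name to_ascii is made up for illustration and is not part of the model above:

```ruby
# Hypothetical helper (not in the original model): drops every character
# that has no ASCII equivalent, roughly what ASCII//IGNORE did with Iconv.
def to_ascii(line)
  line.encode("ASCII", :invalid => :replace, :undef => :replace, :replace => "")
end

puts to_ascii("caf\u00e9 r\u00e9sum\u00e9")
```

Characters that cannot be represented are removed rather than transliterated, so accented letters disappear instead of turning into their plain counterparts.
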
Within the environs of Paperclip, you can write processors that do
pretty much anything; they usually produce a new file, saved in a new
format in the attachment’s hierarchy. Once the processor has run, you
can open the result file and do other things with it (in my case,
reading the extracted text into a model attribute). But I’m not sure
that answers your question at all, since you don’t seem to be facing
the same problem I was.
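Stripped of Paperclip, the processor idea boils down to "take a source file, hand back a new Tempfile". Here is a plain Ruby sketch of that shape. UpcaseProcessor is a made-up stand-in that only imitates the pattern and does not use Paperclip at all, and upcasing the text stands in for the call to pdftotext:

```ruby
require "tempfile"

# Hypothetical stand-in for a Paperclip processor: it only imitates the
# "source file in, new Tempfile out" shape and does not use Paperclip.
class UpcaseProcessor
  def initialize(file)
    @file = file
  end

  # Writes a transformed copy of the source into a new Tempfile and
  # returns it, rewound so the caller can read it straight away.
  def make
    dst = Tempfile.new(["upcased", ".txt"])
    @file.rewind
    @file.each_line { |line| dst.write(line.upcase) }
    dst.rewind
    dst
  end
end

# Usage: the caller gets the result file back and works with its contents.
src = Tempfile.new(["source", ".txt"])
src.write("hello\nworld\n")
src.flush

result = UpcaseProcessor.new(src).make
puts result.read
```

The real processor above follows the same pattern: make returns dst, and the after_post_process callback then reads that file back.
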
If your form posts a file to Paperclip, you don’t get access to the
file parts of that form submission directly in your controller, unless
I’m missing something fundamental. But a processor can access them
directly, at a very low level.
Walter