Cleanly handling sub-generated files with Paperclip

Hi,

Let’s say I upload a PDF file. ImageMagick extracts all the pages from it
and stores the PNG images on the hard drive. How can I easily handle all
these generated files with Paperclip?

Has anyone done that before? Thanks for your advice.

Fernando P. wrote:

Hi,

Let’s say I upload a PDF file. ImageMagick extracts all the pages from it
and stores the PNG images on the hard drive. How can I easily handle all
these generated files with Paperclip?

Has anyone done that before? Thanks for your advice.

I’ve done precisely this just recently. It isn’t as tricky as it seems,
really. All you need are a few steps in your PDF processor that
take the extracted images and add them to new records. So, if you have
the following relationship:

class Document < ActiveRecord::Base
  has_many :images
  has_attached_file :file,
                    :styles     => { :original => {} },
                    :processors => [:extract]
end

class Image < ActiveRecord::Base
  belongs_to :document
  has_attached_file :image
end

In your processor, perform the extraction to a temporary folder, and
after it is done, do something like the following:

if @attachment.respond_to?(:instance) and
   @attachment.instance.respond_to?(:images)
  # Remove pages left over from a previous upload.
  @attachment.instance.images.destroy_all

  # Attach every extracted page as a new Image record.
  Dir.glob("#{@temporary}/*.{jpg,png}").each do |path|
    File.open(path) { |file| @attachment.instance.images.create(:image => file) }
  end
else
  raise PaperclipError, "Unable to save extracted pages. No valid attachment."
end

Afterwards make sure to remove the temporary folder and you should be
good.
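
The temporary-folder handling can be sketched with nothing but the standard library. This is a minimal sketch, not Paperclip API: `each_extracted_page` and `fake_extractor` are hypothetical names, and the extractor here only writes stub files where a real one would call your PDF tool. It relies on the block form of `Dir.mktmpdir`, which removes the directory when the block exits, so the cleanup step cannot be forgotten.

```ruby
require "tmpdir"

# Hypothetical helper (not part of Paperclip): runs an extraction step inside
# a throwaway directory and yields each generated image as an open File.
# `extractor` is any callable that writes page images into the given directory.
def each_extracted_page(extractor)
  Dir.mktmpdir("pages") do |dir|
    extractor.call(dir)
    # Sort so the pages are handed over in a stable order.
    Dir.glob(File.join(dir, "*.{jpg,png}")).sort.each do |path|
      File.open(path, "rb") { |file| yield file }
    end
  end # the block form of Dir.mktmpdir deletes the directory here
end

# Stand-in extractor for illustration only; a real one would shell out
# to a PDF-to-image tool and write its output into `dir`.
fake_extractor = lambda do |dir|
  %w[page-1.png page-2.png].each do |name|
    File.write(File.join(dir, name), "stub")
  end
end

seen = []
each_extracted_page(fake_extractor) { |file| seen << File.basename(file.path) }
```

Inside a processor, the block passed to the helper would be the `@attachment.instance.images.create(:image => file)` call from the snippet above.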

Parker S. wrote:
Interesting approach. Any particular problems you ran into in practice?
Too many files for the filesystem? Database blowing up? Other?

Fernando P. wrote:

Interesting approach. Any particular problems you ran into in practice?
Too many files for the filesystem? Database blowing up? Other?

It has worked really well in practice. The failing point was always
ImageMagick, really. We ended up using pdf2image instead, which yielded
much better output, much faster. We’ve processed 120+ page documents, so
the number of files wasn’t a problem. Given the time it takes to process
the images (assuming you are resizing / thumbnailing), you’ll certainly
want to do the work in a background processor, though: Delayed Job,
Resque, or the like.
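
To show the shape of that hand-off without pulling in a gem, here is a rough sketch that uses a plain worker thread and Ruby's built-in `Queue` as a stand-in for Delayed Job or Resque. It is not how either library works internally, and `PageJobQueue` and its methods are hypothetical names. The point is only that the upload request enqueues a document id and returns, while the slow extraction runs elsewhere.

```ruby
# Stand-in for a real job queue (Delayed Job, Resque): an in-process worker
# thread fed by a thread-safe Queue. All names here are hypothetical.
class PageJobQueue
  def initialize(&handler)
    @jobs = Queue.new
    @worker = Thread.new do
      # Queue#pop blocks until a job arrives; :stop ends the loop.
      while (job = @jobs.pop) != :stop
        handler.call(job)
      end
    end
  end

  # Returns immediately; the work happens on the worker thread.
  def enqueue(document_id)
    @jobs << document_id
  end

  # Lets queued jobs finish, then waits for the worker to exit.
  def shutdown
    @jobs << :stop
    @worker.join
  end
end

processed = []
queue = PageJobQueue.new do |document_id|
  # A real handler would load the document and run the page extraction.
  processed << "extracted pages for document #{document_id}"
end
queue.enqueue(1)
queue.enqueue(2)
queue.shutdown
```

A real queue adds what this sketch lacks: persistence across restarts, retries, and workers in separate processes.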