Problem processing text file after uploading

I’ve got a web-app currently partially working. The user uploads a .txt,
.docx or .doc file to the server.

Currently the model handles those files, saves some metadata (the
extension and original filename), then saves the file to the hard drive. Next
it converts the doc and docx files to plain text and saves the output to
a txt file.

My problem is I want to copy the plain text contents of those txt files
to the :body field in my database, but by the time those files are
written no more changes can be sent to the database (because all the
file handling is done in after_save).

Where or how do I sanely get the contents of those TXT files into the
database?

See model attached:

On Jul 7, 2012, at 11:11 AM, David M. wrote:

written no more changes can be sent to the database (because all the
file handling is done in after_save)

Where or how do I sanely get the contents of those TXT files into the
database?

I built this feature in my first commercial Rails app. I used Paperclip
for my file storage, which offers its own callback called
‘after_post_process’ that worked out perfectly for me.

First, I created a Paperclip processor to extract the text version of
the uploaded file (mine were all PDF).

/lib/paperclip_processors/text.rb

module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

Then in my document.rb (model for the file attachment), I added the
following bits:

has_attached_file :pdf,
                  :styles => { :text => { :fake => 'variable' } },
                  :processors => [:text]

after_post_process :extract_text

private

def extract_text
  file = File.open("#{pdf.queued_for_write[:text].path}", "r")
  plain_text = ""
  while (line = file.gets)
    plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
  end
  self.plain_text = plain_text
end

And that was that.
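
A note for readers on newer Rubies: Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0. A minimal sketch of the same "drop anything that is not ASCII" step using String#encode might look like the following. The helper name to_ascii is hypothetical and is not part of the code above.

```ruby
# Hypothetical helper (not from the original post): drops every character
# that has no ASCII equivalent, similar in spirit to
# Iconv.conv('ASCII//IGNORE', 'UTF8', line).
def to_ascii(text)
  text.encode("ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "")
end

puts to_ascii("r\u00e9sum\u00e9 text") # the accented characters are dropped
```

Inside extract_text, this would take the place of the Iconv.conv call on each line.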

Walter

But…paperclip is OLD and unmaintained, and this is also a learning
project.

So is there some (best practices) way to do the following things without
having to make another pass over my doc_file or using paperclip:

  1. upload .doc and store metadata
  2. convert to plain text and write .txt to hard drive
  3. grab contents of .txt file and store in database

On Sat, Jul 7, 2012 at 8:11 AM, David M. wrote:

Currently the model handles those files, saves some metadata (the
extension and orig filename) then saves the file to the hard drive. Next
it converts the doc and docx files to plain text and saves the output to
a txt file.

My problem is I want to copy the plain text contents of those txt files
to the :body field in my database, but by the time those files are
written no more changes can be sent to the database (because all the
file handling is done in after_save)

Wouldn’t the obvious answer be to do the file handling in before_save?

And is there a reason to write the text to a file in the first place if
you’re just going to save it in the DB?
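
To illustrate the before_save idea, here is a minimal plain-Ruby sketch. FakeRecord is a hypothetical stand-in that only mimics the order in which ActiveRecord runs before_save, the database write, and after_save; it is not ActiveRecord itself. Upload, EarlyUpload, LateUpload, and the upload.txt filename are assumed names as well. The point it tries to show is that a body assigned before the write ends up in the stored row, while one assigned afterwards does not unless the record is saved again.

```ruby
require "tmpdir"

# Hypothetical stand-in for an ActiveRecord model. It only mimics the order
# before_save -> database write -> after_save; it is not the real API.
class FakeRecord
  attr_accessor :body
  attr_reader :stored_row

  def save
    before_save
    @stored_row = { body: body } # stands in for the INSERT/UPDATE
    after_save
    true
  end

  def before_save; end
  def after_save; end
end

# Shared file handling: write the .txt to disk, then copy it into body.
class Upload < FakeRecord
  def initialize(dir, text)
    @dir = dir
    @text = text
  end

  def store_docfile
    path = File.join(@dir, "upload.txt")
    File.write(path, @text)     # the file the client would download
    self.body = File.read(path) # the copy meant for the database
  end
end

# File handling before the write: body is part of the stored row.
class EarlyUpload < Upload
  def before_save
    store_docfile
  end
end

# File handling after the write: the assignment comes too late.
class LateUpload < Upload
  def after_save
    store_docfile
  end
end

Dir.mktmpdir do |dir|
  early = EarlyUpload.new(dir, "plain text")
  early.save
  late = LateUpload.new(dir, "plain text")
  late.save
  puts early.stored_row[:body].inspect # "plain text"
  puts late.stored_row[:body].inspect  # nil
end
```

In both cases the file is written to disk; only the timing of the body assignment relative to the write differs.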


Hassan S. ------------------------ [email protected]

twitter: @hassan

Hassan S. wrote in post #1067807:

Wouldn’t the obvious answer be to do the file handling in before_save?

And is there a reason to write the text to a file in the first place if
you’re just going to save it in the DB?

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

The idea about saving the txt files to disk is so that the client can
download them via ftp.

edit:
And to answer the next question, the reason we want the body of the
txt’s in the db is for search functionality.

On Sat, Jul 7, 2012 at 9:21 AM, David M. wrote:

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

Well, since it’s a “learning project” maybe that would be a good place
to start :)

Alternatively, you might consider pushing the doc-to-text conversion
into a background job, which adds the text to the db record once it’s
finished. Or use an Observer to add the text after after_save.
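
As a rough illustration of the background-job idea, here is a plain-Ruby sketch using only the standard library. RECORDS and JOBS are hypothetical in-memory stand-ins for the database table and a job queue, and the "conversion" is faked with a string; a real app would use a job library and the actual converter.

```ruby
require "thread"

# Hypothetical in-memory stand-ins: RECORDS plays the role of the database
# table and JOBS the role of a job queue.
RECORDS = { 1 => { filename: "report.doc", body: nil } }
JOBS = Queue.new

worker = Thread.new do
  while (id = JOBS.pop) != :stop
    record = RECORDS[id]
    # Stand-in for the real doc-to-text conversion step.
    record[:body] = "text extracted from #{record[:filename]}"
  end
end

JOBS << 1     # enqueued right after the record has been saved
JOBS << :stop # tells the worker to finish
worker.join

puts RECORDS[1][:body]
```

The record is saved first with an empty body, and the worker fills it in later, so the upload request does not have to wait for the conversion.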

Multiple possibilities…


Hassan S.

On 7 July 2012 17:21, David M. wrote:

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

The idea about saving the txt files to disk is so that the client can
download them via ftp.

With files it is often better just to store them in files and not in
the database. Certainly they should not be stored in both file and
database.

Colin

On 7 July 2012 18:02, David M. wrote:

Hassan S. wrote in post #1067812:

On Sat, Jul 7, 2012 at 9:21 AM, David M. wrote:

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

Well, since it’s a “learning project” maybe that would be a good place
to start :)

Any hints?

Have a look at the Rails Guide on debugging for techniques that can be
used to debug your code. If you still can’t work out what is going on
then come back with the details of the section of code that is failing
to do what you expect.

Colin

On Sat, Jul 7, 2012 at 10:02 AM, David M. wrote:

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

Well, since it’s a “learning project” maybe that would be a good place
to start :)

Any hints?

Start by defining exactly what “doesn’t seem to function” means :)


Hassan S.

Hassan S. wrote in post #1067817:

Start by defining exactly what “doesn’t seem to function” means :)

When outside of after_save, a database entry gets created, but file_data
doesn’t get saved to the hard drive.

On 7 July 2012 18:12, David M. wrote:

When outside of after_save, a database entry gets created, but file_data
doesn’t get saved to the hard drive.

You need to do some debugging to see what is going on. Is the save
failing or is it not getting to the save statement for some reason?
Having worked out which of those is happening then do more debugging
to find out why.

Colin

Hassan S. wrote in post #1067812:

On Sat, Jul 7, 2012 at 9:21 AM, David M. wrote:

The file handling code I have doesn’t seem to function unless it happens
after_save, I’m not sure why that is.

Well, since it’s a “learning project” maybe that would be a good place
to start :)

Any hints?

I know you guys seem to be sticking to the RTFM hardline, but it seems
as though debugging in the model has very few options without importing
a bunch of gems.

Even on the page recommended there are 35 mentions of controller, and
only 4 mentions of model.

I installed debugger ‘gem install debugger’, but it doesn’t integrate at
all with webrick (‘rails s’) and there apparently is no ruby-debug for
1.9.3 (ughh…)

I’ve put a bunch of logger.info in my model, but I now know no more than
I did before.

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info “we are now in store_docfile”
message.

I have a feeling this might be something deeper than a tiny typo *shrug*

If one of you could PLEASE just look at my model and help me figure out
what’s up, it would be appreciated.

On Sat, Jul 7, 2012 at 1:24 PM, David M. wrote:

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info “we are now in store_docfile”
message.

I don’t see any obvious problems in your original file.

If not with after_save, how are you calling store_docfile now? You
might want to post your new code for the model (and controller).


Hassan S.

On Sat, Jul 7, 2012 at 10:12 AM, David M. wrote:

When outside of after_save, a database entry gets created, but file_data
doesn’t get saved to the hard drive.

OK, why not?

As Colin suggested, study the debugging guide (or just put logging
statements in the code to see what’s happening at each step).


Hassan S.

On Sat, Jul 7, 2012 at 3:55 PM, David M. wrote:

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info “we are now in store_docfile”
message.

In your new example file, it’s no surprise you’re not seeing anything –
you’re never calling store_docfile at all. (No, that random standalone
:store_docfile doesn’t do what you’re hoping it does.)

Either invoke it from a before_save, or make it a non-private method
(at least temporarily) and invoke it explicitly from your controller and
see what happens.
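
To make the point about the standalone symbol concrete, here is a small plain-Ruby sketch (no Rails involved). SilentDocument and CallingDocument are hypothetical classes, not the poster's model. A bare :store_docfile in a class body is just a symbol literal that Ruby evaluates and throws away, so the method only runs when something actually calls it.

```ruby
# Hypothetical model-like classes; plain Ruby, no Rails involved.
class SilentDocument
  attr_reader :log

  def initialize
    @log = []
  end

  :store_docfile # bare symbol literal: evaluated and discarded, calls nothing

  def save
    true # store_docfile is never invoked from here or anywhere else
  end

  private

  def store_docfile
    @log << "we are now in store_docfile"
  end
end

class CallingDocument < SilentDocument
  def save
    store_docfile # explicit call, standing in for what a before_save hook does
    super
  end
end

silent = SilentDocument.new
silent.save
calling = CallingDocument.new
calling.save

puts silent.log.inspect  # []
puts calling.log.inspect # ["we are now in store_docfile"]
```

In a Rails model the explicit call would normally come from a callback declaration such as before_save :store_docfile rather than from an overridden save.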


Hassan S.

Hassan S. wrote in post #1067836:

I don’t see any obvious problems in your original file.

If not with after_save, how are you calling store_docfile now? You
might want to post your new code for the model (and controller).

The controller is a typical unmodified scaffolded CRUD/REST.

The (non functional) model is attached.

On Saturday, 7 July 2012 11:44:12 UTC-4, Ruby-Forum.com User wrote:

But…paperclip is OLD and unmaintained, and this is also a learning
project.

Perhaps you could start by “learning” how to decide whether a gem is
unmaintained. For instance:

doesn’t exactly look like “no activity” to me…

–Matt J.

On 7 July 2012 21:24, David M. wrote:

I’ve put a bunch of logger.info in my model, but I now know no more than
I did before.

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info “we are now in store_docfile”
message.

That is the clue then, but you are misinterpreting what you are
seeing. If it is not getting to the first line then it is not in fact
calling the method at all. Check out how you are calling it.

Colin