How to read a Microsoft Word document in Ruby on Rails?

aris · September 13, 2012, 3:51pm

Hello Everyone,
I'm looking to parse doc/docx files in Ruby on Rails.
I have used File.open('filename', 'r'), but it shows special characters
instead of the content of the file.

Thanks.

rovin_varshney · September 13, 2012, 4:01pm

On Sep 13, 2012, at 7:35 AM, rovin varshney wrote:

Hello Everyone,
I'm looking to parse doc/docx files in Ruby on Rails.
I have used File.open('filename', 'r'), but it shows special characters
instead of the content of the file.

If all you want is the text content of the files, you can try the
ancient Unix utility catdoc to do that. Just back-tick to that command
(and make sure it’s installed in your Web server’s path). The result
will not be pretty, but it will have all of the words in it.

Walter

rovin_varshney · September 15, 2012, 3:28pm

The docx format is actually pretty simple: it is a zipped set of
files. If you upload it to the server and unzip it, you’ll see a set
of xml files. You can poke around and figure out the format, or you
can find a spec on line.
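
As a rough sketch of that idea (not from the original post): the main body of a docx normally lives in the archive member word/document.xml, where each paragraph is a <w:p> element and its text sits in <w:t> runs. The helper names below are hypothetical, the `unzip` command is assumed to be on the server's PATH, and the regex-based extraction is a shortcut that ignores nested paragraphs and numeric character references; a real XML parser would be more robust.

```ruby
require "open3"

# The five predefined XML entities; numeric character references are not handled.
XML_ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&quot;" => '"', "&apos;" => "'", "&amp;" => "&"
}.freeze

# Returns one string per <w:p> paragraph, joining the text of its <w:t> runs.
def paragraphs_from_document_xml(xml)
  xml.scan(%r{<w:p[ >].*?</w:p>}m).map do |para|
    runs = para.scan(%r{<w:t(?:\s[^>]*)?>(.*?)</w:t>}m).flatten
    runs.join.gsub(/&(?:lt|gt|quot|apos|amp);/) { |entity| XML_ENTITIES[entity] }
  end
end

# Pulls word/document.xml out of the zip with `unzip -p` and returns plain text.
def docx_text(path)
  xml, status = Open3.capture2("unzip", "-p", path, "word/document.xml")
  raise "could not read #{path}" unless status.success?
  # Assumes the part is UTF-8 encoded, which is the usual case.
  paragraphs_from_document_xml(xml.force_encoding("UTF-8")).join("\n")
end
```

Something like `puts docx_text("example.docx")` would then print one paragraph per line.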

rovin_varshney · September 15, 2012, 4:08pm

On Sep 15, 2012, at 7:27 AM, Paul wrote:

The docx format is actually pretty simple…

You are really cruel to toy with him like that

–
Scott R.

rovin_varshney · September 16, 2012, 12:17pm

Hi Walter Lee D., Paul,

Could you please give a code snippet or some more clarification
about parsing a doc file?

rovin_varshney · September 16, 2012, 3:00pm

On Sunday 16 September 2012 05:58 PM, Dheeraj K. wrote:

Hi Walter Lee D. , Paul

You are really cruel to toy with him like that

PDFTron may be useful. Google for “PDFTron Ruby Integration” programs.

rovin_varshney · September 16, 2012, 6:13pm

For a start, here’s the man page for catdoc, which you will need to
install.

Then, read up on using the system() or backtick operators in a Ruby
script to engage it. You’ll need to have a path to the file you want to
process, which is highly dependent on the system you’re using to store
the files. In Paperclip, I made this processor to extract text from PDF
files (pdftotext ships with Xpdf/Poppler rather than with catdoc, but it
is driven from the command line in the same way):

# lib/paperclip_processors/text.rb

module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

Depending on how you are uploading your files, your mileage may vary. At
the very simplest, the command would be

text_contents = `/usr/bin/catdoc /root/relative/path/to/file.doc`

But that’s hopelessly naive and will blow up on any error.
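
A slightly more defensive sketch of the same call, for what it's worth: it uses Open3.capture3 from Ruby's standard library, passes the arguments as a list so no shell is involved, and turns a missing binary or a non-zero exit status into one error class. The names TextExtractionError, run_extractor and doc_text are made up for illustration, and catdoc is assumed to be installed and on the PATH.

```ruby
require "open3"

# Hypothetical error class for any failure while shelling out.
class TextExtractionError < StandardError; end

# Runs an external command (argument list, no shell) and returns its stdout.
def run_extractor(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  unless status.success?
    raise TextExtractionError, "#{argv.first} failed: #{stderr.strip}"
  end
  stdout
rescue Errno::ENOENT
  raise TextExtractionError, "#{argv.first} is not installed or not on the PATH"
end

# Extracts the text of a .doc file; the catdoc location can be overridden.
def doc_text(path, catdoc: "catdoc")
  run_extractor(catdoc, path)
end
```

In a controller or background job you could then rescue TextExtractionError around `doc_text(path)` and log the message rather than letting the request fail.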

Walter

rovin_varshney · September 16, 2012, 2:29pm

Did you try googling? This was the third link I found.

http://deepakprasanna.blogspot.in/2011/06/parsing-pdfdocdocx-content-with-apache.html

Dheeraj K.

rovin_varshney · September 18, 2012, 7:41am

Hello Everyone,
Thanks, everyone. I finally found a solution while searching for the things
you all had explained.
There is a docx gem for parsing docx files, and docx-html for converting
them into HTML.

require 'docx'

d = Docx::Document.open('example.docx')
d.each_paragraph do |p|
  puts p
end

And for a docx file stored on Amazon S3:

Docx::Document.open(open('http://S3-URL/original.docx', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))

A big thanks to all.