Copying text from MS Word and wrapping in HTML - help please

aris · June 27, 2012, 12:12pm

Hi,

I’m new to Ruby and was wondering if someone can help me out with this.
I build websites on a daily basis and use HTML all the time, but seem to
receive content people want publishing in Word documents all the time
and have to simply wrap text in paragraph tags, put ul tags around
unordered lists and various other simple and repetitive tasks.

Therefore I gather I could use Ruby to do some of these tedious tasks
for me, but I’m not sure how to do this. I’ve made a start, but it doesn’t seem to
work. The one thing I’m really unsure of how to do is to recognise where
paragraphs and lists start and end in order to wrap the HTML around it.
Any help would be appreciated.

require 'rubygems'
require 'win32/clipboard'
include Win32

# get the text from the clipboard
text = Clipboard.data

# clean up and wrap in HTML

sometext

’
when ‘• List item 1
• List item 2’: ‘

List item 1
List item
2

’
end

# send it back to the clipboard
Clipboard.set_data(text)

# display a message
print "Cleaning up clipboard…"

Thanks in advance.

omgitsads · June 27, 2012, 12:33pm

Hi,

What’s the exact format of the Word files? It might be easier to parse
the original file with all the structure information rather than the
extracted plaintext. For example, .docx files are zipped XML and can easily
be processed with the Nokogiri library. I’ve actually used it to transform
Word documents into simple HTML.

In general, you can recognize paragraphs by looking for empty lines (two
or more newlines with only space in between). The lists can be converted
by starting a new list at the first list entry and closing it at the
last one. This requires a more complex logic than you currently use.
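The approach described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the attachments: the method name text_to_html and the BULLET pattern are made up for the example, and it assumes that paragraphs are separated by blank lines and that list entries start with "•", "-" or "*".

```ruby
# Assumption: a list entry is a line starting with one of these bullet characters.
BULLET = /\A[•\-*]\s+/

# Hypothetical helper: turns plain text into <p> and <ul> markup.
def text_to_html(text)
  html = []
  paragraph = []   # lines collected for the current paragraph
  in_list = false  # true while a <ul> is open

  flush_paragraph = lambda do
    html << "<p>#{paragraph.join(' ')}</p>" unless paragraph.empty?
    paragraph.clear
  end
  close_list = lambda do
    html << '</ul>' if in_list
    in_list = false
  end

  text.each_line do |raw|
    line = raw.strip
    if line.empty?
      # A blank line ends whatever is currently open.
      flush_paragraph.call
      close_list.call
    elsif line =~ BULLET
      # Open the list at the first entry, then add one <li> per entry.
      flush_paragraph.call
      unless in_list
        html << '<ul>'
        in_list = true
      end
      html << "  <li>#{line.sub(BULLET, '')}</li>"
    else
      # An ordinary line closes an open list and extends the paragraph.
      close_list.call
      paragraph << line
    end
  end

  # Close anything still open at the end of the text.
  flush_paragraph.call
  close_list.call
  html.join("\n")
end

puts text_to_html("Some text\nover two lines.\n\n• List item 1\n• List item 2")
```

Special characters are not escaped here, so the output is only safe for text without &, < or >.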

By the way, you don’t have to replace special characters by hand. The
Ruby libraries already have methods for this (I don’t remember the exact
name, but it’s probably in CGI).
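For reference, the method in question is CGI.escapeHTML from the standard library. A small sketch of how it could be used before wrapping text in tags (the sample string is made up):

```ruby
require 'cgi'

# Made-up sample text containing characters that are special in HTML.
raw = 'Fish & chips <b>cost</b> "5" pounds'

# escapeHTML replaces &, <, > and quote characters with HTML entities.
escaped = CGI.escapeHTML(raw)

puts "<p>#{escaped}</p>"
```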

omgitsads · June 27, 2012, 12:42pm

Hi,

Thank you for your reply. Yes, the documents generally come through as
Word 2007 or 2010 documents, so with an extension of docx. OK, I see. So
how did you do yours, if you don’t mind showing bits of your code? That’s
good, I didn’t know Ruby had its own method for replacing special
characters. I will look into this.

Thanks for your help.

omgitsads · June 27, 2012, 12:54pm

OK, it’s in the attachments.

The DocxTree reads a docx file and builds a tree consisting of Node
objects. Each Node represents a basic text element (paragraphs, headings
etc.). You can then specify rules for how to transform the Node objects
into HTML (or anything else). The transformation works recursively, so
you can build a full HTML fragment by transforming the root node.
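The attached files are not reproduced in this thread, so the following is only a guess at the general idea, not the actual DocxTree code. The Node class, the RULES hash and the transform method are hypothetical names for this sketch; it shows a tree of typed nodes being turned into HTML by recursively applying one rule per node type.

```ruby
# Hypothetical stand-in for a Node: a type, some text and child nodes.
class Node
  attr_reader :type, :text, :children

  def initialize(type, text = '', children = [])
    @type = type
    @text = text
    @children = children
  end
end

# One rule per node type. Each rule receives the already transformed
# children (as a string) and the node itself.
RULES = {
  document:  ->(inner, _node) { inner },
  heading:   ->(inner, node) { "<h1>#{node.text}#{inner}</h1>\n" },
  paragraph: ->(inner, node) { "<p>#{node.text}#{inner}</p>\n" },
  list:      ->(inner, _node) { "<ul>\n#{inner}</ul>\n" },
  list_item: ->(inner, node) { "<li>#{node.text}#{inner}</li>\n" }
}

# Transform the children first, then apply the rule for this node.
# Node types without a rule fall back to their plain text.
def transform(node, rules)
  inner = node.children.map { |child| transform(child, rules) }.join
  rule = rules.fetch(node.type) { ->(i, n) { n.text + i } }
  rule.call(inner, node)
end

tree = Node.new(:document, '', [
  Node.new(:heading, 'Title'),
  Node.new(:paragraph, 'Hello'),
  Node.new(:list, '', [Node.new(:list_item, 'One'), Node.new(:list_item, 'Two')])
])

puts transform(tree, RULES)
```

Keeping the rules in a hash separate from the tree means the same tree could be rendered to something other than HTML by passing a different set of rules.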

omgitsads · June 27, 2012, 1:09pm

Thank you very much. So I presume I have to point the location below to
the file location on my computer? Does this mean I need to convert the
Word document into XML format, or can I put the location as for example

file.read 'C:\Users\me\Documents\myDocument.docx' ?

file.read 'word/document.xml'

Thanks once again for your help, really appreciate it.

omgitsads · June 27, 2012, 1:22pm

You only have to supply the path to the docx file:

tree = DocxTree.new 'C:/Users/me/Documents/myDocument.docx'

The initialize method will unzip the XML file of the document and parse
it with Nokogiri.

omgitsads · June 27, 2012, 2:08pm

The files only supply the classes. In order to use them, you have to
create your own script, include docx_tree.rb and then instantiate the
DocxTree class.

Let’s say you name your script “main.rb” and put the two files into the
subdirectory “classes”. Then main.rb would have to look something like
this:

require_relative 'classes/docx_tree.rb' # include the file

# the actual code
my_docx_file = 'C:/Users/me/Documents/myDocument.docx'
docx_tree = DocxTree.new my_docx_file # use the DocxTree class

# do something with docx_tree …

By the way, you need the zip gem and the nokogiri gem for the class to
work. So if you haven’t installed them yet, do it before running the
code.

And you’ll probably have to customize the classes. For example, there
isn’t a method to transform the docx tree directly yet.

omgitsads · June 27, 2012, 9:20pm

I see, thank you very much for your help. Being new to Ruby, think I
need to brush up on my skills before I can take this on as I’m
struggling to get this to work. However hopefully I will be able to get
it up and working in time.

Thanks for all your help.

omgitsads · June 27, 2012, 1:48pm

Thanks. Sorry if I’m missing the obvious here but looking in the two
Ruby files you supplied I can’t see anywhere the line tree =
DocxTree.new? Or does that need adding into the docx_tree.rb file
somewhere?

omgitsads · June 28, 2012, 2:02pm

On Thu, Jun 28, 2012 at 06:24:11PM +0900, Graham Menhennitt wrote:

I build websites on a daily basis and use HTML all the time,
but seem to receive content people want publishing in Word
documents all the time and have to simply wrap text in
paragraph tags, put ul tags around unordered lists and various
other simple and repetitive tasks.

I’m not sure that Ruby is the right tool for this task. Why not
use LibreOffice to read the Word files and then export them as HTML?

There are also antiword and catdoc (but these handle .doc rather
than .docx).

omgitsads · June 28, 2012, 11:25am

On 27/06/2012 20:12, Adam Holloway wrote:

I build websites on a daily basis and use HTML all the time, but seem to
receive content people want publishing in Word documents all the time
and have to simply wrap text in paragraph tags, put ul tags around
unordered lists and various other simple and repetitive tasks.

Therefore I gather I could use Ruby to do some of these tedious tasks
for me, but not sure how to do this.

I’m not sure that Ruby is the right tool for this task. Why not use
LibreOffice to read the Word files and then export them as HTML? I’m
sure it’s possible to automate this in LO too, but I’ll leave that up to
somebody else.

Graham