Docx (or htm, html) editing / substitutions

addis_a · August 25, 2014, 10:31am

I am quite desperate! I am trying to build a tiny app that replaces
strings in a text without destroying its formatting.

When I test my regex and substitutions on plain text, they work
perfectly. They also work perfectly with the docx gem, which
unfortunately destroys the formatting for now. When I convert the docx
file to htm or html, the results are strange. I have tried removing
‘\r’ and converting the files to UTF-8 explicitly, either in MS Word or
via dos2unix, but some strings remain unchanged. When I check the
source, it seems there could still be some stray ‘returns’ or similar;
the affected strings are usually at the end of a line.

The ‘unicode’ gem is needed for downcasing in the regex substitutions.

The main idea is to improve bibliographic records from this version:

SMITH, J. K.; VALON, P.: Work…
to
Smith, J. K.; Valon, P.: Work…

(Excuse those hairy lines, they are needed for my language.) The script
is not yet polished; I still need to decide what the input file will
look like:

#!/usr/bin/ruby
# encoding: utf-8

require "unicode"

class BiblioReplace

  def initialize(file_in, file_out)
    @file_in = file_in
    @file_out = file_out
    # Regex literals keep \s, \r and \n as regex escapes; inside a
    # double-quoted string Ruby would turn them into literal characters.
    @subst01 = /(?<=([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]{1}))([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ][^.a-z\s>]{0,}),([\s\r\n])([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ.])/
    @subst02 = /([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]).,\s([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ])([a-záéěíóúůýčďňřšťž])/
  end

  def file_work()
    # Strip all CR/LF characters from the input into a working copy.
    system("tr -d '\r\n' < #{@file_in} > #{@file_out}")
    @text = open(@file_out, 'r')
    @output = File.open(@file_in, 'w')
    @reg = @text.read.force_encoding(Encoding::UTF_8)
  end

  def text_capitals()
    # $2 is the rest of the surname, $4 the character after the comma.
    @reg.gsub!(@subst01) { |s| Unicode::downcase($2) + ", " + $4 }
  end

  def close_files()
    @output.write(@reg)
    @output.close()
    @text.close()
  end
end

Any help appreciated!

welblaud · August 26, 2014, 8:23pm

Docx is working for me. Try the following:

require 'docx'
require 'unicode'
doc = Docx::Document.open('bib.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    # simple string manipulation,
    # replace with what you need
    trun.text = Unicode::downcase(trun.text)
  end
end
doc.save('out.docx')

I attached the output file I’m getting with this script. For my
installation, the created out.docx has got the same formatting as
bib.docx.

welblaud · August 26, 2014, 8:58pm

Thanks a lot for the test. I have tried it again with the more complex
regex shown above and unfortunately it really still destroys all
formatting. The result looks as if the plain Normal style had been
applied (all sans-serif, no italics) :o/ Could it be connected with the
more complicated substitutions…?

welblaud · August 26, 2014, 11:30pm

Well, that’s not what I’m getting. Running the following code on
bib.docx produces the file out.docx. I added some random formatting for
testing purposes to the first line. Please note that your regex fails
for a few cases when the formatting changes mid-word - #each_text_run
then yields only part of a word/name to the block.

require 'unicode'
require 'docx'

# copied from your code
subst01 = /(?<=([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]{1}))([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ][^.a-z\s>]{0,}),([\s\r\n])([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ.])/

doc = Docx::Document.open('bib.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    # copied from your code
    trun.text = trun.text.gsub(subst01) { |s| Unicode::downcase($2) + ", " + $4 }
  end
end
doc.save('out.docx')

welblaud · August 27, 2014, 6:58am

Well, thanks a lot. I really don’t understand why the formatting is lost
on my server. Your remark about my regex is right; I am aware of that,
but the regex is not meant to handle such cases. On the contrary, it
helps reveal editors’ errors (e.g. an omitted comma or dot, and so on).

I will test that again!

welblaud · August 27, 2014, 7:28am

I hope it worked this time.

Your code seems to be a bit more complicated - to track down bugs it
helps to test with a minimal working example. This has also got the
advantage that it can be posted here so we can reproduce the error
you’re getting. As somebody mentioned, it’s hard to debug invisible
code.

It seems the code from your first post wasn’t intended for docx files,
but just to make sure:

open files in binary mode, ‘rb’ instead of ‘r’
no need for system(‘tr -d …’)
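A minimal sketch of those two tips for the plain-text variant of the script: read the file in binary mode and strip the line endings in Ruby itself instead of shelling out to tr. The helper name read_without_newlines is made up for illustration, and the sketch assumes the file content is UTF-8.

```ruby
require "tempfile"

# Hypothetical helper: read a file in binary mode ('rb'), drop every
# CR and LF (what `tr -d '\r\n'` did), and tag the result as UTF-8.
def read_without_newlines(path)
  raw = File.open(path, "rb") { |f| f.read }
  raw.delete("\r\n").force_encoding(Encoding::UTF_8)
end

# Usage with a throwaway file containing Windows line endings.
Tempfile.create("biblio") do |tmp|
  tmp.binmode
  tmp.write("SMITH, J. K.;\r\nVALON, P.: Work\r\n")
  tmp.flush
  puts read_without_newlines(tmp.path)
end
```

Binary mode matters mostly on Windows, where text mode would translate line endings while reading.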

I’m sorry if it came off as harsh, I just wanted to point it out in case
you were wondering why it wouldn’t work for some entries. It would
require a bit more code than just a different regex anyway.

Another thing, unless that is your coding style, ruby doesn’t require
parentheses for functions without arguments.

welblaud · August 27, 2014, 1:11pm

Well, another one round:

#!/usr/bin/ruby
# encoding: utf-8

require "docx"
require "unicode"

# Get docx file name and create Docx::Document object.
docx_file_name = ARGV[0]
doc = Docx::Document.open(docx_file_name)
# …

# Loop through the paragraphs in the document.
doc.paragraphs.each do |p|
  # Get text of the paragraph.
  reg = p.text
  # Replace if replacements present, or keep original string.
  reg = reg.gsub!(/A/, "XX")

  p.text = reg
end

# Save the document to a different file.
doc.save("edited-#{docx_file_name}")

The app still says:

honza@honza-kvm:~/development_testing$ bundle exec ruby docx.rb biblio_verzalky.docx
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/bookmark.rb:9: warning: already initialized constant Docx::Elements::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/text.rb:6: warning: already initialized constant Docx::Elements::Text::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/containers/text_run.rb:16: warning: already initialized constant Docx::Elements::Containers::TextRun::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/containers/paragraph.rb:11: warning: already initialized constant Docx::Elements::Containers::Paragraph::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here

The formatting is still destroyed. Example is minimal. Source and result
attached.

Thanks!

welblaud · August 27, 2014, 10:30am

Yes, you are right. The very first example was for substitutions made in
textfiles because I was not able to do it directly in docx before… I
will test docx gem more and let you know.

Thanks for the ‘rb’ tip!

welblaud · August 27, 2014, 2:18pm

Fixed the ‘already initialized constant TAG’ warning by converting to…

Download & install the newest version if you don’t like these warnings,
but it’s not critical. The code just tried to initialize a constant
multiple times:

TAG = 'p'

I installed docx via gem, and it gave me these warnings as well, but
it’s not destroying the formatting for me.

########################################

More importantly,

doc.paragraphs.each do |p|
  reg = p.text
  reg = reg.gsub!(/A/, "XX")
  p.text = reg
end

is most certainly going to destroy your formatting. A paragraph consists
of multiple text_runs that may contain different formatting. By setting
the paragraph’s text with ‘p.text = reg’, you’re effectively ignoring
the formatting of the individual text runs.

Or to put it another way, p.text is a string that looks something like
this:

‘Author Name C.K, book title, 1928’

How could the docx gem know how to format your text when you simply
assign a raw string to p.text?
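To make the point concrete without the docx gem, here is a toy model in plain Ruby. The ToyRun struct and both helper methods are invented for this illustration and are not part of the docx API; they only mimic the idea that a paragraph is a list of runs, each carrying its own formatting.

```ruby
# Toy model, NOT the docx gem: a run is a piece of text plus formatting.
ToyRun = Struct.new(:text, :italic)

paragraph = [
  ToyRun.new("Author Name C.K, ", false),
  ToyRun.new("book title", true),
  ToyRun.new(", 1928", false)
]

# Per-run substitution: every run keeps its own formatting flag.
def replace_per_run(runs, pattern, replacement)
  runs.map { |r| ToyRun.new(r.text.gsub(pattern, replacement), r.italic) }
end

# Whole-paragraph substitution: the runs are flattened into one string,
# so nothing is left to say which part was italic.
def replace_whole(runs, pattern, replacement)
  [ToyRun.new(runs.map(&:text).join.gsub(pattern, replacement), false)]
end

per_run = replace_per_run(paragraph, /A/, "XX")
whole   = replace_whole(paragraph, /A/, "XX")

# Mark italic runs with underscores when printing.
puts per_run.map { |r| r.italic ? "_#{r.text}_" : r.text }.join
puts whole.map { |r| r.italic ? "_#{r.text}_" : r.text }.join
```

In the first line printed the title is still marked as italic; in the second the marking is gone, which is the same loss seen in the document.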

Please take another look at the code from my post and use
#each_text_run to fix this problem:

https://www.ruby-forum.com/topic/5495127#1156046

welblaud · August 27, 2014, 2:48pm

Wow, a bit closer! Thank you! Now it seems to preserve the formatting.
However, it still misses some entries. The docx gem is kind of magic to
me, and I don’t know exactly how it parses paragraphs (w:t via
Nokogiri?).

When I check my regex in Rubular, it works well… I don’t know where to
go from here.

If you don’t mind, I am attaching another one, much longer example.

Thanks!

My current code according to your advice:

#!/usr/bin/ruby
# encoding: utf-8

require "docx"
require "unicode"

docx_file_name = ARGV[0]
doc = Docx::Document.open(docx_file_name)

# A regex literal keeps \s, \r and \n as regex escapes.
@subst01 = /(?<=([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]{1}))([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ][^.a-z\s>]{0,}),([\s\r\n])([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ.])/

doc.paragraphs.each do |p|
  p.each_text_run do |trun|
    trun.text = trun.text.gsub(@subst01) { |s| Unicode::downcase($2) + ", " + $4 }
  end
end

doc.save("edited-#{docx_file_name}")

welblaud · August 27, 2014, 4:18pm

> I consider the docx app being kind of magic and don’t know how it exactly
> parse paragraphs (w:t via nokogiri?).

Run this and see for yourself:

require 'docx'
puts Docx::Document.open('bib.docx').paragraphs.first.inspect

Also, take a look at the documentation:

http://rubydoc.info/gems/docx/0.2.03/frames

##################################

The issue you’re having is that docx actually works and preserves the
formatting, and that here WYSIWYG stands for
What-You-See-Isn’t-What-You-Get.

Take a look at the attached image. The entries that aren’t working
aren’t formatted the same way.

The docx file consists of content (the raw text) + formatting. You can
use this code to get an idea of what the raw text looks like:

require 'docx'
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    puts trun.text
  end
  puts '-' * 18
end

This produces:

ORNSTEIN, A, C., LEVINE, D. U.
Foundations of Education
. Boston : Houghton Mifflin 1997.

Pasch, M.,
a kol.
:
Od vzdělávacího programu k vyučovací hodině.
Praha, Portál 1998.

[rest omitted]

In the raw text, the first entry actually says “ORNSTEIN”, but the
second entry says “Pasch”; its capitals come only from the formatting.

###########################

Ruby has got a few useful functions to inspect elements.

par = Docx::Document.open(…).paragraphs.first

par.methods                        # array of methods you can call
par.instance_variables             # array of all instance variables
par.instance_variable_get(:@node)  # get variables w/o attr_reader
par.inspect                        # string listing all instance variables and their values

In particular, par.node gives you the Nokogiri node associated with the
element. Docx is a bit limited in what it can do, but it handles I/O for
you and you could always just access the xml directly. Capitals are
stored as a

<w:caps/>

tag. Something of a hack to get rid of the ‘Capitals’ formatting:

require 'docx'
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  # Note: an XPath starting with // searches the whole document,
  # not only this paragraph.
  par.node.search("//*[local-name()='caps']").remove
  par.each_text_run do |trun|
    # more processing
  end
end

doc.save('out.docx')

welblaud · August 27, 2014, 5:26pm

I’m not certain exactly how you want to format it, but here’s a
suggestion. Run this code on <biblio_verzalky.docx> and it produces
<out.docx>.

require 'unicode'
require 'docx'

letters = 'A-Za-zÁÉĚÍÓÚŮÝČĎŇŘŠŤŽáéěíóúůýčďňřšťž'
regex = /([#{letters}])([#{letters}]*)/

# Remove the 'Capitals' formatting first.
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  par.node.search("//*[local-name()='caps']").remove
end

# Then capitalise each word up to the first italic run.
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    break if trun.formatting[:italic]
    trun.text = trun.text.gsub(regex) do
      Unicode::upcase($1) + Unicode::downcase($2)
    end
  end
end

doc.save('out.docx')
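
The same capitalisation can be tried on plain strings before touching a docx file. This is only a sketch: it assumes Ruby 2.4 or newer, where the built-in String#upcase and String#downcase handle non-ASCII letters, so it leaves out the unicode gem. The name capitalize_words is made up for the example.

```ruby
# encoding: utf-8

LETTERS = 'A-Za-zÁÉĚÍÓÚŮÝČĎŇŘŠŤŽáéěíóúůýčďňřšťž'
WORD = /([#{LETTERS}])([#{LETTERS}]*)/

# Hypothetical helper: upper-case the first letter of each word and
# lower-case the rest, using Ruby's built-in Unicode case mapping.
def capitalize_words(text)
  text.gsub(WORD) { $1.upcase + $2.downcase }
end

puts capitalize_words("SMITH, J. K.; VALON, P.: Work")
puts capitalize_words("ČERNÝ, ŠTĚPÁN")
```

Testing the regex this way separates regex problems from problems caused by how the text is split into runs.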

welblaud · August 28, 2014, 6:59am

Well, with the document version freshly saved in MS Word 2010 the app
says:

biblio_verzalky.docx:1: Invalid char `\x03' in expression
biblio_verzalky.docx:1:in `<main>': uninitialized constant PK (NameError)

As far as I could find, the char should be the end of the text.

I hope we are very close!

welblaud · August 27, 2014, 9:34pm

It seems you will make my day! I had totally forgotten one important
thing: some records were (badly) set in capitals via formatting, others
were not. When I fix that in LibreOffice, it works without any error. I
will test it with our production Word app and let you know. If it works,
I would like to use it in several similar scenarios (many typical
substitutions that MS Word handles only with difficulty).

What’s more, your last remark about removing the caps directly… mmm…
awesome.

Thanks!

welblaud · August 28, 2014, 10:00am

Thanks a lot, I will save that for later. Your basic approach solves
what I need now (the \x03 problem suddenly disappeared). The second
solution, which removes the capitals and uses a different regex, solves
the problem too globally for me; still, it is great to learn how one can
work with this!

If I get stuck, I will let you know.

Thanks!

welblaud · August 28, 2014, 8:16am

Running

ruby biblio_verzalky.docx

produces

biblio_verzalky.docx:1: Invalid char `\x03' in expression
biblio_verzalky.docx:1:in `<main>': uninitialized constant PK (NameError)

Running the code from the earlier post in this thread on the docx file
from the post above works fine for me and produces the expected result.
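
For what it’s worth, that error message is what one would expect when the docx file itself is handed to the Ruby interpreter: a docx is a ZIP archive, and such files normally begin with the bytes “PK” followed by \x03\x04, which Ruby then tries to parse as source code. A small sketch of checking for that signature, with zip_signature? as a made-up helper name:

```ruby
require "tempfile"

# Hypothetical helper: true if the file starts with the ZIP local file
# header signature "PK\x03\x04", as docx files do.
def zip_signature?(path)
  File.binread(path, 4) == "PK\x03\x04".b
end

# Usage with a throwaway file that only imitates the first bytes.
Tempfile.create("fake") do |tmp|
  tmp.binmode
  tmp.write("PK\x03\x04rest of archive".b)
  tmp.flush
  puts zip_signature?(tmp.path)
end
```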

Feel free to use any code I may have given you - it’s simple enough I
won’t claim any sort of ownership ; )

welblaud · August 29, 2014, 6:32pm

The xml of the file first_test.docx looks like this:

<w:r w:rsidRPr="00CD2E21">
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>Adler, M. D.</w:t>
</w:r>
<w:r w:rsidR="009B09A9" w:rsidRPr="00CD2E21">
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>,</w:t>
</w:r>
<w:r w:rsidRPr="00CD2E21">
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t xml:space="preserve"> Posner, E. A.</w:t>
</w:r>

So the text you get from each_text_run won’t be the complete entry.
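
The effect on the substitution can be reproduced with plain strings. In this sketch the three strings stand in for the text of the three runs above, and the pattern is a simplified ASCII-only version of the @subst02 regex from the first post with the dot escaped; both simplifications are assumptions made to keep the example short.

```ruby
# Text of the three runs from the XML above.
runs = ["Adler, M. D.", ",", " Posner, E. A."]

# Simplified, ASCII-only stand-in for @subst02 (assumption): an initial,
# a dot, a comma, whitespace, then a capitalised surname.
pattern = /([A-Z])\.,\s([A-Z])([a-z])/

# Run by run, the pattern never sees "D., P" in one piece.
per_run = runs.map { |t| t.gsub(pattern, '\1. – \2\3') }

# On the joined text it matches across the former run boundaries.
joined = runs.join.gsub(pattern, '\1. – \2\3')

puts per_run.inspect
puts joined
```

The per-run result is identical to the input, while the joined text gets its dash, which is why merging the runs first helps.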

What are attributes “w:rsidR” and “w:rsidRPr” for?

“w:rsidR” is a Revision ID. Each new user on a doc has a new id,
and each of its modification is marked with its RsID.

http://www.tinybutstrong.com/plugins/opentbs/xml_synopsis.txt

So either have all edits done by the same user, or…

##################################################

There are many ways to handle this issue. Let me suggest one possible
solution: merge all the adjacent text_runs with the same formatting
(bold, italic, underline) into one text_run, so that the xml looks like
this:

<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t xml:space="preserve">Adler, M. D. – Posner, E. A., eds.: </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:i/>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>Cost-Benefit Analysis: Legal, Economic, and Philosophical
Perspectives. Chicago: University of Chicago Press, 2001.</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>. Chicago:</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t xml:space="preserve">Applebaum, E. – Katz, E.: Measures of Risk
Aversion and Comparative Statics of Industry Equilibrium. </w:t>
</w:r>

Add the following code to the BiblioReplace class:

def merge!
  @doc.paragraphs.each do |par|
    truns = par.text_runs
    buffer = ''
    i = 0
    n = truns.length - 1
    truns.each do |trun|
      # Collect text while the next run has the same formatting.
      buffer << trun.text
      if i != n && trun.formatting == truns[i+1].formatting
        trun.text = ''
      else
        trun.text = buffer
        buffer.clear
      end
      i += 1
    end
  end
end

And then call BiblioReplace#merge! before calling #dashes.

For reference I attached a working ruby script, the input file, and the
file created by the ruby script.

welblaud · August 29, 2014, 8:38pm

Wow, thanks, will try and test that… For now only one question, please

Is this magic a thing of Nokogiri || docx gem || both and more?

welblaud · August 29, 2014, 10:57am

Close to reliability

However, I have just run into another example where I really can’t
figure out what is going on. If you don’t mind, please try these
examples. They contain the same records but in reversed order. One
passes fine, the other has one record skipped.

The code here is an alternative to the first one. I have turned the
whole thing into a class (don’t laugh, I am trying to run it on a server
in a Sinatra session). The function for the first scenario (BIG LETTERS)
is omitted here:

#!/usr/bin/ruby
# encoding: utf-8

require "docx"
require "unicode"

class BiblioReplace < Docx::Document

  def initialize(docx_file_name)
    @docx_file_name = docx_file_name
    @subst02 = /([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]).,\s([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ])([a-záéěíóúůýčďňřšťž])/
  end

  def open_file
    @doc = Docx::Document.open(@docx_file_name)
  end

  def dashes
    @doc.paragraphs.each do |p|
      p.each_text_run do |trun|
        # Single quotes keep \1, \2, \3 as backreferences; in a
        # double-quoted string "\1" would be the character \x01.
        trun.text = trun.text.gsub(@subst02, '\1. – \2\3')
      end
    end
  end

  def close_file
    @doc.save("OUT.docx")
  end

end

welblaud · August 29, 2014, 8:56pm

Getting rid of all Capitals was Nokogiri magic, but this is neither
Nokogiri nor docx magic - it’s a docx hack because docx lacks several
features (see the author’s comments on his github page).

It doesn’t change anything, but style-wise, the code above should be

def merge!
    @doc.paragraphs.each do |par|
        truns = par.text_runs
        buffer = ''
        n = truns.length - 1
        truns.each_with_index do |trun, i|
            buffer << trun.text
            if i!=n && trun.formatting == truns[i+1].formatting
                trun.text = ''
            else
                trun.text = buffer
                buffer.clear
            end
        end
    end
end
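
The merging logic can also be exercised without a docx file. The sketch below runs the same loop over a toy FakeRun struct (text plus a formatting hash) that is invented for this example; the real text runs come from the docx gem, and their formatting is assumed here to compare with == as in the code above.

```ruby
# Toy stand-in for a docx text run (NOT the gem's class).
FakeRun = Struct.new(:text, :formatting)

# Same idea as merge!: collect text while the next run has identical
# formatting, then write the collected text into the last run of the group.
def merge_runs!(truns)
  buffer = +''
  n = truns.length - 1
  truns.each_with_index do |trun, i|
    buffer << trun.text
    if i != n && trun.formatting == truns[i + 1].formatting
      trun.text = ''
    else
      trun.text = buffer.dup # copy, because the buffer is reused
      buffer.clear
    end
  end
  truns
end

plain  = { italic: false }
italic = { italic: true }
runs = [
  FakeRun.new("Adler, M. D.", plain),
  FakeRun.new(",", plain),
  FakeRun.new(" Posner, E. A.", plain),
  FakeRun.new("Cost-Benefit Analysis", italic)
]

merge_runs!(runs)
puts runs.map(&:text).inspect
```

The emptied runs stay in place, so the number of runs does not change; only their text moves into the last run of each same-formatting group.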
