I am quite desperate! I am trying to build a tiny app able to replace
strings in text without destroying its formatting.
When I test my regex and substitutions on plain text, they work
perfectly, and they also work perfectly in a test with the docx gem
(which, unfortunately, destroys the formatting for now). When I convert
the file (docx) into htm or html, the results are strange. I have tried
removing ‘\r’ and converting the files to UTF-8 explicitly in MS Word or
via dos2unix, but some of the strings remain unchanged. When I check the
source code, it seems there could still be some ‘returns’ or similar
left; the affected strings are usually at the end of a line.
The ‘unicode’ gem is needed for downcasing in the regex substitutions.
The main idea is to improve bibliographic records from this version:
SMITH, J. K.; VALON, P.: Work…
to
Smith, J. K.; Valon, P.: Work…
(Excuse those hairy lines, they are needed for my language.) The script
is not yet polished; I still have to decide what the input file will
look like:
require 'docx'
require 'unicode'
doc = Docx::Document.open('bib.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    # simple string manipulation,
    # replace with what you need
    trun.text = Unicode::downcase(trun.text)
  end
end
doc.save('out.docx')
I attached the output file I’m getting with this script. For my
installation, the created out.docx has got the same formatting as
bib.docx.
Thanks a lot for the test. I have tried that again with the more complex
regex shown above and unfortunately it really still destroys all
formatting. Afterwards everything looks as if the plain Normal style had
been applied (all sans-serif, no italics) :o/ Could it be connected with
the more complicated substitutions…?
Well, that’s not what I’m getting. Running the following code on
bib.docx produces the file out.docx. I added some random formatting for
testing purposes to the first line. Please note that your regex fails
for a few cases when the formatting changes mid-word - #each_text_run
then yields only part of a word/name to the block.
doc = Docx::Document.open('bib.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    # copied from your code
    trun.text = trun.text.gsub(subst01) { |s| Unicode::downcase($2)
Well, thanks a lot. I really don’t understand why the formatting is lost
on my server. Your remark about my regex is right. I am aware of that,
but the regex is not intended to fix such cases; on the contrary, it
helps reveal editors’ errors (e.g. if a comma or dot is omitted and so
on).
Your code seems to be a bit more complicated - to track down bugs it
helps to test with a minimal working example. This has also got the
advantage that it can be posted here so we can reproduce the error
you’re getting. As somebody mentioned, it’s hard to debug invisible
code.
It seems the code from your first post wasn’t intended for docx files,
but just to make sure:
open files in binary mode, ‘rb’ instead of ‘r’
no need for system(‘tr -d …’)
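For the plain-text variant, the two points above could look roughly like this. It is only a sketch: the helper name read_without_cr is made up, and it assumes the input is UTF-8 text with Windows line endings.

```ruby
# Hypothetical helper: read a text file in binary mode and drop carriage
# returns in Ruby instead of shelling out to `tr -d`.
def read_without_cr(path)
  # 'rb' turns off newline and encoding conversion, so the bytes arrive as stored
  raw = File.open(path, 'rb') { |f| f.read }
  # Assumption: the file is UTF-8. Tag the bytes accordingly, then strip "\r".
  raw.force_encoding(Encoding::UTF_8).delete("\r")
end
```

The result is an ordinary UTF-8 string with "\n" line endings, so the regex substitutions can be applied to it directly.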
I’m sorry if it came off as harsh, I just wanted to point it out in case
you were wondering why it wouldn’t work for some entries. It would
require a bit more code than just a different regex anyway.
Another thing: unless that is your coding style, Ruby doesn’t require
parentheses for method calls without arguments.
#!/usr/bin/ruby
# encoding: utf-8

require "docx"
require "unicode"

# Get docx file name and create Docx::Document object.
docx_file_name = ARGV[0]
doc = Docx::Document.open(docx_file_name)
…

# Loop through the paragraphs in the document.
doc.paragraphs.each do |p|
  # Get text of the paragraph.
  reg = p.text
  # Replace if replacements present, or keep original string.
  reg = reg.gsub!(/A/, "XX")
  …
  p.text = reg
end

# Save the document to a different file.
doc.save("edited-#{docx_file_name}")
The app still says:
honza@honza-kvm:~/development_testing$ bundle exec ruby docx.rb biblio_verzalky.docx
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/bookmark.rb:9: warning: already initialized constant Docx::Elements::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/text.rb:6: warning: already initialized constant Docx::Elements::Text::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/containers/text_run.rb:16: warning: already initialized constant Docx::Elements::Containers::TextRun::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/containers/paragraph.rb:11: warning: already initialized constant Docx::Elements::Containers::Paragraph::TAG
/home/honza/development_testing/vendor/bundle/gems/docx-0.2.03/lib/docx/elements/element.rb:13: warning: previous definition of TAG was here
The formatting is still destroyed. Example is minimal. Source and result
attached.
Yes, you are right. The very first example was for substitutions made in
textfiles because I was not able to do it directly in docx before… I
will test docx gem more and let you know.
Fixed the ‘already initialized constant TAG’ warning by converting to…
Download & install the newest version if you don’t like these warnings,
but it’s not critical. The code just tried to initialize a constant
multiple times:
TAG = 'p'
I installed docx via gem, and it gave me these warnings as well, but
it’s not destroying the formatting for me.
########################################
More importantly,
doc.paragraphs.each do |p|
  reg = p.text
  reg = reg.gsub!(/A/, "XX")
  p.text = reg
end
is most certainly going to destroy your formatting. A paragraph consists
of multiple text_runs that may contain different formatting. By setting
the paragraph’s text with ‘p.text = reg’, you’re effectively ignoring
the formatting of the individual text runs.
Or to put it another way, p.text is a string that looks something like
this:
‘Author Name C.K, book title, 1928’
How could the docx gem know how to format your text when you simply
assign a raw string to p.text?
Please take another look at the code from my post and use #each_text_run to fix this problem:
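A minimal sketch of what the per-run version could look like for the /A/ to "XX" example, assuming the text-run objects expose text and text= as in the snippets above. The helper name replace_in_runs is made up for illustration, and it is written against duck-typed objects so it can be tried without a .docx file:

```ruby
# Hypothetical helper: apply a substitution to every text run separately,
# so each run keeps its own formatting.
# `doc` is anything whose #paragraphs respond to #each_text_run,
# e.g. the object returned by Docx::Document.open in the snippets above.
def replace_in_runs(doc, pattern, replacement)
  doc.paragraphs.each do |par|
    par.each_text_run do |trun|
      # Non-destructive gsub always returns a string
      # (gsub! returns nil when nothing matched).
      trun.text = trun.text.gsub(pattern, replacement)
    end
  end
  doc
end
```

With the docx gem this would be called as replace_in_runs(doc, /A/, "XX") between Docx::Document.open and doc.save.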
Wow, a bit closer! Thank you! Now it seems to preserve the formatting.
However, it still misses some matches. I consider the docx gem to be kind
of magic and don’t know exactly how it parses paragraphs (w:t via
nokogiri?).
When I check my regex in Rubular, it works well… I don’t know where
to go from here.
If you don’t mind, I am attaching another one, much longer example.
The issue you’re having is that docx actually works and preserves the
formatting. And that WYSIWYG stands for What-You-See-Isnt-What-You-Get.
Take a look at the attached image. The entries that aren’t working
aren’t formatted the same way.
The docx file consists of content (the raw text) + formatting. You can
use this code to get an idea of what the raw text looks like:
require 'docx'
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    puts trun.text
  end
  puts '-' * 18
end
This produces:
ORNSTEIN, A, C., LEVINE, D. U.
Foundations of Education
. Boston : Houghton Mifflin 1997.
Pasch, M.,
a kol.
:
Od vzdělávacího programu k vyučovací hodině.
Praha, Portál 1998.
[rest omitted]
The first entry actually says “ORNSTEIN”, but the second entry says
“Pasch”; it only looks capitalized because of its formatting.
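To list which runs are really stored in capitals (as opposed to merely displayed that way), a check along these lines could be run on each trun.text. The method name is invented for this sketch; it relies on \p{L} and \p{Lu}, the Unicode letter classes in Ruby regexes:

```ruby
# Hypothetical check: true if every letter in the raw text is an uppercase
# letter, i.e. the capitals are in the content and not only in the formatting.
def stored_uppercase?(text)
  letters = text.scan(/\p{L}/)
  # Text without any letters (e.g. ":") is not counted as uppercase.
  !letters.empty? && letters.all? { |ch| ch =~ /\p{Lu}/ }
end
```

Runs for which this returns false but which show up in capitals in Word are the ones whose capitals come from formatting.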
###########################
Ruby has got a few useful functions to inspect elements.
par = Docx::Document.open(…).paragraphs.first
par.methods # array of methods you can call
par.instance_variables # array of all instance variables
par.instance_variable_get(:@node) #get variables w/o attr_reader
par.inspect # string listing all instance variables and their value
In particular, par.node gives you the Nokogiri node associated with the
element. Docx is a bit limited in what it can do, but it handles I/O for
you and you could always just access the xml directly. Capitals are
stored as a
<w:caps/>
tag. Something of a hack to get rid of the ‘Capitals’ formatting:
require 'docx'
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  par.node.search("//*[local-name()='caps']").remove
  par.each_text_run do |trun|
    # more processing
  end
end
doc = Docx::Document.open('biblio_verzalky.docx')
doc.paragraphs.each do |par|
  par.node.search("//*[local-name()='caps']").remove
end
doc.paragraphs.each do |par|
  par.each_text_run do |trun|
    break if trun.formatting[:italic]
    trun.text = trun.text.gsub(regex) do
      Unicode::upcase($1) + Unicode::downcase($2)
    end
  end
end
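The block form of gsub used above can be tried on a plain string first. In this sketch the pattern is an assumption (the thread's own regex is not shown here), the names are invented, and Ruby 2.4 or newer is assumed so that the built-in String#downcase handles accented letters; on older Rubies Unicode::downcase from the unicode gem would take its place.

```ruby
# Hypothetical pattern: a run of two or more uppercase letters.
SURNAME_IN_CAPS = /(\p{Lu})(\p{Lu}+)/

# Keep the first letter and downcase the rest of each all-caps word.
# Single initials such as "J." are left alone because they are only one letter.
def fix_caps(text)
  text.gsub(SURNAME_IN_CAPS) { $1 + $2.downcase }
end

puts fix_caps('SMITH, J. K.; VALON, P.: Work…')
```

A pattern this simple would also rewrite acronyms, so it is only meant for experimenting with the mechanics.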
It seems you will make my day! I had totally forgotten one important
thing! Some records were formatted as capital letters, others were
not. When I fix that in LibreOffice, it then works
without any error. I will test that on our production Word app and let you
know. If it works, I would like to use it in several similar
scenarios (many typical substitutions that MS Word handles only with
difficulty).
What is more, your last note about removing caps directly…mmm…awesome.
Thanks a lot, I will save that for later. Your basic approach solves
what I need now (the \x03 problem suddenly disappeared). The second
solution, which removes capitals and uses another regex, solves the
problem too globally for me; still, it is great to learn how
one can work with that!
If I get stuck, I will let you know.
There are many ways to handle this issue. Let me suggest one possible
solution: merge all the adjacent text_runs with the same formatting
(bold, italic, underline) into one text_run, so that the xml looks like
this:
<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t xml:space="preserve">Adler, M. D. – Posner, E. A., eds.: </w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:i/>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>Cost-Benefit Analysis: Legal, Economic, and Philosophical Perspectives. Chicago: University of Chicago Press, 2001.</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t>. Chicago:</w:t>
</w:r>
<w:r>
  <w:rPr>
    <w:lang w:val="en-GB"/>
  </w:rPr>
  <w:t xml:space="preserve">Applebaum, E. – Katz, E.: Measures of Risk Aversion and Comparative Statics of Industry Equilibrium. </w:t>
</w:r>
Add the following code to the BiblioReplace class:
def merge!
  @doc.paragraphs.each do |par|
    truns = par.text_runs
    buffer = ''
    i = 0
    n = truns.length - 1
    truns.each do |trun|
      buffer << trun.text
      if i!=n && trun.formatting == truns[i+1].formatting
        trun.text = ''
      else
        trun.text = buffer
        buffer.clear
      end
      i += 1
    end
  end
end
And then call BiblioReplace#merge! before calling #dashes.
For reference I attached a working ruby script, the input file, and the
file created by the ruby script.
However, I have just come across another example where I really can’t
figure out what is going on. If you don’t mind, please try these
examples. They contain the same records but in reversed order. One passes
fine, the other has one record omitted.
The code here is an alternative to the first one. I have turned the whole
thing into a class (don’t laugh, I am trying to run it on a
server in a Sinatra session). The function for the first scenario (BIG
LETTERS) is omitted here:
#!/usr/bin/ruby
# encoding: utf-8
require "docx"
require "unicode"
class BiblioReplace < Docx::Document
  def initialize(docx_file_name)
    @docx_file_name = docx_file_name
    # single quotes, so that \. and \s reach the regex engine unchanged
    @subst02 = '([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ])\.,\s([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ])([a-záéěíóúůýčďňřšťž])'
  end
  def open_file
    @doc = Docx::Document.open(@docx_file_name)
  end
  def dashes
    @doc.paragraphs.each do |p|
      p.each_text_run do |trun|
        # single quotes, so that \1 \2 \3 are backreferences, not octal escapes
        trun.text = trun.text.gsub(/#{@subst02}/, '\1. – \2\3')
      end
    end
  end
end
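A note on the replacement string in #dashes: it has to be single-quoted (or written with doubled backslashes). In a double-quoted Ruby string, "\1" is an octal escape for the control character \x01, so gsub would insert \x01, \x02 and \x03 instead of the captured groups, which may be where a stray \x03 comes from. The sketch below shows both variants on a plain string; the constant and method names are invented, and \p{Lu} / \p{Ll} stand in for the explicit Czech character lists.

```ruby
# Hypothetical stand-in for @subst02, using Unicode classes instead of the
# explicit character lists.
INITIAL_COMMA = /(\p{Lu})\.,\s(\p{Lu})(\p{Ll})/

# Single quotes: \1, \2 and \3 reach gsub as backreferences.
def dashes_ok(text)
  text.gsub(INITIAL_COMMA, '\1. – \2\3')
end

# Double quotes: Ruby turns \1, \2 and \3 into the bytes 0x01, 0x02 and 0x03
# before gsub ever sees them.
def dashes_broken(text)
  text.gsub(INITIAL_COMMA, "\1. – \2\3")
end

p dashes_ok('Applebaum, E., Katz, E.: Measures')
p dashes_broken('Applebaum, E., Katz, E.: Measures')
```

Printing with p rather than puts makes the control characters visible in the second result.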
Getting rid of all Capitals was Nokogiri magic, but this is neither
Nokogiri nor docx magic - it’s a docx hack because docx lacks several
features (see the author’s comments on his github page).
It doesn’t change anything, but style-wise, the code above should be
def merge!
  @doc.paragraphs.each do |par|
    truns = par.text_runs
    buffer = ''
    n = truns.length - 1
    truns.each_with_index do |trun, i|
      buffer << trun.text
      if i!=n && trun.formatting == truns[i+1].formatting
        trun.text = ''
      else
        trun.text = buffer
        buffer.clear
      end
    end
  end
end
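To watch the merging logic work without a .docx at hand, it can be run on a toy model. Run below is a plain Struct invented for this sketch, not a class from the docx gem, and merge_runs! is a standalone copy of the loop above. One difference: the Struct stores the very string object it is given, so the sketch assigns buffer.dup before clearing the buffer.

```ruby
# Toy stand-in for a text run: some text plus a formatting hash.
Run = Struct.new(:text, :formatting)

# Hypothetical standalone version of merge!: collapse adjacent runs with
# identical formatting into the last run of each group.
def merge_runs!(runs)
  buffer = +''
  last = runs.length - 1
  runs.each_with_index do |run, i|
    buffer << run.text
    if i != last && run.formatting == runs[i + 1].formatting
      run.text = ''           # emptied; its text is carried over to the next run
    else
      run.text = buffer.dup   # dup, because the buffer is cleared right after
      buffer.clear
    end
  end
  runs
end

runs = [
  Run.new('Adler, ',      { italic: false }),
  Run.new('M. D.: ',      { italic: false }),
  Run.new('Cost-Benefit', { italic: true })
]
merge_runs!(runs)
p runs.map(&:text)
```

After merging, a name that was split across two runs sits in a single run again, so a per-run regex sees it as a whole.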