Docx (or htm, html) editing / substitutions

welblaud · September 1, 2014, 8:05am

Oh, Sensei, I am bowing… It seems it really works (except for one
potentially dangerous thing I have just noticed).

With capitalized names of organizations (which should stay capitalized),
it behaves strangely (Rubular does not help; it works there).

It apparently treats the whole name as the first matched character and,
depending on how the substitution is arranged, places it like a single
character! Damn, I can’t understand why!

It makes this:
A school-based intervention for diabetes risk reduction.HEALTHY STUDY
GROUP et al. (2010). N. Engl. J. Med., 29, 363 (5), 443–53.

from this:
HEALTHY STUDY GROUP et al. (2010). A school-based intervention for
diabetes risk reduction. N. Engl. J. Med., 29, 363 (5), 443–53.

I wanted to add one last function, converting initials like J.S. to
J. S., and hit the same problem: it took the whole name and returned
nonsense. Rubular does not match the name, but the app itself does. It
seems to somehow ignore the . in the regex and maybe treats it like a
simple [A-Z.]…

@subst01 = "(?<=([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ]{1}))([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ][^.a-z\s>]{0,}),([\s\r\n])([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ.])"

@subst02 = "([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ.])([;,])\s([A-ZÁÉĚÍÓÚŮÝČĎŇŘŠŤŽ])([a-záéěíóúůýčďňřšťž])"
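
For the initials step (J.S. to J. S.), here is a minimal sketch, not the substitution used in the script above. It assumes Ruby's \p{Lu} Unicode class instead of the explicit letter lists, and the helper name space_initials is made up for illustration. The point is that the dot is escaped (\.), so it only matches a literal dot, and an all-caps name without dots is left alone.

```ruby
# Hypothetical helper: insert a space between adjacent initials ("J.S." -> "J. S.").
# The lookbehind requires an uppercase letter followed by a literal dot (escaped \.),
# the lookahead requires another uppercase letter right after it.
# Both are zero-width, so gsub only inserts the space and moves nothing around.
def space_initials(text)
  text.gsub(/(?<=\p{Lu}\.)(?=\p{Lu})/, ' ')
end

puts space_initials('Novák, J.S., Černý, Á.Č. (2010).')
puts space_initials('HEALTHY STUDY GROUP et al. (2010).')
```

Because nothing is captured and re-emitted, there are no \1 or \4 back-references whose order could get mixed up.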

Everything else works perfectly.

However, great work, thank you again!

welblaud · September 2, 2014, 8:40am

Well, I have tested several different substitutions… Nothing works. As
soon as the substitution is done, it switches the line around, no matter
whether the substitution uses \1 or \4 or \1\3 and so on.

In the XML guts of the file there is:
<w:t xml:space="preserve">HEALTHY STUDY GROUP et al. (2010). </w:t>

When I add another record with the same formatting, it passes through
smoothly. I guess it must be connected with the XML line above.
Unfortunately, I have no idea how to cope with that. Damn Word! Like a box
of kittens: try to arrange them for a picture…

welblaud · September 2, 2014, 12:09pm

Another important piece of information might be that the “bug” is in the
merge! function (which I need so much!). When I don’t use it, the record
is left untouched…

welblaud · September 4, 2014, 12:23pm

Ok, thank you very much.

Here I am attaching the real file I mentioned in the PM.

The merge! function probably still gives somewhat unexpected results.

welblaud · September 2, 2014, 4:49pm

I am not Japanese, by the way, but どうも (thank you).

##################

It seems you’re running an old version of docx. The newest version
supports hyperlinks in #text_runs:

# Array of text runs contained within paragraph
def text_runs
  @node.xpath('w:r|w:hyperlink/w:r').map { |r_node|
    Containers::TextRun.new(r_node)
  }
end

Latest version

GitHub - ruby-docx/docx: a ruby library/gem for interacting with .docx files

Relevant commit:

support for paragraph's textruns in hyperlinks · ruby-docx/docx@d2b3fbb · GitHub

You could update your gem, but #merge! would get rid of hyperlinks, as
text runs do not know whether they are hyperlinks or not.

##################

If you don’t want to change the source, use a monkey patch.

Add this directly after the require 'docx' line. It adds support for
hyperlinks, and adds the entry :link to the @formatting hash.

Docx::Document.allocate
module Docx::Elements::Containers
  class Paragraph
    def text_runs
      @node.xpath('w:r|w:hyperlink/w:r').map do |r_node|
        trun = TextRun.new(r_node)
        # true when the run sits directly inside a w:hyperlink element
        trun.formatting[:link] = trun.parent.node_name == 'hyperlink'
        trun
      end
    end
  end
end

The merge function can remain unchanged. Inside the merge method, you
can use

trun.formatting[:link]

to check whether or not the text_run is a link.

##################

And please, try your best to understand the code and the gem yourself -
I won’t be around forever to fix the code when it breaks.

welblaud · September 4, 2014, 1:54pm

Well, I have run more tests, and it seems to depend on whether the dot and
space are italicized or not.

Often the dot ends up on a separate line when I use puts trun.text and
this is probably the problem (or it lands at the very beginning of the next
w:t element; then the element starts with the dot and a space). I can’t
imagine this always being handled properly in manuscripts.

Again (damn Word).

welblaud · September 4, 2014, 4:24pm

Edit: I am currently taking a look at the other file you attached,
Hlavacek_Maj.docx

Windows and Microsoft are the source of all evil.

With smart tags enabled, Microsoft Word attempts to recognize certain types
of data in a document (for example, dates or names) and automatically makes
such text a smart tag, visually indicated as a purple dotted underline.
Clicking on a smart tag is the selection-based search command to bring up a
list of possible actions for that data type.

The xml of the file Hlavacek_bibliografie.docx looks like this.

 <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags"
 w:element="place">
   <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags"
   w:element="PlaceType">
     <w:r w:rsidRPr="00D92EF0">
       <w:rPr>
         <w:lang w:val="en-GB"/>
       </w:rPr>
       <w:t>University</w:t>
     </w:r>
   </w:smartTag>

#######

All you need to do is tweak the xpath expression in #text_runs
slightly so that it can handle these nodes.

Change the xpath from

'w:r|w:hyperlink/w:r'

to

'.//w:hyperlink/w:r|.//w:r'

Thus, the entire monkey patch now looks like this:

Docx::Document.allocate
module Docx::Elements::Containers
  class Paragraph
    def text_runs
      # .// also reaches runs nested inside w:smartTag elements
      @node.xpath('.//w:hyperlink/w:r|.//w:r').map do |r_node|
        trun = TextRun.new(r_node)
        trun.formatting[:link] = trun.parent.node_name == 'hyperlink'
        trun
      end
    end
  end
end
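
To see why the relative path misses runs wrapped in smart tags, here is a small self-contained sketch. It uses REXML from Ruby's standard library as a stand-in for Nokogiri (which the docx gem uses). The sample paragraph and the run_texts helper are made up for illustration, and the three paths are tried separately instead of as a union.

```ruby
require 'rexml/document'

W_URI = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W_NS  = { 'w' => W_URI }

# Made-up paragraph: a plain run, a run nested in two smart tags, a hyperlink run.
SAMPLE = <<~XML
  <w:p xmlns:w="#{W_URI}">
    <w:r><w:t>Charles</w:t></w:r>
    <w:smartTag w:element="place">
      <w:smartTag w:element="PlaceType">
        <w:r><w:t>University</w:t></w:r>
      </w:smartTag>
    </w:smartTag>
    <w:hyperlink><w:r><w:t>website</w:t></w:r></w:hyperlink>
  </w:p>
XML

# Hypothetical helper: texts of the runs selected by +path+, relative to the paragraph.
def run_texts(path)
  para = REXML::Document.new(SAMPLE).root
  REXML::XPath.match(para, path, W_NS).map do |run|
    REXML::XPath.first(run, 'w:t', W_NS).text
  end
end

p run_texts('w:r')              # direct child runs only
p run_texts('w:hyperlink/w:r')  # runs directly inside a hyperlink
p run_texts('.//w:r')           # every run below the paragraph, at any depth
```

Only the last path picks up the run inside the nested w:smartTag elements, which is the case the old expression could not handle.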

#######

Often the dot ends up on a separate line when I use puts trun.text and
this is probably the problem

Not really. It just means that the dot . is in a separate <w:t> tag -
and that’s exactly what #merge! should fix.

welblaud · September 4, 2014, 5:38pm

Add another line of code to support merging different font colors:

trun.formatting[:text_color] = (x = trun.node.xpath('.//w:color[@w:val]').first) ? x.attr('w:val') : ''

(That’s one line.) This adds the font color to the formatting hash.

The monkey patch now becomes:

Docx::Document.allocate
module Docx::Elements::Containers
  class Paragraph
    def text_runs
      @node.xpath('.//w:hyperlink/w:r|.//w:r').map do |r_node|
        trun = TextRun.new(r_node)
        trun.formatting[:link] = trun.parent.node_name == 'hyperlink'
        # hex color from w:color, or '' when the run has no explicit color
        trun.formatting[:text_color] = (x = trun.node.xpath('.//w:color[@w:val]').first) ? x.attr('w:val') : ''
        trun
      end
    end
  end
end

Support for different fonts can be added similarly: the node is
w:rFonts, and the attributes are w:asciiTheme and w:hAnsiTheme
(the latter covers “high ANSI” characters; fonts for East Asian text,
such as Han glyphs, are given in w:eastAsiaTheme).
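
As a rough illustration of reading the color and the font from a run, here is a standalone sketch. It uses REXML from the standard library instead of the Nokogiri calls in the patch, and the sample run, the run_property helper and the :font key are assumptions made up for this example.

```ruby
require 'rexml/document'

WORD_NS = { 'w' => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' }

# Made-up run with an explicit color and a theme font.
RUN_XML = <<~XML
  <w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:rPr>
      <w:rFonts w:asciiTheme="minorHAnsi" w:hAnsiTheme="minorHAnsi"/>
      <w:color w:val="FF0000"/>
    </w:rPr>
    <w:t>University</w:t>
  </w:r>
XML

# Hypothetical helper: value of +attr+ on the first +tag+ below the run,
# or '' when the element or the attribute is missing.
def run_property(run, tag, attr)
  node = REXML::XPath.first(run, ".//#{tag}", WORD_NS)
  (node && node.attributes[attr]) || ''
end

run = REXML::Document.new(RUN_XML).root
formatting = {
  text_color: run_property(run, 'w:color', 'w:val'),
  font:       run_property(run, 'w:rFonts', 'w:asciiTheme')
}
p formatting
```

Falling back to '' keeps the hash comparable between runs that set the property and runs that do not, which is what the merge relies on.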

Joins all text runs with the same formatting. They are considered the
same if they agree in each of the following properties.

- italic, bold, underlined

- font color

- being a hyperlink

def merge!
…
end
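
The body of merge! is not shown here, so the following is only a sketch of the idea under stated assumptions: runs are modelled as plain Structs (not the gem's TextRun), the formatting hash carries the properties listed above, and merge_runs is a made-up name.

```ruby
# Stand-in for a text run: just its text and a formatting hash.
Run = Struct.new(:text, :formatting)

# Hypothetical helper: join neighbouring runs whose formatting hashes are equal.
def merge_runs(runs)
  runs.chunk_while { |a, b| a.formatting == b.formatting }
      .map { |group| Run.new(group.map(&:text).join, group.first.formatting) }
end

plain  = { italic: false, bold: false, underline: false, text_color: '', link: false }
italic = plain.merge(italic: true)

runs = [
  Run.new('HEALTHY STUDY GROUP et al. (2010)', plain),
  Run.new('. ', plain), # the stray dot and space in their own run
  Run.new('A school-based intervention for diabetes risk reduction. ', plain),
  Run.new('N. Engl. J. Med.', italic),
  Run.new(', 29, 363 (5), 443–53.', plain)
]

merge_runs(runs).each do |r|
  puts "#{r.formatting[:italic] ? 'italic' : 'plain '} #{r.text.inspect}"
end
```

After merging, the author, year and title sit in one run, so a substitution applied per run sees them together. If the dot were italicized while its neighbours were not, it would stay in a separate run, which matches the behaviour described earlier in the thread.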