Ideas on how to determine tag

Wayne_B · March 13, 2013, 11:49am

I’m working on a script that examines a DITA XML file and tries to
determine where we put conrefs (where content is being pulled from). I
have most of the code working but I’m trying now to determine what type
of element something comes from.

All XML tags have ID numbers

this is a paragraph

list item

If I need to reference the list item in a document, for example, the id
number is used to pull that data into the other document. What the
script is trying to accomplish is to create a list of the conrefs in
each file and report on them.

It’s easy enough to determine if a conref is in a file, then open that
document to get the title of the document. But what is killing me is
trying to determine what type of element is being referenced. For
example, all I know is I’m looking for ‘a4563’, which is easy enough to
find via a .match, but what I really want to know is what element that
id number is part of: in the example of ‘a4563’ a paragraph, in the case
of ‘a124’ a list item.

I suspect that I’ll need to do some regex groupings, but my regex-fu in
this area is very weak!

Anybody have some suggestions?

Thanks,
Wayne

Wayne_B · March 13, 2013, 11:55am

I haven’t looked into this in detail, but I’ve cobbled together this
example of how you could get started:

Wayne_B · March 13, 2013, 11:59am

Wow, that would do it, since I have the id. I gotta work on my regex more!

Thanks Joel!

Wayne

Wayne_B · March 13, 2013, 12:09pm

Agreed, Nokogiri is a much better solution.

Wayne_B · March 13, 2013, 12:11pm

I agree too. Thanks for the head slap, Peter!

Wayne_B · March 13, 2013, 12:03pm

Well this works

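require 'rubygems'
require 'nokogiri'

# Sample markup; the element names and the first id are illustrative
# placeholders. Only the id 'b234567' comes from the search below.
xml = '<body>
<p id="a123456">this is a paragraph</p>
<li id="b234567">list item</li>
</body>'
doc = Nokogiri::XML(xml)

# Find the first element, of any type, whose id attribute matches
node = doc.search("//*[@id='b234567']").first
puts node
puts node.name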

Once you have the node then the name method will tell you the element
type.

NEVER USE A REGEX!!!

Wayne_B · March 13, 2013, 12:58pm

I didn’t go all exhaustive on it, just a general idea

Wayne_B · March 13, 2013, 12:46pm

Just for the record the regex you gave will have difficulty with the
following

It will give the node name as ‘fred ref=“other”’ because you are
assuming that the id attribute is the first attribute after the element
name, which may not be the case. Of course you can make the regex handle
that too. But then the regex becomes even less readable.

Wayne_B · March 13, 2013, 7:16pm

On Wed, Mar 13, 2013 at 12:58 PM, Joel P. wrote:

I didn’t go all exhaustive on it, just a general idea

Yes, and you have been shown why regexp is the wrong tool for parsing
markup of SGML heritage, especially when there is something as awesome
as Nokogiri around.

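require 'nokogiri'

# Sample markup; the element names and ids are illustrative placeholders.
dom = Nokogiri::XML <<XML
<body>
<p id="a4563">this is a paragraph</p>
<li id="a124">list item</li>
</body>
XML

# Print every id in the document next to the name of its element
dom.xpath('//*[@id]').each do |node|
  printf "%-10s %s\n", node[:id], node.name
end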

Kind regards

robert