Ideas on how to determine tag

I’m working on a script that examines a DITA XML file and tries to
determine where we put conrefs (where content is being pulled from). I
have most of the code working but I’m trying now to determine what type
of element something comes from.

All XML tags have ID numbers:

    <p id="a124">this is a paragraph</p>
    <ul>
      <li id="a4563">list item</li>
    </ul>

If I need to reference the list item in another document, for example,
the id number is used to pull that data into the other document. What
the script is trying to accomplish is to create a list of the conrefs
in each file and report on them.
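To make the goal concrete, here is a minimal sketch of the reporting step, assuming conrefs use the usual DITA form file.dita#topicid/elementid. It uses REXML only because it ships with Ruby; the post does not say which parser the script uses. The sample markup, the file name shared.dita, and the helper list_conrefs are all made up for illustration.

```ruby
require 'rexml/document'

# Hypothetical DITA fragment; the file name and ids are illustrative only.
SAMPLE = <<~XML
  <topic id="t1">
    <title>Sample topic</title>
    <body>
      <p conref="shared.dita#shared/a124"/>
      <ul>
        <li conref="shared.dita#shared/a4563"/>
        <li>local item</li>
      </ul>
    </body>
  </topic>
XML

# Return one hash per conref attribute found in the given XML string.
def list_conrefs(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//*[@conref]').map do |el|
    # Split "file.dita#topicid/elementid" into the file and the fragment.
    file, fragment = el.attributes['conref'].split('#', 2)
    { element: el.name, file: file, target_id: fragment.to_s.split('/').last }
  end
end

list_conrefs(SAMPLE).each do |ref|
  puts format('%-4s %-12s %s', ref[:element], ref[:file], ref[:target_id])
end
```

Each entry carries the file to open and the id to look for in it, which is the starting point for the element-type question below.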

It’s easy enough to determine if a conref is in a file, then open that
document to get the title of the document. But what is killing me is
trying to determine what type of element is being referenced. For
example, all I know is I’m looking for ‘a4563’. That’s easy enough to
find via a .match, but what I really want to know is what element that
id number is part of: in the example of ‘a4563’ an <li>, in the case of
‘a124’ a <p>.

    I suspect that I’ll need to do some regex groupings, but my regex-fu in
    this area is very weak!

    Anybody have some suggestions?

    Thanks,
    Wayne

  • I haven’t looked into this in detail, but I’ve cobbled together this
    example of how you could get started:

    Wow, that would do it, since I have the id. I gotta work on my regex more!

    Thanks Joel!

    Wayne

    Agreed, Nokogiri is a much better solution.

    I agree too. Thanks for the head slap, Peter!

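    Well, this works:

    require 'rubygems'
    require 'nokogiri'

    # Sample markup; the element names and ids are illustrative assumptions.
    xml = '<body>
      <p id="a123456">this is a paragraph</p>
      <ul>
        <li id="b234567">list item</li>
      </ul>
    </body>'

    doc = Nokogiri::XML(xml)

    # Find the element carrying the id, whatever its tag name is.
    node = doc.search("//*[@id='b234567']").first
    puts node
    puts node.name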

    Once you have the node then the name method will tell you the element
    type.

    NEVER USE A REGEX!!!

    I didn’t go all exhaustive on it, just a general idea :)

    Just for the record, the regex you gave will have difficulty with an
    element such as <fred ref="other" id="...">.

    It will give the node name as ‘fred ref="other"’ because you are
    assuming that the id attribute is the first attribute after the
    element name, which may not be the case. Of course you can make the
    regex handle that too. But then the regex becomes even less readable.
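To illustrate the attribute-order problem, here is a small sketch. The pattern in naive_element_name is a hypothetical stand-in, not the regex that was actually posted, and stricter_element_name is one possible way to patch it. Both helper names and the sample id are assumptions made for the example.

```ruby
# Hypothetical naive pattern: capture everything between "<" and the id attribute.
def naive_element_name(xml, id)
  match = xml.match(/<([^<>]*?)\s+id="#{Regexp.escape(id)}"/)
  match && match[1]
end

# A patched pattern: capture only the tag name, then skip any other attributes.
def stricter_element_name(xml, id)
  match = xml.match(%r{<([^\s<>/]+)(?:\s[^<>]*?)?\sid="#{Regexp.escape(id)}"})
  match && match[1]
end

puts naive_element_name('<li id="a4563">list item</li>', 'a4563')
# prints: li
puts naive_element_name('<fred ref="other" id="a4563"/>', 'a4563')
# prints: fred ref="other"
puts stricter_element_name('<fred ref="other" id="a4563"/>', 'a4563')
# prints: fred
```

Even the patched version still misses legal XML such as single-quoted attribute values (id='a4563'), which is the readability and correctness trade-off being described.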

    On Wed, Mar 13, 2013 at 12:58 PM, Joel P. wrote:

    > I didn’t go all exhaustive on it, just a general idea :)

    Yes, and you have been shown why regexp is the wrong tool for parsing
    anything of SGML heritage, especially when there is something as
    awesome as Nokogiri around. :)

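    require 'nokogiri'

    # Sample markup; the element names and ids are illustrative assumptions.
    dom = Nokogiri::XML <<-XML
    <body>
      <p id="a124">this is a paragraph</p>
      <ul>
        <li id="a4563">list item</li>
      </ul>
    </body>
    XML

    # Print every id alongside the name of the element that carries it.
    dom.xpath('//*[@id]').each do |node|
      printf "%-10s %s\n", node[:id], node.name
    end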

    Kind regards

    robert