I'm working on a script that examines a DITA XML file and tries to determine where we use conrefs (places where content is being pulled in from another file). I have most of the code working, but now I'm trying to determine what type of element a conref points to.

All XML tags have id numbers:

    <p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul>

If I need to reference the list item in another document, for example, the id number is used to pull that data into the other document.

What the script is trying to accomplish is to create a list of the conrefs in each file and report on them. It's easy enough to determine whether a conref is in a file and then open the referenced document to get its title. What is killing me is determining what type of element is being referenced. For example, all I know is that I'm looking for 'a4563'. That's easy enough to find via a .match, but what I really want to know is which element that id number belongs to: for 'a4563' an <li>, for 'a124' a <p>.

I suspect that I'll need to do some regex groupings, but my regex-fu in this area is very weak! Anybody have some suggestions?

Thanks,
Wayne
on 2013-03-13 11:49
on 2013-03-13 11:59
Wow, that would do it, since I have the id. I gotta work on my regex more!

Thanks Joel!
Wayne
on 2013-03-13 12:03
Well, this works:

    require 'rubygems'
    require 'nokogiri'

    xml = '<doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>'

    doc = Nokogiri::XML(xml)
    node = doc.search("//*[@id='b234567']").first

    puts node
    puts node.name

Once you have the node, the name method will tell you the element type.

NEVER USE A REGEX!!!!!
on 2013-03-13 12:09
Agreed, Nokogiri is a much better solution.
on 2013-03-13 12:11
I agree too. Thanks for the head slap, Peter!
on 2013-03-13 12:46
Just for the record, the regex you gave will have difficulty with the following:

    <fred ref="other" id="12">

It will give the node name as 'fred ref="other"' because you are assuming that the id attribute is the first attribute after the element name, which may not be the case. Of course you can make the regex handle that too, but then the regex becomes even less readable.
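To make the attribute-order problem concrete, here is a small sketch. The pattern is a hypothetical naive one written only for illustration (it is not necessarily the regex posted earlier in the thread), and naive_element_name is a made-up helper name. It assumes the id attribute comes directly after the element name, which is exactly the assumption that breaks.

```ruby
# Hypothetical naive pattern: captures everything between '<' and ' id="..."'.
# It silently assumes that id is the first attribute of the element.
NAIVE_ID_PATTERN = /<([^>]*?) id="([^"]*)"/

# Returns whatever the naive pattern thinks the element name is, or nil.
def naive_element_name(xml, id)
  xml.scan(NAIVE_ID_PATTERN).each do |name, found_id|
    return name if found_id == id
  end
  nil
end

# id is the first attribute: the capture really is the element name.
puts naive_element_name('<p id="a124">this is a paragraph</p>', 'a124')

# id is the second attribute: the capture drags the ref attribute along.
puts naive_element_name('<fred ref="other" id="12">', '12')
```

The first call prints p; the second prints fred ref="other" rather than fred. Patching the pattern to skip other attributes is possible, but each fix makes it harder to read, which is the argument for using a real parser instead.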
on 2013-03-13 12:58
I didn't go all exhaustive on it, just a general idea :)
on 2013-03-13 19:16
On Wed, Mar 13, 2013 at 12:58 PM, Joel Pearson <email@example.com> wrote:
> I didn't go all exhaustive on it, just a general idea :)

Yes, and you have been shown why regexp is the wrong tool for parsing SGML heritage - especially when there is something as awesome as Nokogiri around. :-)

    require 'nokogiri'

    dom = Nokogiri::XML <<XML
    <doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>
    XML

    dom.xpath('//*[@id]').each do |node|
      printf "%-10s %s\n", node[:id], node.name
    end

Kind regards

robert