I'm working on a script that examines a DITA XML file and tries to determine where we use conrefs (where content is being pulled from). I have most of the code working, but now I'm trying to determine what type of element a conref points to.

All of the XML elements have ids:

<p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul>

If I need to reference the list item in another document, for example, the id is used to pull that content into the other document. The script is meant to build a list of the conrefs in each file and report on them. It's easy enough to determine whether a conref is in a file and then open the referenced document to get its title. What is killing me is determining what type of element is being referenced.

For example, all I know is that I'm looking for 'a4563'. That's easy enough to find via a .match, but what I really want to know is which element that id belongs to: for 'a4563' that's <li>, and for 'a124' it's <p>. I suspect that I'll need to do some regex groupings, but my regex-fu in this area is very weak! Anybody have some suggestions?

Thanks,
Wayne
on 2013-03-13 11:49
on 2013-03-13 11:55
I haven't looked into this in detail, but I've cobbled together this example of how you could get started: http://www.rubular.com/r/qO6KRhE91b
on 2013-03-13 11:59
Wow, that would do it, since I already have the id. I gotta work on my regex more! Thanks Joel! Wayne
on 2013-03-13 12:03
Well this works
require 'rubygems'
require 'nokogiri'
xml = '<doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>'
doc = Nokogiri::XML(xml)
node = doc.search("//*[@id='b234567']").first
puts node
puts node.name
Once you have the node, the name method will tell you the element type.
NEVER USE A REGEX!!!!!
on 2013-03-13 12:46
Just for the record, the regex you gave will have difficulty with the following: <fred ref="other" id="12">. It will give the node name as 'fred ref="other"' because it assumes that the id attribute is the first attribute after the element name, which may not be the case. Of course you can make the regex handle that too, but then the regex becomes even less readable.
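To make the attribute-order problem concrete, here is a minimal sketch. Both patterns are hypothetical reconstructions written for this illustration; neither is the pattern from the Rubular link. The first assumes id comes right after the element name, the second captures only the name and skips any attributes in between.

```ruby
# Hypothetical pattern: captures everything between "<" and the id attribute,
# so it only gives a clean name when id is the first attribute.
def naive_element_name(xml, id)
  m = xml.match(/<([^>]+?)\s+id="#{Regexp.escape(id)}"/)
  m && m[1]
end

# Hypothetical, more tolerant pattern: capture just the element name,
# then lazily skip other attributes inside the same tag before id.
def tolerant_element_name(xml, id)
  m = xml.match(/<([A-Za-z_][\w.\-]*)\b[^<>]*?\sid="#{Regexp.escape(id)}"/)
  m && m[1]
end

tag = '<fred ref="other" id="12">'
puts naive_element_name(tag, '12')     # captures the ref attribute along with the name
puts tolerant_element_name(tag, '12')  # captures only the element name
```

Even the second pattern is still fragile: it ignores single-quoted attributes, comments, CDATA sections, and attribute values that happen to contain id="...", which is why a real parser is the safer choice.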
on 2013-03-13 19:16
On Wed, Mar 13, 2013 at 12:58 PM, Joel Pearson <lists@ruby-forum.com>
wrote:
> I didn't go all exhaustive on it, just a general idea :)
Yes, and you have been shown why regexp is the wrong tool for parsing
markup of SGML heritage, especially when there is something as awesome
as Nokogiri around. :-)
require 'nokogiri'
dom = Nokogiri::XML <<XML
<doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>
XML
dom.xpath('//*[@id]').each do |node|
printf "%-10s %s\n", node[:id], node.name
end
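For the original reporting task, the same XPath can feed a lookup table instead of a printout. The following is only a sketch under a few assumptions: it uses REXML rather than Nokogiri purely so that it runs without installing a gem (the //*[@id] query is the same), the helper names parse_conref and element_names_by_id are made up for this example, and the conref value 'lists.dita#topic1/a4563' is a hypothetical one following the usual DITA form file#topicid/elementid.

```ruby
require 'rexml/document'

# Split a conref value of the form "file.dita#topicid/elementid".
# Parts that are absent come back as nil (hypothetical helper).
def parse_conref(value)
  file, fragment = value.split('#', 2)
  topic_id, element_id = fragment.to_s.split('/', 2)
  { file: file, topic_id: topic_id, element_id: element_id }
end

# Map every id in the document to the name of the element carrying it
# (hypothetical helper).
def element_names_by_id(xml)
  doc = REXML::Document.new(xml)
  names = {}
  REXML::XPath.each(doc, '//*[@id]') do |node|
    names[node.attributes['id']] = node.name
  end
  names
end

xml = '<doc><p id="a124">this is a paragraph</p>' \
      '<ul id="b234567"><li id="a4563">list item</li></ul></doc>'

names = element_names_by_id(xml)
ref = parse_conref('lists.dita#topic1/a4563') # hypothetical conref value
puts "#{ref[:element_id]} is a <#{names[ref[:element_id]]}>"
```

In a real script the xml string would be the contents of the file named in the conref, and building the table once per target file avoids re-running the query for every reference.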
Kind regards
robert