Forum: Ruby ideas on how to determine tag

Posted by Wayne Brissette (Guest)
on 2013-03-13 11:49
(Received via mailing list)
I'm working on a script that examines a DITA XML file and tries to 
determine where we put conrefs (where content is being pulled from). I 
have most of the code working but I'm trying now to determine what type 
of element something comes from.

Every element in our XML has an id attribute, for example:

<p id="a124">this is a paragraph</p>
<ul id="b234567"><li id="a4563">list item</li></ul>

If I need to reference the list item in another document, for example, 
the id is used to pull that content into the other document. What the 
script is trying to accomplish is to create a list of the conrefs in 
each file and report on them.

It's easy enough to determine whether a conref is in a file and then 
open the referenced document to get its title. What is killing me is 
determining what type of element is being referenced. For example, all 
I know is that I'm looking for 'a4563'. That's easy enough to find via 
a .match, but what I really want to know is which element that id 
belongs to: <li> in the case of 'a4563', <p> in the case of 'a124'.

I suspect that I'll need to do some regex groupings, but my regex-fu in 
this area is very weak!

Anybody have some suggestions?

Thanks,
Wayne
Posted by Joel Pearson (virtuoso)
on 2013-03-13 11:55
I haven't looked into this in detail, but I've cobbled together this 
example of how you could get started:

http://www.rubular.com/r/qO6KRhE91b
Posted by Wayne Brissette (Guest)
on 2013-03-13 11:59
(Received via mailing list)
Wow, that would do it, since I have the id. I gotta work on my regex more!

Thanks Joel!

Wayne
Posted by Peter Hickman (Guest)
on 2013-03-13 12:03
(Received via mailing list)
Well, this works:

require 'nokogiri'

xml = '<doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>'
doc = Nokogiri::XML(xml)

# Find the first element, whatever its type, whose id attribute matches.
node = doc.search("//*[@id='b234567']").first
puts node       # the whole element, serialized
puts node.name  # the element type, here "ul"

Once you have the node then the name method will tell you the element 
type.

NEVER USE A REGEX!!!!!
Posted by Joel Pearson (virtuoso)
on 2013-03-13 12:09
Agreed, Nokogiri is a much better solution.
Posted by Wayne Brissette (Guest)
on 2013-03-13 12:11
(Received via mailing list)
I agree too. Thanks for the head slap, Peter!
Posted by Peter Hickman (Guest)
on 2013-03-13 12:46
(Received via mailing list)
Just for the record the regex you gave will have difficulty with the
following

<fred ref="other" id="12">

It will give the node name as 'fred ref="other"' because you are
assuming that the id attribute is the first attribute after the element
name, which may not be the case. Of course you can make the regex
handle that too. But then the regex becomes even less readable.
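
To make the failure concrete, here is a small sketch. The pattern behind 
the Rubular link is not reproduced in this thread, so naive below is a 
hypothetical stand-in that makes the same assumption (id directly after 
the element name), and stricter is just one possible way to patch it.

```ruby
tag = '<fred ref="other" id="12">'

# Hypothetical stand-in for the linked pattern: it assumes that id is
# the first attribute after the element name.
naive = /<(.+?) id="12"/
puts tag[naive, 1]     # fred ref="other"

# One possible patch: stop the name at the first whitespace or '>',
# then skip any other attributes before id. Already harder to read.
stricter = /<([^\s>]+)[^>]*?\sid="12"/
puts tag[stricter, 1]  # fred
```

Even the patched pattern still assumes double-quoted attribute values and 
no '>' inside any value, which is the kind of detail a real XML parser 
handles for you.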
Posted by Joel Pearson (virtuoso)
on 2013-03-13 12:58
I didn't go all exhaustive on it, just a general idea :)
Posted by Robert Klemme (robert_k78)
on 2013-03-13 19:16
(Received via mailing list)
On Wed, Mar 13, 2013 at 12:58 PM, Joel Pearson <lists@ruby-forum.com> 
wrote:
> I didn't go all exhaustive on it, just a general idea :)

Yes, and you have been shown why a regexp is the wrong tool for parsing
markup of SGML heritage, especially when there is something as awesome
as Nokogiri around. :-)

require 'nokogiri'

dom = Nokogiri::XML <<XML
<doc><p id="a124">this is a paragraph</p><ul id="b234567"><li id="a4563">list item</li></ul></doc>
XML

# Print the id and element name of every element that has an id attribute.
dom.xpath('//*[@id]').each do |node|
  printf "%-10s %s\n", node[:id], node.name
end

Kind regards

robert