[libxml]: Can't find nodes using XPath, namespaces mess

swozniak · July 31, 2009, 10:33pm

Hi,

I am having problems accessing elements in an XML document using
XPath. My XML document looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">

My XPath only works when I remove all the namespaces from the root node,
but I need to query the document without modifying the XML.

I am using:
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-mswin32]
libxml-ruby (1.1.3)

swozniak · August 1, 2009, 5:01am

Stanislaw W. wrote:

My XPath only works when I remove all the namespaces from the root node
but I do need to access it without modifying the xml.

Have you run your XML through a validator? That semicolon looks invalid
to me. m.

swozniak · August 1, 2009, 5:25am

As Matt said, the document is not well-formed XML. Try adding the
RECOVER option to the parser, which tells libxml to ignore syntax
errors like that.

swozniak · August 1, 2009, 2:01pm

Hi, this was a typo, no semicolon in there:

<?xml version="1.0" encoding="UTF-8"?>

swozniak · August 1, 2009, 5:50pm

Stanislaw W. wrote:

Then what’s the problem? XPath works:

s = <<END

<?xml version="1.0" encoding="UTF-8"?>

END
require 'rexml/document'
include REXML
doc = Document.new(s)
p XPath.match(doc, "//treenode['Root']/treenode")
#=> []

Oh, wait, you said you were using libxml:

s = <<END

<?xml version="1.0" encoding="UTF-8"?>
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">
END
require 'rubygems'
require 'xml'
doc = XML::Document.string(s)
doc.find("//treenode['Root']/treenode").each do |el|
  p el #=>
end

Sorry, I’m failing to guess what problem you’re having. Perhaps if you
showed your actual code? m.

swozniak · August 3, 2009, 12:26am

On Sat, Aug 01, 2009 at 05:33:31AM +0900, Stanislaw W. wrote:

Hi,

I am having problems accessing elements in the XML documents using
XPath. My xml document looks like that:
<?xml version="1.0" encoding="UTF-8"?>
<configuration-data
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
xsi:schemaLocation="urn:company:platform:foundation:configuration:defn:v1"
xmlns="urn:company:platform:foundation:configuration:defn:v1">

^^^^^ That last xmlns declaration says that all nodes inside this
document (unless explicitly namespaced) belong to that default namespace.
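
To make that inheritance concrete, here is a minimal sketch. It uses REXML (shipped with Ruby) rather than libxml-ruby so that it runs without extra gems. The document, the namespace URI, and the element_namespaces helper are invented for illustration and do not come from the original poster's file.

```ruby
require 'rexml/document'

# Hypothetical document: only the root element declares the default namespace.
INHERIT_XML = <<~XML
  <configuration-data xmlns="urn:example:config:v1">
    <section>
      <attribute name="timeout"/>
    </section>
  </configuration-data>
XML

# Return [element name, namespace URI] pairs for the root and all descendants.
def element_namespaces(xml)
  doc = REXML::Document.new(xml)
  pairs = [[doc.root.name, doc.root.namespace]]
  # each_recursive visits every descendant element in document order.
  doc.root.each_recursive { |el| pairs << [el.name, el.namespace] }
  pairs
end

element_namespaces(INHERIT_XML).each do |name, uri|
  puts "#{name} is in namespace #{uri}"
end
```

Every element reports the root's URI even though only the root carries the xmlns attribute, which is why an unprefixed XPath name does not match them in a conforming engine.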

My XPath only works when I remove all the namespaces from the root node
but I do need to access it without modifying the xml.

You need to register that namespace with the libxml xpath engine. I’m
not sure how you register namespaces with libxml-ruby, but with
nokogiri, I would do this:

doc = Nokogiri::XML(xml)
doc.xpath('//ns:attribute',
          'ns' => 'urn:company:platform:foundation:configuration:defn:v1')

Nokogiri will automatically register root level namespaces, so you could
also do this:

doc = Nokogiri::XML(xml)
doc.xpath('//xmlns:attribute')

I know there is a way to do this with libxml-ruby, I just don’t know the
syntax off the top of my head. Look through the libxml-ruby
documentation for “find”, and I’m sure you’ll find how to register
namespaces.
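
The same idea, binding a prefix of your choosing to the document's default namespace URI at query time, can be sketched with REXML, whose XPath.match takes a prefix-to-URI hash as its third argument. This is not the libxml-ruby syntax, only an illustration of the technique. The attribute elements and the attributes_in_namespace helper are assumptions, since the thread never shows the body of the document.

```ruby
require 'rexml/document'

CONFIG_NS = 'urn:company:platform:foundation:configuration:defn:v1'

# Hypothetical body: the real children of <configuration-data> are not shown.
CONFIG_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <configuration-data xmlns="#{CONFIG_NS}">
    <attribute name="timeout"/>
    <attribute name="retries"/>
  </configuration-data>
XML

# Find <attribute> elements that live in the namespace identified by +uri+.
def attributes_in_namespace(xml, uri)
  doc = REXML::Document.new(xml)
  # The hash maps the prefix used in the XPath expression to a namespace URI.
  REXML::XPath.match(doc, '//ns:attribute', { 'ns' => uri })
end

found = attributes_in_namespace(CONFIG_XML, CONFIG_NS)
puts found.map { |el| el.attributes['name'] }.inspect
```

Passing a URI that differs from the one declared in the document yields no matches, because the comparison is on the URI and not on the prefix.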

swozniak · August 3, 2009, 12:40am

On Sun, Aug 02, 2009 at 12:50:05AM +0900, Matt N. wrote:

include REXML
doc = Document.new(s)
p XPath.match(doc, "//treenode['Root']/treenode")
#=> []

Wow. These results are just wrong. This is a bug in REXML. In XPath,
when you do not specify a namespace for your node, that means that you
want a node with no namespace.

For example:

require 'rexml/document'

include REXML

s = <<END

<?xml version="1.0" encoding="UTF-8"?>

<!-- bike inventory -->
<inventory xmlns="http://schwinn.com/">
  <tire name="street" />
</inventory>

<!-- no namespace inventory -->
<inventory>
  <tire name="wtf" />
</inventory>

END

doc = Document.new(s)

p XPath.match(doc, "//tire")

REXML matches all three tires. Surely a car tire is not the same as a
bike tire? Using XPath, how would I query for a tire that has no
namespace (the third one) without matching the two that do belong in a
namespace (it’s possible to do this with REXML, just strange)? The
XPath used above should only match the third entry.

This is a broken implementation of XPath.
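
For contrast, a query that names the namespace explicitly is unambiguous. The sketch below selects only the bike tire by binding a prefix to the Schwinn URI. The root wrapper element, the bike prefix, and the bike_tire_names helper are assumptions added so that the two inventories shown above form one well-formed document.

```ruby
require 'rexml/document'

# <root> is a hypothetical wrapper around the two inventories shown above.
INVENTORY_XML = <<~XML
  <root>
    <!-- bike inventory -->
    <inventory xmlns="http://schwinn.com/">
      <tire name="street"/>
    </inventory>
    <!-- no namespace inventory -->
    <inventory>
      <tire name="wtf"/>
    </inventory>
  </root>
XML

# Return the names of tires that belong to the Schwinn namespace only.
def bike_tire_names(xml)
  doc = REXML::Document.new(xml)
  tires = REXML::XPath.match(doc, '//bike:tire',
                             { 'bike' => 'http://schwinn.com/' })
  tires.map { |tire| tire.attributes['name'] }
end

puts bike_tire_names(INVENTORY_XML).inspect
```

The tire in the inventory without a namespace is not returned, because its namespace URI is empty and so differs from the one bound to the bike prefix.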

Oh, wait, you said you were using libxml:

You have an error in your XML below

s = <<END
<?xml version="1.0" encoding="UTF-8"?>

                 ^ That ">" should not be there.

libxml-ruby has corrections turned on by default, so you’ve effectively
removed all namespaces from this document.

end
Sorry, I’m failing to guess what problem you’re having. Perhaps if you
showed your actual code? m.

Since the namespaces were removed, this example succeeds.

swozniak · August 3, 2009, 6:25am

Aaron P. wrote:

You have an error in your XML below

Thanks for spotting that. I must have removed the namespace and then put
it back, to see if I could duplicate the OP’s problems, and I must have
put it back wrong. I wish libxml had just complained that my XML was
bad…

You’re right; fixing the error, I can now duplicate the OP’s problem in
libxml (but not in REXML, as you also observed). And then I can solve
it:

s = <<END

<?xml version="1.0" encoding="UTF-8"?>

END
require 'rubygems'
require 'xml'
doc = XML::Document.string(s)
ns = {"xsi" => "urn:company:platform:foundation:configuration:defn:v1"}
doc.find("//xsi:treenode['Root']/xsi:treenode", ns).each do |el|
p el #=>
end

That is the desired sort of result, I take it. Notice that we register
the namespace with the XPath engine and that we actually use the
namespace prefix in our XPath expression. m.
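
One detail worth spelling out: the prefix registered for the query is only a local label. In the code above, "xsi" is bound to the document's default namespace URI, which is unrelated to the xsi prefix declared inside the XML itself. A small REXML sketch, not libxml-ruby, shows that any prefix gives the same result as long as the URI matches. The treenode layout and the nested_treenodes helper are assumptions, since the original document body is not shown. The ['Root'] predicate is left out because a non-empty string literal is always true in XPath 1.0.

```ruby
require 'rexml/document'

DEFAULT_NS = 'urn:company:platform:foundation:configuration:defn:v1'

# Hypothetical treenode layout; the thread's real XML body is not available.
TREE_XML = <<~XML
  <configuration-data xmlns="#{DEFAULT_NS}">
    <treenode name="Root">
      <treenode name="child"/>
    </treenode>
  </configuration-data>
XML

# Query nested treenodes, binding +prefix+ to the default namespace URI.
def nested_treenodes(xml, prefix)
  doc = REXML::Document.new(xml)
  path = "//#{prefix}:treenode/#{prefix}:treenode"
  REXML::XPath.match(doc, path, { prefix => DEFAULT_NS })
              .map { |el| el.attributes['name'] }
end

puts nested_treenodes(TREE_XML, 'xsi').inspect
puts nested_treenodes(TREE_XML, 'cfg').inspect
```

Both calls return the same nodes, so a neutral prefix such as cfg may be less confusing to later readers than reusing xsi.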