Can you search in REXML by attributes?

dubstep · April 1, 2011, 2:53am

Hello and thank you to all the wonderful and helpful people at this
forum. I am trying to figure out how to search through an XML file and
grab information. I have been reading the REXML tutorials but could not
see an answer to my problem in them
(http://www.germane-software.com/software/rexml/docs/tutorial.html).
The problem is I need to search by an attribute (in this case the ref)
and cannot figure out how. Here is a snippet of the XML I am trying to
extract information from:

[XML snippet not preserved; only the three IfcLengthMeasure values survive: 117.4, 119.7 and 0.]

So basically I have to start with IfcWallStandardCase and from there
work my way through the “ref”s until I get to the three IfcLengthMeasures.
I know how to grab the first ref, “i1671”, using:
XPath.match(doc, "/IfcWallStandardCase/ObjectPlacement/IfcLocalPlacement")
and some additional code.

My problem is I cannot figure out how to use this “i1671” to search the
xml and grab the next ref. This ref is the only thing linking the items
together, so it is the only thing that I can use.

Is it possible to search a document by using an attribute, and if so
how? In this case to use the ref, “i1671” to search the document for
where it is used as id=“i1671” so I can grab the next ref from there and
so on. Any help would be greatly appreciated.

Thank you all.

Sincerely,
Kyle

kyle_x · April 1, 2011, 3:27am

I’m not sure what you are after. Typically, it’s much easier to say,
“This is my XML, this is the output I want.”

require 'rexml/document'
include REXML

xml = <<XML
[XML snippet from the question; markup not preserved]
XML

doc = Document.new xml
target = XPath.match(doc, "//*[@id = 'i1671']")
p target

--output:--
[ … </>]

kyle_x · April 1, 2011, 3:45am

…the xpath:

 //*[@id = 'i1671']

finds all tags with an id attribute whose value is ‘i1671’. You might
want to check out an XPath tutorial to get specifics on XPath–rather
than the REXML docs–e.g.:

http://www.w3schools.com/xpath/xpath_syntax.asp

However, note that just because it’s possible in xpath doesn’t
necessarily mean that REXML supports it. I’m not sure how fully REXML
supports XPath.
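
A minimal sketch of the attribute search described above, using only REXML. The XML here is a hypothetical stand-in, since the markup from the question was not preserved: the element nesting and every id other than i1671 are assumptions made for illustration.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the XML in the question (assumed structure;
# only the id "i1671" and the three values come from the thread).
xml = <<~XML
  <doc>
    <IfcLocalPlacement id="i1671">
      <Location><IfcCartesianPoint ref="i1667"/></Location>
    </IfcLocalPlacement>
    <IfcCartesianPoint id="i1667">
      <Coordinates>
        <IfcLengthMeasure>117.4</IfcLengthMeasure>
        <IfcLengthMeasure>119.7</IfcLengthMeasure>
        <IfcLengthMeasure>0.</IfcLengthMeasure>
      </Coordinates>
    </IfcCartesianPoint>
  </doc>
XML

doc = REXML::Document.new(xml)

# //*[@id='i1671'] selects every element, at any depth, whose id is i1671.
matches = REXML::XPath.match(doc, "//*[@id='i1671']")
matched_names = matches.map(&:name)

# //*[@ref] selects every element that carries a ref attribute at all.
ref_values = REXML::XPath.match(doc, "//*[@ref]").map { |el| el.attributes['ref'] }

puts matched_names.inspect
puts ref_values.inspect
```

XPath.match returns an array of elements, so the usual Element methods (name, attributes, text) are available on each result.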

kyle_x · April 1, 2011, 10:13am

It is not entirely clear what you want. Do you want to look for all
“ref” instances and find elements they are referring to? Or do you
want to do some kind of graph traversal where you start with a
particular element and follow every ref attribute?

If the latter you can for example do a BFS.

10:11:30 Temp$ ./rx.rb
--- VISIT:

--- VISIT:

117.4
119.7
0.

10:11:43 Temp$ cat rx.rb
#!/bin/env ruby19

require 'rexml/document'

doc = REXML::Document.new(DATA.read)

# BFS
queue = %w{i1671}

until queue.empty?
  id = queue.shift

  REXML::XPath.each(doc, "//*[@id='#{id}']") do |e|
    puts "--- VISIT:", e

    REXML::XPath.each(e, './/*[@ref]') do |child|
      next_id = child.attribute('ref') and queue.push(next_id)
    end
  end
end

__END__
[XML data section not preserved; it ended with the IfcLengthMeasure values 117.4, 119.7 and 0.]
10:11:47 Temp$

Kind regards

robert
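
One possible variation on the BFS above, as a sketch rather than Robert's method: it remembers which ids were already visited, so a document whose refs point back at each other cannot loop forever. The sample XML, the BackReference element and the id i1667 are hypothetical and exist only to show the guard.

```ruby
require 'rexml/document'
require 'set'

# Hypothetical sample data; BackReference is invented to create a cycle.
xml = <<~XML
  <doc>
    <IfcLocalPlacement id="i1671">
      <Location><IfcCartesianPoint ref="i1667"/></Location>
    </IfcLocalPlacement>
    <IfcCartesianPoint id="i1667">
      <Coordinates>
        <IfcLengthMeasure>117.4</IfcLengthMeasure>
        <IfcLengthMeasure>119.7</IfcLengthMeasure>
        <IfcLengthMeasure>0.</IfcLengthMeasure>
      </Coordinates>
      <BackReference ref="i1671"/>
    </IfcCartesianPoint>
  </doc>
XML

# Breadth-first walk over id/ref links, returning elements in visit order.
def follow_refs(doc, start_id)
  queue   = [start_id]
  visited = Set.new
  order   = []

  until queue.empty?
    id = queue.shift
    next unless visited.add?(id) # add? returns nil if the id was seen before

    REXML::XPath.each(doc, "//*[@id='#{id}']") do |element|
      order << element
      # Queue every ref found below the element we just visited.
      REXML::XPath.each(element, ".//*[@ref]") do |child|
        queue.push(child.attributes['ref'])
      end
    end
  end

  order
end

doc = REXML::Document.new(xml)
visited_names = follow_refs(doc, 'i1671').map(&:name)
puts visited_names.inspect
```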

kyle_x · April 1, 2011, 7:20pm

On Fri, Apr 1, 2011 at 6:27 PM, Kyle X. [email protected] wrote:

target = XPath.match(doc, "//*[@id = 'i1671']")
p target

It produces the following output as you said it would.
--output:--
[ … </>]

But I cannot figure out how to do anything with this from here to get to
the next point, and eventually be able to grab the three values.

I usually use Nokogiri to handle XML documents, and find css selectors
easier than XPath. I’d do it like this:

require 'nokogiri'

doc = Nokogiri::XML(<<END
[XML snippet from the question; markup not preserved]
END
)

reference = doc.css("#i1671 Location IfcCartesianPoint").attribute("ref").value
doc.css("##{reference} Coordinates IfcLengthMeasure").map {|element| element.text}

This returns: => ["117.4", "119.7", "0."]

I’m pretty sure it’s easy to translate these two expressions to XPath,
something like:

reference = REXML::XPath.first(doc,
  "//*[@id='i1671']/Location/IfcCartesianPoint").attribute("ref").value
elements = REXML::XPath.match(doc,
  "//*[@id='#{reference}']/Coordinates/IfcLengthMeasure").map {|element| element.text}

I don’t know if there’s a better way, but the above works for me.
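
For anyone who wants to run the two-step REXML lookup end to end, here is a small sketch that wraps it in a method. The XML is a hypothetical stand-in (the original markup was not preserved), and the method name coordinates_for and the id i1667 are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical stand-in document (assumed structure).
xml = <<~XML
  <doc>
    <IfcLocalPlacement id="i1671">
      <Location><IfcCartesianPoint ref="i1667"/></Location>
    </IfcLocalPlacement>
    <IfcCartesianPoint id="i1667">
      <Coordinates>
        <IfcLengthMeasure>117.4</IfcLengthMeasure>
        <IfcLengthMeasure>119.7</IfcLengthMeasure>
        <IfcLengthMeasure>0.</IfcLengthMeasure>
      </Coordinates>
    </IfcCartesianPoint>
  </doc>
XML

# Follow one ref from the element with the given id to its coordinate values.
def coordinates_for(doc, start_id)
  # Step 1: find the referencing element below the starting element.
  point = REXML::XPath.first(doc, "//*[@id='#{start_id}']/Location/IfcCartesianPoint")
  return [] unless point && point.attributes['ref']

  # Step 2: use the ref value as the id of the element to look up next.
  reference = point.attributes['ref']
  REXML::XPath.match(
    doc, "//*[@id='#{reference}']/Coordinates/IfcLengthMeasure"
  ).map(&:text)
end

doc = REXML::Document.new(xml)
coordinates = coordinates_for(doc, 'i1671')
missing     = coordinates_for(doc, 'no-such-id')
puts coordinates.inspect
```

Returning an empty array for an unknown id is a design choice of this sketch; raising an error would be just as reasonable.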

Jesus.

kyle_x · April 1, 2011, 9:22pm

reference = REXML::XPath.first(doc,
  "//*[@id='i1671']/Location/IfcCartesianPoint").attribute("ref").value

elements = REXML::XPath.match(doc,
  "//*[@id='#{reference}']/Coordinates/IfcLengthMeasure").map {|element| element.text}

Thank you Jesus, that is exactly what I was looking for and it works
great!

kyle_x · April 1, 2011, 6:27pm

Robert K. wrote in post #990336:

On Fri, Apr 1, 2011 at 2:53 AM, Kyle X. [email protected] wrote:

It is not entirely clear what you want. Do you want to look for all
“ref” instances and find elements they are referring to? Or do you
want to do some kind of graph traversal where you start with a
particular element and follow every ref attribute?

Hi and thank you for your help. I am sorry if what I wrote was unclear.
My goal is to start at a given element and eventually grab the three
IfcLengthMeasure text values that are associated with it, and put them
into an array.

If the latter you can for example do a BFS.

I will give this a try.

Dear 7stud. Using the XPath command:

doc = Document.new xml
target = XPath.match(doc, "//*[@id = 'i1671']")
p target

It produces the following output as you said it would.
--output:--
[ … </>]

But I cannot figure out how to do anything with this from here to get to
the next point, and eventually be able to grab the three values.

kyle_x · April 4, 2011, 8:44pm

Thank you too Robert. This is a great explanation.

kyle_x · April 4, 2011, 9:13am

On Fri, Apr 1, 2011 at 9:22 PM, Kyle X. [email protected] wrote:

Thank you Jesus, that is exactly what I was looking for and it works
great!

Here’s the same with REXML.

09:12:25 Temp$ ./rx2.rb
Approach 1
117.4
119.7
0.
Approach 2
117.4
119.7
0.
Approach 3
117.4
119.7
0.0
09:12:45 Temp$ cat rx2.rb
#!/bin/env ruby19

require 'rexml/document'

doc = REXML::Document.new(DATA.read)

puts 'Approach 1'

REXML::XPath.each(doc, "//*[@id='i1671']//@ref") do |e|
  REXML::XPath.each(doc, "//*[@id='#{e.value}']//IfcLengthMeasure/text()") do |lm|
    puts lm
  end
end

puts 'Approach 2'

refs = REXML::XPath.each(doc, "//*[@id='i1671']//@ref").map {|e| e.value}
values = refs.map {|r| REXML::XPath.each(doc,
  "//*[@id='#{r}']//IfcLengthMeasure/text()").to_a}.flatten
puts values

puts 'Approach 3'

refs = REXML::XPath.each(doc, "//*[@id='i1671']//@ref").map {|e| e.value}
values = refs.map {|r| REXML::XPath.each(doc,
  "//*[@id='#{r}']//IfcLengthMeasure/text()").map {|x| x.value.to_f}}.flatten
puts values

__END__
[XML data section not preserved; it ended with the IfcLengthMeasure values 117.4, 119.7 and 0.]
09:12:46 Temp$

Have fun!

Cheers

robert

kyle_x · April 11, 2011, 1:27am

Please disregard the previous post. I figured out how to load the
files, but it does not appear to be reading correctly.

Using:
fname = File.open("C:/Users/Kyle/Desktop/CSUF/Research Winter 11/IFXCML/automation trials/one.xml")
$doc = Nokogiri::XML(fname)
reference = $doc.css(" IfcCartesianPoint Coordinates IfcLengthMeasure").first

This produces an output of nil when it should be “117.4”, correct? What
is going wrong here?

kyle_x · April 11, 2011, 1:01am

Thank you for the help. I had been using REXML and it was working fine
with one exception: it can be very slow. So now I am trying to use
Nokogiri, and am running into a very simple error: I cannot load XML
files. From the Nokogiri website I have been trying what is in their
tutorial:
f = File.open("blossom.xml")
doc = Nokogiri::XML(f)
f.close

But regardless of what I put in the ("…") it returns:
Error: #<Errno::EINVAL: C:/Program Files (x86)/Google/Google SketchUp 8/Plugins/examples/auto.rb:11:in `read': Invalid argument - c:ourwalls.xml>
or
Error: #<Errno::ENOENT: C:/Program Files (x86)/Google/Google SketchUp 8/Plugins/examples/auto.rb:11:in `read': No such file or directory - fourwalls.xml>

If I am simply trying to open an XML file named file.xml located at C:\,
what would I put to open it? I have tried many things such as
f = File.open("C:\file.xml") with no luck. What do I need to
do to open this? Do you simply need to change the \ to /?

kyle_x · April 11, 2011, 8:57pm

Thank you for your reply. When I try to read the file I have, it keeps
returning nil values and thus doesn’t work. But when I copy and paste
the XML you have written over the file I am trying to read, then it does
work. I understand that the path is slightly different, but using the
xpath command:

doc.xpath("//IfcCartesianPoint/Coordinates/IfcLengthMeasure").each {|el| puts el.text}

it should skip ahead to the first appearance of IfcCartesianPoint, much
the same as it works using REXML XPath, no? This same string of
IfcCartesianPoint/Coordinates/IfcLengthMeasure appears in this file.
Based on the documentation here -

(Searching a XML/HTML document - Nokogiri)

I think it should be working but it always returns nil.

I have attached the xml file I am trying to read and was wondering if
you could see where my error is occurring. The first instance of
IfcCartesianPoint/Coordinates/IfcLengthMeasure appears on line 225.

Maybe it’s the way you are passing the file to Nokogiri::XML? By the
way, in your version you are not closing the file handle. If you want to
pass Nokogiri the file instead of reading it yourself you can do:

doc = nil
File.open("one.xml") {|f| doc = Nokogiri::XML(f)}

or

doc = File.open("one.xml") {|f| Nokogiri::XML(f)}

This way, the file is properly closed.

Jesus.

Is there an advantage to using .open vs .read? The program I am writing
has to grab lots of information from the XML, maybe 300 items; would it
make a difference in speed to use one vs the other? Also, to read
a file at, say, c:\one.xml I have to write:
doc = File.open("/one.xml") {|f| Nokogiri::XML(f)}
No other form will read it, including "\one.xml".

Thank you again for your time, you have been most helpful and it is
greatly appreciated.

kyle_x · April 12, 2011, 10:17am

On Mon, Apr 11, 2011 at 8:57 PM, Kyle X. [email protected] wrote:

Thank you for your reply. When I continue to try and read the file I
have it keeps returning nil values and thus doesn’t work. But when I
copy and paste the xml you have written over the file I am trying to
read then it does work.

The difference is that you have namespaces in your file. Check this URL:

http://tenderlovemaking.com/2009/04/23/namespaces-in-xml/

In order to make this work, you can do something like this:

require 'nokogiri'

doc = Nokogiri::XML(File.read("one.xml"))
doc.collect_namespaces.each {|key,value| puts "#{key} => #{value}"}
doc.css("uosNS|IfcCartesianPoint uosNS|Coordinates uosNS|IfcLengthMeasure",
  {"uosNS" => "http://www.iai-tech.org/ifcXML/IFC2x3/FINAL"}).each {|el| puts el.text}

(I added a line that shows all namespaces in the document). All nodes
under the uos node inherit the namespace referenced by the url you see
in the code, so in order to search for nodes within the uos node, you
need to specify the namespace.
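
The same idea, binding a prefix of your own choosing to the namespace URL and using that prefix in every step of the path, can be sketched with REXML so that it runs without extra gems. This is not the Nokogiri call from above, only an analogous example; the document below is a hypothetical stand-in with a uos element that declares the default namespace, as described in this thread.

```ruby
require 'rexml/document'

# Hypothetical stand-in: the namespace is declared on uos, not on the root.
xml = <<~XML
  <doc>
    <uos xmlns="http://www.iai-tech.org/ifcXML/IFC2x3/FINAL">
      <IfcCartesianPoint id="i1667">
        <Coordinates>
          <IfcLengthMeasure>117.4</IfcLengthMeasure>
          <IfcLengthMeasure>119.7</IfcLengthMeasure>
          <IfcLengthMeasure>0.</IfcLengthMeasure>
        </Coordinates>
      </IfcCartesianPoint>
    </uos>
  </doc>
XML

doc = REXML::Document.new(xml)

# The prefix name (uosNS) is arbitrary; only the URL has to match the document.
namespaces = { "uosNS" => "http://www.iai-tech.org/ifcXML/IFC2x3/FINAL" }

values = REXML::XPath.match(
  doc,
  "//uosNS:IfcCartesianPoint/uosNS:Coordinates/uosNS:IfcLengthMeasure",
  namespaces
).map(&:text)

# Binding the prefix to a different URL finds nothing, because the element
# names then belong to another namespace.
wrong = REXML::XPath.match(
  doc, "//uosNS:IfcLengthMeasure", { "uosNS" => "http://example.com/other" }
)

puts values.inspect
puts wrong.length
```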

Is there an advantage to using .open vs .read?

read reads the whole file in memory. Passing a file handler to
nokogiri will probably make no difference, because most likely it’s
reading the full file to memory too.

The program I am writing
has to grab lots of information from the xml, maybe 300 items, would it
make a difference in speed to use one vs the other?

The only answer to this question is to benchmark.

Also for me to read
a file at say c:\one.xml for it to read I have to write -
doc = File.open("/one.xml") {|f| Nokogiri::XML(f)}
Another form will not read including "\one.xml"

I have no experience in Windows, but I think forward slashes should
always work (no idea about the drive letter, though).
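
A small sketch of why the backslash forms can fail: in a double-quoted Ruby string, \f is the escape for a form-feed character, so the backslash and the letter f never reach File.open as written. That would be consistent with the c:ourwalls.xml message earlier in the thread, although that is an inference and not something confirmed there. The file names below are only illustrative and nothing is opened.

```ruby
# Double quotes: "\f" is parsed as a single form-feed character.
broken = "C:\file.xml"

# Single quotes keep the backslash and the f as two separate characters.
single = 'C:\file.xml'

# Forward slashes need no escaping at all.
forward = "C:/file.xml"

# File.join builds the forward-slash form from its parts.
joined = File.join("C:", "file.xml")

puts broken.length   # one character shorter than the intended path
puts single.length
puts broken.include?("\f")
puts joined
```

Doubling the backslash inside double quotes ("C:\\file.xml") produces the same string as the single-quoted form.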

Jesus.

kyle_x · April 12, 2011, 11:46am

2011/4/12 Jesús Gabriel y Galán [email protected]:

In order to make this work, you can do something like this:

require 'nokogiri'

doc = Nokogiri::XML(File.read("one.xml"))

The alternative is

doc = File.open("one.xml") {|io| Nokogiri::XML(io)}

Is there an advantage to using .open vs .read?

read reads the whole file in memory. Passing a file handler to
nokogiri will probably make no difference, because most likely it’s
reading the full file to memory too.

It will have to read the whole file, but it may make a crucial
difference whether it does so in one go or in chunks. Large files
might not even be readable with the File.read approach. If you pass
the file as a single string there is no choice, but if you pass the
File instance Nokogiri can decide what to do. This is more efficient.
Note also that because of buffering, small files will need just one
(or a few) IO operations anyway.

The program I am writing
has to grab lots of information from the xml, maybe 300 items, would it
make a difference in speed to use one vs the other?

The only answer to this question is to benchmark.

I don’t think the file loading influences access speed. Once the file
is loaded into a object structure IO is over and all operations are in
memory plus the model of the file will be the same regardless whether
you read in one big chunk or in smaller ones.

The two approaches to loading the file do most likely have different
performance characteristics though.
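
A rough way to check this on your own data, sketched with REXML and the standard Benchmark module so that it runs without gems (Nokogiri's numbers will differ, so treat this only as a template). The temporary file and its content are hypothetical.

```ruby
require 'rexml/document'
require 'benchmark'
require 'tempfile'

# Hypothetical tiny input file; substitute the path of a real document.
file = Tempfile.new(['one', '.xml'])
file.write("<doc><IfcLengthMeasure>117.4</IfcLengthMeasure></doc>")
file.close
path = file.path

doc_from_string = nil
doc_from_io     = nil

# Variant 1: read everything into a String, then parse the String.
string_time = Benchmark.realtime do
  doc_from_string = REXML::Document.new(File.read(path))
end

# Variant 2: hand the open File to the parser; the block closes the file.
io_time = Benchmark.realtime do
  doc_from_io = File.open(path) { |f| REXML::Document.new(f) }
end

# Both variants build the same in-memory model, so queries behave the same.
from_string = REXML::XPath.match(doc_from_string, "//IfcLengthMeasure").map(&:text)
from_io     = REXML::XPath.match(doc_from_io, "//IfcLengthMeasure").map(&:text)

puts format("File.read: %.6fs, File.open: %.6fs", string_time, io_time)
puts from_string == from_io

file.unlink
```

On a file this small the timings are noise; the comparison only becomes meaningful with documents of realistic size and several repetitions.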

Kind regards

robert

kyle_x · April 11, 2011, 9:38am

On Mon, Apr 11, 2011 at 1:27 AM, Kyle X. [email protected] wrote:

fname = File.open("C:/Users/Kyle/Desktop/CSUF/Research Winter 11/IFXCML/automation trials/one.xml")
$doc = Nokogiri::XML(fname)
reference = $doc.css(" IfcCartesianPoint Coordinates IfcLengthMeasure").first

This produces an output of nil when it should be “117.4”, correct? What
is going wrong here?

This works for me:

$ cat noko.rb && ruby -rubygems noko.rb
require 'nokogiri'

doc = Nokogiri::XML(File.read("one.xml"))
reference = doc.css("IfcCartesianPoint Coordinates IfcLengthMeasure").first
puts reference.text

117.4

But if I leave a space before IfcCartesianPoint in the call to the css
method I get a parser error (`on_error': unexpected ' ' after ''
(Nokogiri::CSS::SyntaxError)). This is my file one.xml:

[XML file content not preserved; it contained the IfcLengthMeasure values 117.4, 119.7 and 0.]

This works too:

doc.css("IfcCartesianPoint Coordinates IfcLengthMeasure").each {|el| puts el.text}

Produces:

117.4
119.7
0.

Maybe it’s the way you are passing the file to Nokogiri::XML? By the
way, in your version you are not closing the file handle. If you want to
pass Nokogiri the file instead of reading it yourself you can do:

doc = nil
File.open("one.xml") {|f| doc = Nokogiri::XML(f)}

or

doc = File.open("one.xml") {|f| Nokogiri::XML(f)}

This way, the file is properly closed.

Jesus.

kyle_x · April 12, 2011, 10:37pm

On Tue, Apr 12, 2011 at 9:39 PM, Kyle X. [email protected] wrote:

After reading the link you provided: to make any XML file readable using
Nokogiri, if it has a namespace you must include that each time you are
trying to grab information from it, correct? (The namespace is the URL
in xmlns="…", correct?)

Yes.

I tried this using the .xml I posted and it does not work. Is this
because the xmlns is not in the first line immediately following
<?xml version="1.0"?>? In turn making it necessary for every query to
include {"uosNS" => "http://www.iai-tech.org/ifcXML/IFC2x3/FINAL"}?

Correct. The root node of your XML is the doc tag, which declares
namespaces, but the uos tag has its own namespace too. nokogiri will
register the ones present in the root node, as the article says. The
children nodes of uos inherit the namespace declared in the uos tag,
so this is what you have to use to search. I have not checked the
behaviour about the automatic registering of namespaces, but reading
the article this is how I understand it.

Jesus.

kyle_x · April 19, 2011, 10:42pm

Hello, Nokogiri has been going well for me but recently I have been
having trouble trying to read some xml, and from my reading online I
cannot find the proper way to write it using Nokogiri. Here are the two
lines I am having trouble with:

I am trying to get the reference for exp:pos="1", and I had this working
using REXML with the following:

XPath.match( $doc, "//IfcWallStandardCase//*[@pos='1']" )

With Nokogiri I can get it to read both pos 0 and 1, using .css and
.xpath:

$doc_noko.css("uosNS|IfcWallStandardCase uosNS|IfcShapeRepresentation",
  {"uosNS" => $http})
and
$doc_noko.xpath("//uosNS:IfcWallStandardCase//uosNS:IfcShapeRepresentation",
  {"uosNS" => $http})

But I cannot figure out how to get it to read only pos=1 using either
method, and continuously get an error or nil.

[XML snippet not preserved; only the values 1., 0. and 0. survive]

The issue I am having here is that I am reading this with Nokogiri using
.xpath, and the colon in exp:double is giving me trouble since the xpath
is written:

ref = "i1574"
$doc_noko.xpath("//uosNS:*[@id='#{ref}']//uosNS:exp:double-wrapper",
  {"uosNS" => $http}).map {|element| element.text}

I am guessing that it would be easier to use .css here rather than
.xpath, so I have tried it but cannot seem to get it correct.
Trying:

ref = "i1574"
$doc_noko.css("uosNS|#{ref} uosNS|exp:double-wrapper", {"uosNS" =>
  $http}).map {|element| element.text}

As I read in a previous post, you call the ref using #{} for css,
but this returns nil for me.

Any ideas?

kyle_x · April 12, 2011, 9:39pm

Thank you for the response. After reading the link you provided: to make
any XML file readable using Nokogiri, if it has a namespace you must
include that each time you are trying to grab information from it,
correct? (The namespace is the URL in xmlns="…", correct?) Like you did
here:

doc.css("uosNS|IfcCartesianPoint uosNS|Coordinates uosNS|IfcLengthMeasure",
  {"uosNS" => "http://www.iai-tech.org/ifcXML/IFC2x3/FINAL"}).each {|el| puts el.text}

In the link it says that, “Even though using namespaces is essential
when searching an XML document, Nokogiri tries to help out. If there are
namespaces declared on the root node of a document, Nokogiri will
automatically register those for you. You will still have to use the
prefix when searching the document, but the URL registration is done for
you.”

Making both:
doc.xpath('//xmlns:tire',
  'xmlns' => 'http://alicesautosupply.example.com/'
)
doc.xpath('//xmlns:tire')
equal.

I tried this using the .xml I posted and it does not work. Is this
because the xmlns is not in the first line immediately following
<?xml version="1.0"?>? In turn making it necessary for every query to
include {"uosNS" => "http://www.iai-tech.org/ifcXML/IFC2x3/FINAL"}?

It will have to read the whole file but it may make a crucial
difference whether it does so in one go or in chunks. Large files
might not even be readable with the File.read approach. If you pass
the file as a single string there is no choice but if you pass the
File instance nokogiri can decide what to do. This is more efficient.
Note also that because of buffering small files will have just one
(or a few) IO operations anyway.

I will try both and see how each performs against each other and if both
work properly.

Thank you both for your replies. Your help with this has been
invaluable.

Sincerely,
Kyle

kyle_x · April 20, 2011, 9:22am

This has sufficiently deviated from the title of the thread, so to make
the information more relevant for future searchers I am going to make a
new post and end this one. I hope this is acceptable.