This code is conceptually what I want to do with the Nokogiri code below:
s1 = [1,2,3]; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1, s2, s3]
str.each do |itm|
  puts "**********"
  puts " #{itm[2]}"   # select one item (index 2) from each of s1, s2, s3
  puts "**********"
end
Results as expected
3
6
9
I have an HTML page with multiple <table> … </table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item, <td class="itemNumbr" …>, from the
table (equivalent to extracting one element from any of s1, s2, s3).
I initially thought this was straightforward, but I am missing
something very fundamental when I move the concept to Nokogiri objects.
---------------------- NOKOGIRI CODE ----------------------
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))  # file containing web page
doc.xpath("//table[@class='result']").each do |node|  # select a table
  puts "**********"
  puts node.to_html                            # as expected
  puts node.xpath("//td[@class='itemNumbr']")  # 15 per each
  puts "**********"
end
---------------------- NOKOGIRI CODE ----------------------
The output below displays the table HTML as expected, but not the
itemNumbrs:
<table ..................../table> for item 1
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr..... 2....../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 2
<td class ="itemNumbr..... 1....../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr..... 15....../td>
**********
**********
<table ..................../table> for item 3
<td class ="itemNumbr.....<b1> 1.</b>...../td>
.......
The tables are output as expected, each followed by itemNumbr 1 to 15
sequentially.
The node.xpath("//td[@class='itemNumbr']") call acts as if node contains
all 15 tables, but the output indicates otherwise. I think node should
always contain HTML for a single table only, but I appear to be wrong.

Also, if I put a subscript on the first xpath
doc.xpath("//table[@class='result'][5]").each do |node|
to ensure only one table is found, I still get itemNumbrs for all 15
table elements.

WHAT AM I MISSING HERE
Please do not shout. You are probably missing how XPath works. With
the queries given by you above you will always get *all* td nodes with
class "itemNumbr" in the document. You need a two-level approach
using *relative* queries:
irb(main):004:0> html = "<body>" << (1..3).map {|i|
"<table><td>#{i}</td></table>"}.join(" ") << "</body>"
=> "<body><table><td>1</td></table> <table><td>2</td></table> <table><td>3</td></table></body>"
I realized early on (even though I cannot figure out why) that I had to
save off each table in the "do" block before processing it to get around
this problem. I tried many things, including an array, which worked fine
to save the <table>s, but I could not xpath the saved <table>s.
What I did not realize was that if I took the <table> raw it was no
longer valid XML. What twigged me was your code adding in <body> to give
the correct HTML header.
Now, if you can, shed some light on my underlying question.
Above I embed the node in a new HTML object; the new object contains
nothing other than a single <table>.
Also, if I print the node it contains only a single <table>.
Yet if I attempt to execute
puts node.xpath("//td[@class='itemNumbr']")
it finds the "itemNumbr" for all <table> items even though they do not
exist in node. Does node actually contain the entire HTML, just not
visibly?
  puts "**********"
  puts doc2.xpath("//td[@class='itemNumbr']")
  puts "**********"
end
As I said: you need a relative XPath. Your problem is the global
XPath. You need to shave off the leading "//" or prefix it with ".".
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "**********"
  # relative to node; matches only td elements that are direct children of the table
  puts node.xpath("td[@class='itemNumbr']")
  puts "**********"
end
doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "**********"
  # relative to node; matches td elements at any depth below the table
  puts node.xpath(".//td[@class='itemNumbr']")
  puts "**********"
end
My issue is not with anything you have shown me, and not with my ability
to get the code working or get the next piece of code working predictably.
My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure, since I am able to return all the
itemNumbr elements with
node.xpath("//td[@class='itemNumbr']")
What I am not able to do is output proof to convince myself that the
node contains anything other than the selected table (i.e. inspect,
parse, to_s, to_html all return the single table).
This is my last post. I really have no issue other than the conceptual
one above, and I will revisit the contents of node again when I have
time.
From this posting of yours it's not clear to me what issue you have.
Any XML or HTML parser that rips a document apart and builds a DOM of
some kind will create a nested, strictly hierarchical tree structure.
The only thing that may seem odd is that XPath queries beginning with
"//" search through the complete document regardless of the node you
invoke the method on.
The problem might lie in the term "contains". Conceptually one would
probably say that a node contains all its sub nodes. Technically a
node can also (indirectly) contain the whole document. This happens
if you include a reference to the parent node or the document. Here's
an example with parent node inclusion.
If you add a line "pp ch" to the iteration code at the end of the
file, you will see that each node "contains" all the rest of the
document.
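The example file referred to above is not preserved in this thread. As a stand-in, here is a hypothetical pure-Ruby sketch of the same idea; TreeNode and its methods are invented for illustration and are not Nokogiri's implementation. Each node holds its children plus a reference to its parent, so the whole tree is reachable from any node.

```ruby
# Hypothetical minimal tree node with a parent back-reference.
class TreeNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
    parent.children << self if parent
  end

  # Follow the parent links up to the top of the tree.
  def root
    parent ? parent.root : self
  end

  # Names of this node and of everything below it.
  def subtree_names
    [name] + children.flat_map(&:subtree_names)
  end
end

body = TreeNode.new("body")
t1 = TreeNode.new("table1", body)
TreeNode.new("td1", t1)
t2 = TreeNode.new("table2", body)
TreeNode.new("td2", t2)

own_names = t1.subtree_names       # what the node "contains" conceptually
all_names = t1.root.subtree_names  # what it can reach through its parent link

p own_names  # ["table1", "td1"]
p all_names  # ["body", "table1", "td1", "table2", "td2"]
```

A relative query corresponds to walking subtree_names from the node itself; a "//" query corresponds to going to root first.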
From everything I did to verify the contents of node (parse, to_s,
to_html), I thought it contained a single table selected from the HTML
page, and I could not prove otherwise.
That is why I was not looking at relative paths: I thought I was
dealing with only a single table on each iteration. That is why I
mistakenly used the array concept, which is obviously not a parallel.
Obviously I still do not understand the contents of node.