Help missing something BASIC

okkezSS · October 20, 2010, 3:40am

This code is conceptually what I want to do with the Nokogiri code below:
s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
  puts "*********"
  puts " #{itm[2]}" # Select middle item from each s1, s2, s3
  puts "*********"
end
Results as expected

3

6

9

I have an HTML page with multiple <table>…</table> elements
(equivalent to str above) and want to process each table (equivalent to
s1, s2, s3) and extract one item <td class="itemNumbr …> from the
table (equivalent to extracting the middle element in any of s1 s2 s3).

I initially thought this was straightforward, but I am missing
something very fundamental when I move the concept to Nokogiri objects.
---------------------- NOKOGIRI CODE ----------------------
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT")) # file containing web page
doc.xpath("//table[@class='result']").each do |node| # select a table
  puts "**********"
  puts node.to_html # as expected
  puts node.xpath("//td[@class='itemNumbr']") # 15 per each
  puts "**********"
end
---------------------- NOKOGIRI CODE ----------------------

The output below displays the table HTML as expected, but not the itemNumbrs:

for item 1
<td class ="itemNumbr.....<b1> 1.</b>...../td>
<td class ="itemNumbr..... 2....../td>
......
<td class ="itemNumbr.....<b1> 15.</b>...../td>
**********
**********
<table ..................../table> for item 2
<td class ="itemNumbr..... 1....../td>
<td class ="itemNumbr.....<b1> 2.</b>...../td>
......
<td class ="itemNumbr..... 15....../td>
**********
**********

for item 3
<td class ="itemNumbr.....<b1> 1.</b>...../td>
.......

The tables are output as expected: tables with itemNumbr 1 to 15,
sequentially.
The node.xpath("//td[@class='itemNumbr']") acts as if node contains all
15 tables, but the output indicates otherwise. I think node should
always contain HTML for a single table only, but I appear to be wrong.

Also, if I put a subscript on the first xpath
doc.xpath("//table[@class='result'][5]").each do |node|
to ensure only one table is found, I still get itemNumbrs for all 15 table
elements.

WHAT AM I MISSING HERE

rustysam · October 20, 2010, 3:50am

Posted incorrect code for the number array, and should have said last item,
not middle item.

s1 = [1,2,3] ; s2 = [4,5,6]; s3 = [7,8,9]
str = [s1,s2,s3]
str.each do |itm|
  puts "*********"
  puts itm.to_s # ADDED LINE
  puts " #{itm[2]}" # Select last item from each s1, s2, s3
  puts "*********"
end

Giving the output below - more in line with table & item printout

[1, 2, 3] # table
3 # item selected

rustysam · October 20, 2010, 9:49am

On Wed, Oct 20, 2010 at 3:40 AM, Don N. wrote:

> Also if i put a subscript on the first xpath
> doc.xpath("//table[@class='result'][5]").each do |node|
> to ensure only one table is found, still get itemnumbrs for all 15 table
> elements
>
> WHAT AM I MISSING HERE

Please do not shout. You are probably missing how XPath works. With
the queries given by you above you will always get *all* td nodes with
class "itemNumbr" in the document. You need a two level approach
using *relative* queries:

irb(main):004:0> html = "<body>" << (1..3).map {|i|
"<table><td>#{i}</td></table>"}.join(" ") << "</body>"
=> "<body><table><td>1</td></table> <table><td>2</td></table> <table><td>3</td></table></body>"
irb(main):005:0> doc = Nokogiri.parse html
=> #<Nokogiri::XML::Document:0x8231e0c name="document"
children=[#<Nokogiri::XML::Element:0x8231b6c name="body"
children=[#<Nokogiri::XML::Element:0x82319e4 name="table"
children=[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]>,
#<Nokogiri::XML::Text:0x8231496 " ">,
#<Nokogiri::XML::Element:0x82313fc name="table"
children=[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]>,
#<Nokogiri::XML::Text:0x8230eae " ">,
#<Nokogiri::XML::Element:0x8230e14 name="table"
children=[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]>]>]>

irb(main):006:0> doc.xpath '//td'
=> [#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>,
#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>,
#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]

irb(main):007:0> doc.xpath('//table').each do |tab|
irb(main):008:1* p tab, tab.xpath('td') # relative!
irb(main):009:1> puts '----'
irb(main):010:1> end
#<Nokogiri::XML::Element:0x82319e4 name="table"
children=[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]>
[#<Nokogiri::XML::Element:0x823185c name="td"
children=[#<Nokogiri::XML::Text:0x82316d4 "1">]>]
----
#<Nokogiri::XML::Element:0x82313fc name="table"
children=[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]>
[#<Nokogiri::XML::Element:0x8231274 name="td"
children=[#<Nokogiri::XML::Text:0x82310ec "2">]>]
----
#<Nokogiri::XML::Element:0x8230e14 name="table"
children=[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]>
[#<Nokogiri::XML::Element:0x8230c8c name="td"
children=[#<Nokogiri::XML::Text:0x8230b04 "3">]>]
----
=> 0

Cheers

robert

rustysam · October 20, 2010, 4:14pm

Thanks Robert

I now have the code working with one additional line.

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  # next line has been added.
  doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

  puts "*************"
  puts doc2.xpath("//td[@class='itemNumbr']")
  puts "*************"
end

I realized early on (even though I cannot figure out why) that I had to
save off each table in the "do" before processing it to get around this
problem. I tried many things, including an array, which worked fine to
save the <table>s, but I could not xpath the saved <table>s.

What I did not realize is that if I took the <table> raw it was no longer
valid XML. What twigged me is your code adding in <body> to give the
correct html header.

Now if you can shed some light on my underlying question.
Above I embed the node in a new html object; the new object contains
nothing other than a single <table>.

Also, if I print the node it contains only a single <table>.

Yet if I attempt to execute
puts node.xpath("//td[@class='itemNumbr']")
it finds the "itemnumbr" for all <table> items even though they do not
exist in node. Does node actually contain the entire html, just not
visible?

Thanks for any insight you can provide.
Don

rustysam · October 20, 2010, 6:10pm

Thanks for your help

puts node.xpath(".//td[@class='itemNumbr']")

which selects the current node first, works nicely.

I will spend some time at
http://www.w3schools.com/xpath/xpath_syntax.asp

Thanks Don

rustysam · October 20, 2010, 5:30pm

On Wed, Oct 20, 2010 at 4:14 PM, Don N. wrote:

> # next line has been added.
>
> doc2 = Nokogiri::HTML("<body>" << "#{node}" << "</body>")

I'm sorry but this is ridiculous!

> puts "*************"
> puts doc2.xpath("//td[@class='itemNumbr']")
> puts "*************"
> end
>
> I realized (even though I can not figure out why) early on that I had to
> save off each table in the "do" before processing it to get around this
> problem. I tried many things including an array which worked fine to
> save the <table>s but I could not xpath the saved <table>s.

As I said: you need a relative XPath. Your problem is the global
XPath. You need to shave off the leading "//" or prefix it with ".".

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "*************"
  puts node.xpath("td[@class='itemNumbr']")
  puts "*************"
end

doc = Nokogiri::HTML(open("c:/RUBY_OUT.TXT"))
doc.xpath("//table[@class='result']").each do |node|
  puts "*************"
  puts node.xpath(".//td[@class='itemNumbr']")
  puts "*************"
end

> Yet if I attempt to execute
> puts node.xpath("//td[@class='itemNumbr']")
> it finds the "itemnumbr" for all <table> items even though they do not
> exist in node. Does node actually contain the entire html, just not
> visible.

http://www.w3schools.com/xpath/xpath_syntax.asp

Cheers

robert

rustysam · October 20, 2010, 6:50pm

On 20.10.2010 18:10, Don N. wrote:

> Thanks for your help
>
> puts node.xpath(".//td[@class='itemNumbr']")
>
> which selects current node first works nicely
>
> I will spend some time at
> http://www.w3schools.com/xpath/xpath_syntax.asp

There’s also

http://zvon.org/xxl/XPathTutorial/General/examples.html
http://www.tizag.com/xmlTutorial/xpathtutorial.php

And tons more.

Cheers

robert

rustysam · October 21, 2010, 12:18am

My issue is not with anything you have shown me, nor with my ability
to get the code working or to get the next piece of code working predictably.

My only issue is conceptual. I have no problem understanding that node
contains the entire tree structure, since I am able to return all the
itemNumbr elements with
node.xpath("//td[@class='itemNumbr']")

What I am not able to do is output proof to convince myself that the
node contains anything other than the selected table (i.e. inspect, parse,
to_s, to_html all return the single table).

This is my last post - I really have no issue other than the conceptual
one above and I will re-visit the contents of node again when I have
time.

Thanks again for all your help.

rustysam · October 20, 2010, 10:41pm

On 20.10.2010 21:46, Don N. wrote:

> used the array concept which is obviously not a parallel.
>
> Obviously I still do not understand the contents of node.

From this posting of yours it's not clear to me what issue you have.
Any XML or HTML parser that rips a document apart and builds a DOM of
some kind will create a nested, strictly hierarchical tree structure.
The only thing that may seem odd is that XPath queries beginning with
"//" search through the complete document regardless of the node you
invoke the method on.

Cheers

robert

rustysam · October 21, 2010, 10:01am

On Thu, Oct 21, 2010 at 12:18 AM, Don N. wrote:

> node contains other than the selected table (ie inspect, parse, to_s,
> to_html all return the single table).

The problem might lie in the term "contains". Conceptually one would
probably say that a node contains all its sub nodes. Technically a
node can also (indirectly) contain the whole document. This happens
if you include a reference to the parent node or the document. Here's
an example with parent node inclusion.

https://gist.github.com/rklemme/638085 (node.rb)

require 'pp'

Node = Struct.new :value, :parent, :children do
  def initialize(value = nil, parent = nil)
    self.value = value
    self.children = []
    yield self if block_given?
  end

  def add(child)
  # (the excerpt ends here; the full file is in the gist)

If you add a line "pp ch" to the iteration code at the end of the
file, you will see that each node "contains" all the rest of the
document.
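
The idea can be sketched without Nokogiri at all. The following is not the code from the gist, only a minimal illustration with a hypothetical TreeNode struct: a node's own subtree is small, yet its parent link leads to the whole tree.

```ruby
# Hypothetical minimal tree node (not the gist's Node class).
TreeNode = Struct.new(:name, :parent, :children) do
  # Attach a child and record the back-reference to its parent.
  def add(child)
    child.parent = self
    (self.children ||= []) << child
    child
  end

  # Follow the parent links up to the top of the tree.
  def root
    parent ? parent.root : self
  end

  # Names of this node and of everything below it.
  def subtree_names
    [name] + (children || []).flat_map(&:subtree_names)
  end
end

document = TreeNode.new("document")
body = document.add(TreeNode.new("body"))
table1 = body.add(TreeNode.new("table-1"))
table2 = body.add(TreeNode.new("table-2"))

p table2.subtree_names       # only the node itself and its descendants
p table2.root.subtree_names  # the whole tree, reached through the parent link
```

Printing a node downwards corresponds to to_html; an absolute XPath corresponds to going to root first and searching from there.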

> This is my last post - I really have no issue other than the conceptual

Hopefully not.

> Thanks again for all your help.

You're welcome!

Kind regards

robert

rustysam · October 20, 2010, 9:46pm

I have actually taken this first tutorial plus a few more

XPath Tutorial

From everything I did to verify the contents of
node(parse,to_s,to_html), I thought it to contain a single table
selected from the html page and could not prove different.

That is why I was not looking at relative paths since I thought I was
dealing with only a single table on “each”. That is why I mistakenly
used the array concept which is obviously not a parallel.

Obviously I still do not understand the contents of node.
