Hpricot scraping returns nil


#1

Good evening

First, I’ll mention that I have used the search function and found some
useful topics, but I still can’t find a solution because of my lack of
Ruby and Hpricot/XPath knowledge.

The problem is the following: from
http://users.telenet.be/weerstation.drongen/index.htm/Current_Vantage_Pro.htm
I need to scrape the temperature and Today’s Rain values (I need those
for an engineering project). With XPather and Firebug I looked up the
XPath to the temperature value:
/html/body/table/tbody/tr[3]/td[2]/font/strong/small/font (according to
XPather).

But when I try to print the value in Ruby, I get nil.

Here is my code:


#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://users.telenet.be/weerstation.drongen/index.htm/Current_Vantage_Pro.htm"
xpath = "/html/body/table/tbody/tr[3]/td[2]/font/strong/small/font"
@response = ""

begin
  # Fetch the page and keep the raw HTML
  open(@url) { |file|
    puts "Fetched Document: #{file.base_uri}"
    @response = file.read
  }

  # Parse the HTML and print the contents of the node the XPath selects
  doc = Hpricot(@response)
  puts (doc/xpath).inner_html
rescue Exception => e
  puts e
end


Since this returned nil, I checked at which step of the path the result
becomes nil. Apparently /html/body/table/tbody goes one step too far:
/html/body/table still returns output, while adding tbody returns nil.

I’ve read that I should try to rebuild the path now, but I really can’t
figure out how to do this. This is only my second serious Ruby script
(only the beginning of it, actually) and the first time I have used
Hpricot.

I’m looking forward to replies, and I’m sorry to bother you with yet
another Hpricot-nil topic, but I’m kinda hopeless because of my
deadline…

Kind regards,
Sergei


#2

It should work if you take the tbody off the XPath. I have read
somewhere that tbody does not work with Hpricot, but I don’t know why.
Good luck.
xpath = "/html/body/table//tr[3]/td[2]/font/strong/small/font"
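
If you copy paths from Firebug often, the same rewrite can be done with a
plain string substitution. This is only a minimal sketch: relax_tbody is
a made-up helper name, not something Hpricot provides, and it assumes
the tbody step is written as /tbody/ or /tbody[n]/ in the middle of the
path.

```ruby
# Hypothetical helper (not part of Hpricot): replace every "/tbody/" step,
# with or without a positional predicate such as tbody[1], by "//" so the
# path matches whether or not the parser's tree contains a tbody element.
def relax_tbody(xpath)
  xpath.gsub(%r{/tbody(?:\[\d+\])?/}, "//")
end

firebug_xpath = "/html/body/table/tbody/tr[3]/td[2]/font/strong/small/font"
relaxed_xpath = relax_tbody(firebug_xpath)
puts relaxed_xpath
```

Keep in mind that // matches at any depth, so on a page with nested
tables the relaxed path may select more than one row.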


#3

There is more to it than “tbody does not work for hpricot”.

When an HTML parser (Firefox and Hpricot in this case) parses an HTML
page, it has to build a tree from it (a.k.a. the DOM).
The problem is that a lot (most?) of the HTML out there is badly
formed, so building the DOM is ambiguous: what if tags are not nested
properly, or are never closed? Every parser handles these cases a bit
differently (that’s one reason why you have the ‘works in IE but not in
FF’ kind of problems). Firefox, for example, makes some effort to make
the parsed HTML standards compliant, such as inserting a tbody element
inside a table when it’s missing.
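
To make this concrete, here is a small sketch of what happens when a
path that contains tbody is evaluated against a tree that has none. It
uses REXML (bundled with Ruby) as a stand-in for Hpricot, and the table
markup and values are made up for the example, not taken from the
weather page.

```ruby
require "rexml/document"

# Made-up sample markup: the table has no tbody, as in the page source.
html = <<~HTML
  <html><body><table>
    <tr><td>Station</td><td>Drongen</td></tr>
    <tr><td>Time</td><td>20:15</td></tr>
    <tr><td>Temperature</td><td>12.3</td></tr>
  </table></body></html>
HTML

doc = REXML::Document.new(html)

# Path as a browser tool would report it: the tbody step matches nothing
# in this tree, so the whole expression selects no node.
with_tbody = REXML::XPath.first(doc, "/html/body/table/tbody/tr[3]/td[2]")

# Same path without the tbody step: matches the cell in the third row.
without_tbody = REXML::XPath.first(doc, "/html/body/table/tr[3]/td[2]")

p with_tbody
puts without_tbody.text
```

The XPath itself is fine in both cases. It is simply evaluated against a
different tree than the one the browser built.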

However, this is only one small difference between how Hpricot and
Firefox parse the HTML and build the DOM tree (on which XPaths are
evaluated). Hpricot tries to be as close to FF as possible, but this
doesn’t always happen (though _why said he considers these cases bugs).

Bottom line: you can’t expect an XPath copied from Firebug to work as is
with Hpricot/Mechanize (though it mostly does, and removing the tbody
step increases your chances even further).

Cheers,
Peter


http://www.rubyrailways.com
http://scrubyt.org


#4

Sergei Maertens wrote:

I’ll try it in a minute, thank you for the answer.

And it does work! Thank you very much, Jn Jacob.
Now I only have to solve the ‘�’ that appears instead of ‘°’.
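
A possible cause, though this is an assumption and not something
confirmed in this thread: the page is served as ISO-8859-1 (Latin-1)
and the bytes are printed as if they were UTF-8. On Ruby 1.9 or later
the conversion can be sketched as below. The sample value is made up.

```ruby
# Assumption: the page is ISO-8859-1 (Latin-1), where the degree sign is
# the single byte 0xB0. On its own that byte is not valid UTF-8.
raw = "12.3 \xB0C".b   # made-up bytes standing in for the scraped value

# Label the bytes with their actual encoding, then transcode to UTF-8.
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")
puts utf8
```

If the page turns out to use a different encoding, replace "ISO-8859-1"
with that encoding's name.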


#5

Jn Jacob wrote:

It should work if you take the tbody off the XPath. I have read
somewhere that tbody does not work with Hpricot, but I don’t know why.
Good luck.
xpath = "/html/body/table//tr[3]/td[2]/font/strong/small/font"

I’ll try it in a minute, thank you for the answer.

@Peter, thank you for the very complete explanation.