Hpricot and XPath don't work like they should?!

anansi · July 29, 2007, 8:22pm

hi,
I wanted to write a little console TV guide with Ruby and Hpricot. I
installed the Firefox XPath Checker plugin and went to
http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3 . Then
I checked the XPath of one of the channel fields, like ZDF, and got:

/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

So I tried to search the page for this path and print the hits, but I
don't get any output. Here's the code:

#!/usr/bin/env ruby

$VERBOSE = true  # enable warnings (the global is spelled in all caps)

require 'hpricot'
require 'net/http'

url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3')
# request_uri keeps the query string; url.path alone would drop it
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

tv = Hpricot(res.body)
tv.search("/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]").each { |a| puts a }

#eof

Am I using hpricot in the wrong way? I thought it could handle xpaths?
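
When a URL with a query string is fetched through Net::HTTP::Get, it matters which part of the parsed URI is handed to the request. The sketch below only parses the URL from this post and prints both candidates; it makes no network call, and the helper name path_and_request_uri is made up for illustration. Whether the server returns a different page without the HPTFRAME parameter is an assumption that has not been checked here.

```ruby
require 'uri'

# Returns [path, request_uri] for a URL string so the two can be compared.
# (Hypothetical helper, not part of Net::HTTP or Hpricot.)
def path_and_request_uri(url_string)
  uri = URI.parse(url_string)
  [uri.path, uri.request_uri]
end

path, request_uri = path_and_request_uri(
  'http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3'
)

puts path         # the path only, without the query string
puts request_uri  # the path followed by "?" and the query string
```

If the page depends on the query string, building the request from request_uri is the safer choice.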

--
greets

one must still have chaos in oneself to be able to
give birth to a dancing star

anansi · July 29, 2007, 10:11pm

Phlip wrote:

BTW scraping TV guide listings is … kind’a tacky. Aren’t the actual
data feeds available somewhere?

Thanks for the hint about the id attributes, but what do you mean by
this? RSS feeds? I’m not aware of any…

anansi · July 29, 2007, 9:36pm

anansi wrote:

Am I using hpricot in the wrong way? I thought it could handle xpaths?

Briefly, I suspect Hpricot uses an XPath subset invented on the fly to
permit querying into the HTML node space.

(This isn’t a bad thing; the alternative, REXML::XPath, cannot handle
some well-formed XHTML [according to Tidy], and certainly can’t handle
traditional HTML.)

(BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require
'hpricot' refused to find it. This might indicate a broken .so file, so
I switched to Windows.)

The best way to use XPath is to locate tags by a unique id="". (The page
you used abuses IDs as if they were classes, so it’s ill-formed. But
that’s not your problem here.)

Don’t use long XPath chains (even if an XPath visualizer provides them),
because these locate things by incidental features that could change
when you hit the page again. Table elements could come and go on the
fly.
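
The point about index-based chains can be sketched with two tiny tables. This uses REXML from the standard distribution instead of Hpricot, purely so it runs without extra gems, and the markup is a hand-written, well-formed stand-in with made-up ids (ard, zdf), not the real klack.de page. The helper first_text is hypothetical as well.

```ruby
require 'rexml/document'

# Hand-written stand-in for a channel table (not the real page).
BEFORE = <<~HTML
  <table>
    <tr><th id="ard">ARD</th></tr>
    <tr><th id="zdf">ZDF</th></tr>
  </table>
HTML

# The same table after the site adds an extra row on top.
AFTER = <<~HTML
  <table>
    <tr><td>Advertisement</td></tr>
    <tr><th id="ard">ARD</th></tr>
    <tr><th id="zdf">ZDF</th></tr>
  </table>
HTML

POSITIONAL = '/table/tr[2]/th[1]'  # depends on row order
BY_ID      = "//th[@id='zdf']"     # depends only on the id attribute

# Returns the text of the first node matching xpath, or nil if none matches.
def first_text(markup, xpath)
  node = REXML::XPath.first(REXML::Document.new(markup), xpath)
  node && node.text
end

puts first_text(BEFORE, POSITIONAL)  # ZDF
puts first_text(AFTER,  POSITIONAL)  # ARD, the index now points at another row
puts first_text(AFTER,  BY_ID)       # ZDF, unaffected by the new row
```

The id-based query keeps working only as long as the id really is unique on the page, which is the caveat raised above.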

When I installed that XPath Checker (thanks for pointing it out!) and
hit that page, your XPath selects ZDF, so this implicates Hpricot.

Let’s find a workaround. If I want to hit, say, “Hotel Zack und Cody”, I
use Firebug’s Inspect Element context menu feature, and see that the
blurb sits in a td with a title attribute. So if I XPath for things like
that, we get:

//td[ @title ]

That sweeps for every td with a title attribute. (The View XPath feature
should have an option to find minimal and unique paths based on
attributes, not long obsessive paths based on indices.)

And that works in Hpricot, too, to select every cell with a title.
Further poking and parsing should get you the raw TV listings.

tv.search("//td[ @title ]").each{ |a| p a}
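
As a rough idea of what the "further poking and parsing" might look like, here is a sketch that runs the same //td[@title] query and pulls out each cell's title and text. It uses REXML rather than Hpricot so that it runs with a stock Ruby, and the fragment is invented: only "Hotel Zack und Cody" comes from the page, while the times, the second show and the helper name titled_cells are placeholders.

```ruby
require 'rexml/document'

# Invented, well-formed fragment; the real page's markup will differ.
LISTING = <<~HTML
  <table>
    <tr>
      <td title="Hotel Zack und Cody">20:15</td>
      <td>no title here</td>
      <td title="Some Other Show">21:00</td>
    </tr>
  </table>
HTML

# Collects [title, cell text] pairs for every td carrying a title attribute.
def titled_cells(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//td[@title]').map do |td|
    [td.attributes['title'], td.text.to_s.strip]
  end
end

titled_cells(LISTING).each { |title, text| puts "#{text}  #{title}" }
```

With Hpricot the loop body would presumably read the attribute and text from each element in a similar way; the exact calls are not shown here because they were not checked against Hpricot itself.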

BTW scraping TV guide listings is … kind’a tacky. Aren’t the actual
data feeds available somewhere?

anansi · July 29, 2007, 10:33pm

thanks for your hint with the id-tags but what you mean with

http://www.klack.de/TvKlackRSS.php

Though there aren’t any that fit your bill of “generic evening
programming”.

anansi · July 30, 2007, 12:04pm

Felix W. wrote:

http://www.klack.de/TvKlackRSS.php

Though there aren’t any that fit your bill of “generic evening programming”.

Yeah, I can’t find an RSS feed for a generic TV guide either…

anansi · July 29, 2007, 10:13pm

anansi wrote:

Phlip wrote:

BTW scraping TV guide listings is … kind’a tacky. Aren’t the actual
data feeds available somewhere?
thanks for your hint with the id-tags but what you mean with this here?
rss-feeds ? I’m not aware of any of them …

That’s what I mean - neither am I aware of any. But the TV guide
services get their data from somewhere, and (under the wild assumption
that TV programmers want you to find their shows and watch them) these
feeds should not be proprietary.

But note that electronic TV guides predate RSS…