Forum: Ruby hpricot and xpath doesn't work like they should ?!?

anansi (Guest)
on 2007-07-29 20:22
(Received via mailing list)
Hi,
I wanted to write a little console TV guide with Ruby and Hpricot. I
installed the Firefox XPath Checker plugin and went to
http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTv... . Then
I checked the XPath of one of the channel fields, such as ZDF, and got:

/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

so I tried to parse the website for this and output the hits but I don't
get any output. Here's the code:

#!/usr/bin/env ruby

$VERBOSE = true

require 'hpricot'
require 'net/http'

url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTv...')
# request_uri keeps the query string; url.path alone would drop it
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

tv = Hpricot(res.body)
# the block must start on the same line as .each
tv.search("/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]").each { |a| puts a }

#eof
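
One thing worth checking in a script like this is which part of the URL actually gets sent. The sketch below uses a made-up URL (the real query string in the post is truncated, so the parameters here are purely hypothetical) to show that URI#path leaves out the query string while URI#request_uri keeps it:

```ruby
require 'uri'

# Hypothetical URL: the host and query parameters are invented for this
# illustration and are not the ones from the original post.
url = URI.parse('http://www.example.com/TvEvening1.php3?channel=zdf&day=today')

path_only  = url.path         # path without the query string
with_query = url.request_uri  # path plus "?" and the query string

puts path_only
puts with_query
```

If the page depends on its query parameters, a request built from the path alone may return a different document than the one seen in the browser.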


Am I using hpricot in the wrong way? I thought it could handle xpaths?


--
greets

                     one must still have chaos in oneself to be able to
give birth to a dancing star
Phlip (Guest)
on 2007-07-29 21:36
(Received via mailing list)
anansi wrote:

> Am I using hpricot in the wrong way? I thought it could handle xpaths?

Briefly, I suspect Hpricot uses an XPath subset invented on the fly to
permit querying into the HTML node space.

(This isn't a bad thing; the alternative, REXML::XPath, cannot handle
some well-formed XHTML [according to Tidy], and certainly can't handle
traditional HTML.)

(BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require
'hpricot' refused to find it. This might indicate a broken .so file, so
I switched to Windows.)

The best way to use XPath is to locate tags by unique id=''. (The page
you used abuses IDs as if they were CLASSes, so it's ill-formed. But
that's not your problem here.)

Don't use long XPath chains (even if an XPath visualizer provides them),
because these locate things by incidental features that could change
when you hit the page again. Table elements could come and go on the fly.
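
To make the fragility concrete, here is a minimal sketch. It uses REXML (which ships with most Ruby installs) rather than Hpricot so it runs without extra gems, and the two tiny well-formed documents are invented for the illustration. One stands for the page on a first visit, the other for the same page after a banner row has appeared:

```ruby
require 'rexml/document'

# Hypothetical page on a first visit.
before = <<~XML
  <html><body><table>
    <tr><td id="zdf">ZDF</td></tr>
  </table></body></html>
XML

# Hypothetical page on a later visit: an extra row has been added.
after = <<~XML
  <html><body><table>
    <tr><td>Advertisement</td></tr>
    <tr><td id="zdf">ZDF</td></tr>
  </table></body></html>
XML

POSITIONAL = '/html/body/table/tr[1]/td[1]' # depends on row order
BY_ID      = "//td[@id='zdf']"              # depends only on the id

# Return the text of the first node matching xpath, or nil if none match.
def first_text(xml, xpath)
  node = REXML::XPath.first(REXML::Document.new(xml), xpath)
  node && node.text
end

puts first_text(before, POSITIONAL)
puts first_text(after,  POSITIONAL)
puts first_text(before, BY_ID)
puts first_text(after,  BY_ID)
```

The positional path silently starts returning the wrong cell once the layout shifts, while the id-based path keeps pointing at the same one.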

When I installed that XPath Checker (thanks for pointing it out!) and
hit that page, your XPath selects ZDF, so this implicates Hpricot.

Let's find a workaround. If I want to hit, say, "Hotel Zack und Cody", I
use Firebug's Inspect Element context menu feature, and see that blurb
has a <td title="19:45 Hotel Zack und Cody">. So if I XPath for things
like that, we get:

    //td[ @title ]

That sweeps for every td with a title attribute. (The View XPath feature
should have an option to find minimal and unique paths based on
attributes, not long obsessive paths based on indices.)

And that works in Hpricot, too, to select every cell with a title.
Further poking and parsing should get you the raw TV listings.

  tv.search("//td[ @title ]").each { |a| p a }
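
As a rough idea of what the "further poking and parsing" could look like, here is a sketch that splits title values of the form "HH:MM name" into a time and a programme name. Only the first sample string comes from the page as quoted above. The other two are invented, and the helper name parse_title is hypothetical:

```ruby
# Sample title attribute values. Only the first is taken from the page;
# the other two are made up for the illustration.
titles = [
  '19:45 Hotel Zack und Cody',
  '20:15 Beispielsendung',
  'no time here'
]

# Split "HH:MM Programme name" into its two parts.
# Returns nil for strings that do not start with a time.
def parse_title(title)
  match = /\A(\d{1,2}:\d{2})\s+(.+)\z/.match(title)
  return nil unless match

  { time: match[1], name: match[2] }
end

# Drop the entries that did not parse, then print the rest.
listings = titles.map { |t| parse_title(t) }.compact
listings.each { |l| puts "#{l[:time]}  #{l[:name]}" }
```

With Hpricot the titles array would presumably come from the cells selected above, for example by reading each cell's title attribute. That step is left out here so the sketch runs without the gem.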

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
data feeds available somewhere?
anansi (Guest)
on 2007-07-29 22:11
(Received via mailing list)
Phlip wrote:
> BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
> data feeds available somewhere?
Thanks for your hint about the id tags, but what do you mean by this?
RSS feeds? I'm not aware of any.



Phlip (Guest)
on 2007-07-29 22:13
(Received via mailing list)
anansi wrote:

> Phlip wrote:
>> BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
>> data feeds available somewhere?
> thanks for your hint with the id-tags but what you mean with this here?
> rss-feeds ? I'm not aware of any of them ..

That's what I mean - neither am I aware of any. But the TV guide
services get their data from somewhere, and (under the wild assumption
that TV programmers want you to find their shows and watch them) these
feeds should not be proprietary.

But note that electronic TV guides predate RSS...
Felix Windt (Guest)
on 2007-07-29 22:33
(Received via mailing list)
> thanks for your hint with the id-tags but what you mean with
>
http://www.klack.de/TvKlackRSS.php

Though there aren't any that fit your bill of "generic evening
programming".
anansi (Guest)
on 2007-07-30 12:04
(Received via mailing list)
Felix Windt wrote:
>
> http://www.klack.de/TvKlackRSS.php
>
> Though there aren't any that fit your bill of "generic evening programming".
>

Yeah, I can't find an RSS feed for a generic TV guide either.
