Forum: Ruby on Rails hpricot won't scrape! (newb question)

jrgoodner (Guest)
on 2009-04-01 10:29
(Received via mailing list)
Hey all!  Just to preface, I am fairly new to RoR, and brand new to
using hpricot.

I am using the following code to scrape this xpath:
"/html/body/div/div[5]/div/div[2]/div[2]/div[2]"

from this url:
"http://www.greatnonprofits.org/"

Here is my code to do so (taken from igvita.com's related blogpost):
*************
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://www.greatnonprofits.org/"
@response = ''

begin
  # open-uri RDoc: http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc...
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
    "From" => "email@addr.com",
    "Referer" => "http://www.igvita.com/blog/") { |f|
    puts "Fetched document: #{f.base_uri}"
    puts "\t Content Type: #{f.content_type}\n"
    puts "\t Charset: #{f.charset}\n"
    puts "\t Content-Encoding: #{f.content_encoding}\n"
    puts "\t Last Modified: #{f.last_modified}\n\n"

    # Save the response body
    @response = f.read
  }

  # HPricot RDoc: http://code.whytheluckystiff.net/hpricot/
  doc = Hpricot(@response)

  # Retrieve content
  puts (doc/"/html/body/div/div[5]/div/div[2]/div[2]/div[2]").to_html()

rescue Exception => e
  print e, "\n"
end
***************

In my irb terminal, I get this:

***************
irb(main):031:0> load 'greatnonprofitsscraper.rb'
Fetched document: http://www.greatnonprofits.org/
   Content Type: text/html
   Charset: utf-8
   Content-Encoding:
   Last Modified: Tue Mar 31 23:43:52 -0700 2009


=> true
***************

Anyone know why this is happening?  The code works with other urls/
xpaths.  Can anyone specify for me why www.greatnonprofits.com is
different?

Thanks a million!  I am quite frustrated, and I appreciate any help!!!
Frederick Cheung (Guest)
on 2009-04-01 10:51
(Received via mailing list)
On Apr 1, 7:50 am, jrgoodner <jrgood...@gmail.com> wrote:
>
> Anyone know why this is happening?  The code works with other urls/
> xpaths.  Can anyone specify for me why www.greatnonprofits.com is
> different?

Well, just asking the obvious question: are you sure the HTML on that
page matches the structure in that XPath?

Fred
Phlip (Guest)
on 2009-04-01 14:50
(Received via mailing list)
jrgoodner wrote:
> Hey all!  Just to preface, I am fairly new to RoR, and brand new to
> using hpricot.
>
> I am using the following code to scrape this xpath:
> "/html/body/div/div[5]/div/div[2]/div[2]/div[2]"
>
> from this url:
> "http://www.greatnonprofits.org/"

Switch to Nokogiri.