Hpricot won't scrape! (newb question)


#1

Hey all! Just to preface, I am fairly new to RoR, and brand new to
using hpricot.

I am using the following code to scrape this xpath:
“/html/body/div/div[5]/div/div[2]/div[2]/div[2]”

from this url:
http://www.greatnonprofits.org/

Here is my code to do so (taken from igvita.com’s related blogpost):


require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://www.greatnonprofits.org/"
@response = ''

begin
  # open-uri RDoc:
  # http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
       "From" => "removed_email_address@domain.invalid",
       "Referer" => "http://www.igvita.com/blog/") { |f|
    puts "Fetched document: #{f.base_uri}"
    puts "\t Content Type: #{f.content_type}\n"
    puts "\t Charset: #{f.charset}\n"
    puts "\t Content-Encoding: #{f.content_encoding}\n"
    puts "\t Last Modified: #{f.last_modified}\n\n"

    # Save the response body
    @response = f.read
  }

  # HPricot RDoc: http://code.whytheluckystiff.net/hpricot/
  doc = Hpricot(@response)

  # Retrieve content
  puts (doc/"/html/body/div/div[5]/div/div[2]/div[2]/div[2]").to_html
rescue Exception => e
  print e, "\n"
end


In my irb terminal, I get this:


irb(main):031:0> load 'greatnonprofitsscraper.rb'
Fetched document: http://www.greatnonprofits.org/
	 Content Type: text/html
	 Charset: utf-8
	 Content-Encoding:
	 Last Modified: Tue Mar 31 23:43:52 -0700 2009

=> true


Anyone know why this is happening? The code works with other URLs/
XPaths. Can anyone tell me why www.greatnonprofits.org is
different?

Thanks a million! I am quite frustrated, and I appreciate any help!!!


#2

On Apr 1, 7:50 am, jrgoodner removed_email_address@domain.invalid wrote:

Anyone know why this is happening? The code works with other URLs/
XPaths. Can anyone tell me why www.greatnonprofits.org is
different?

Well, just asking the stupid question: are you sure the HTML on that
page matches the structure in that XPath?

Fred
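
One way to check that is to test the XPath one step at a time and see where it stops matching. Below is a minimal sketch of the idea; the helper names xpath_prefixes and first_failing_prefix are made up for illustration and are not part of Hpricot. It assumes a simple absolute path with no "/" inside a predicate.

```ruby
# Split an absolute XPath into its cumulative prefixes, e.g.
# "/html/body/div" -> ["/html", "/html/body", "/html/body/div"].
# Hypothetical helper: assumes no "/" appears inside a predicate.
def xpath_prefixes(xpath)
  steps = xpath.split("/").reject(&:empty?)
  (1..steps.size).map { |n| "/" + steps.first(n).join("/") }
end

# Return the shortest prefix for which the block reports zero matches,
# or nil if every prefix matches something.
# The block receives a prefix and must return a match count.
def first_failing_prefix(xpath)
  xpath_prefixes(xpath).find { |prefix| yield(prefix).zero? }
end

# With the Hpricot document from the original script, the call
# would look something like this (not run here):
#
#   path = "/html/body/div/div[5]/div/div[2]/div[2]/div[2]"
#   puts first_failing_prefix(path) { |prefix| (doc/prefix).size }
```

Whichever prefix comes back is the first step that the fetched HTML does not satisfy, which is usually the place where the page differs from what the browser showed.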


#3

jrgoodner wrote:

Hey all! Just to preface, I am fairly new to RoR, and brand new to
using hpricot.

I am using the following code to scrape this xpath:
“/html/body/div/div[5]/div/div[2]/div[2]/div[2]”

from this url:
http://www.greatnonprofits.org/

Switch to Nokogiri.