Hpricot won't scrape! (newb question)

jrgoodner · April 1, 2009, 10:29am

Hey all! Just to preface, I am fairly new to RoR, and brand new to
using hpricot.

I am using the following code to scrape this xpath:
“/html/body/div/div[5]/div/div[2]/div[2]/div[2]”

from this url:
“http://www.greatnonprofits.org/”

Here is my code to do so (taken from igvita.com’s related blogpost):

require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://www.greatnonprofits.org/"
@response = ''

begin
  # open-uri RDoc:
  # http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
             "From" => "[email protected]",
             "Referer" => "Ilya Grigorik") { |f|
    puts "Fetched document: #{f.base_uri}"
    puts "\t Content Type: #{f.content_type}\n"
    puts "\t Charset: #{f.charset}\n"
    puts "\t Content-Encoding: #{f.content_encoding}\n"
    puts "\t Last Modified: #{f.last_modified}\n\n"

    # Save the response body
    @response = f.read
  }

  # HPricot RDoc: http://code.whytheluckystiff.net/hpricot/
  doc = Hpricot(@response)

  # Retrieve content
  puts((doc/"/html/body/div/div[5]/div/div[2]/div[2]/div[2]").to_html)
rescue Exception => e
  print e, "\n"
end

In my irb terminal, I get this:

irb(main):031:0> load 'greatnonprofitsscraper.rb'
Fetched document: http://www.greatnonprofits.org/
Content Type: text/html
Charset: utf-8
Content-Encoding:
Last Modified: Tue Mar 31 23:43:52 -0700 2009

=> true

Anyone know why this is happening? The code works with other urls/
xpaths. Can anyone specify for me why www.greatnonprofits.com is
different?

Thanks a million! I am quite frustrated, and I appreciate any help!!!

jrgoodner · April 1, 2009, 10:51am

On Apr 1, 7:50 am, jrgoodner [email protected] wrote:

Anyone know why this is happening? The code works with other urls/
xpaths. Can anyone specify for me why www.greatnonprofits.com is
different?

Well, just asking the stupid question: are you sure the html on that
page matches the structure in that xpath ?

Fred
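
One way to check this is to try each leading prefix of the XPath in turn and see where matching stops. Below is a minimal sketch of that idea. The helper names xpath_prefixes and deepest_match are made up for illustration and are not part of Hpricot. The matcher is passed in as a block, so the demo at the bottom uses a hard-coded list of "present" paths instead of a real document.

```ruby
# Hypothetical helper: every leading prefix of an absolute XPath, shortest first.
def xpath_prefixes(xpath)
  steps = xpath.split("/").reject(&:empty?)
  (1..steps.length).map { |n| "/" + steps.first(n).join("/") }
end

# Hypothetical helper: the longest prefix for which the block reports a match,
# or nil if even the first step does not match.
def deepest_match(xpath)
  xpath_prefixes(xpath).take_while { |prefix| yield(prefix) }.last
end

# Stand-in for a parsed page: the paths we pretend exist in the document.
present = ["/html", "/html/body", "/html/body/div"]

puts deepest_match("/html/body/div/div[5]/div") { |p| present.include?(p) }
# prints "/html/body/div"
```

With the script above, the block could instead be { |p| (doc/p).any? }, assuming the result of doc/p behaves like an array (I believe Hpricot returns an Array-like Elements object, but treat that as an assumption). The first prefix that fails to match shows where the served HTML diverges from the path you copied from the browser.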

jrgoodner · April 1, 2009, 2:50pm

jrgoodner wrote:

Hey all! Just to preface, I am fairly new to RoR, and brand new to
using hpricot.

I am using the following code to scrape this xpath:
“/html/body/div/div[5]/div/div[2]/div[2]/div[2]”

from this url:
“http://www.greatnonprofits.org/”

Switch to Nokogiri.