Hey all! Just to preface, I am fairly new to RoR, and brand new to
using hpricot.
I am using the following code to scrape this xpath:
“/html/body/div/div[5]/div/div[2]/div[2]/div[2]”
from this url:
“http://www.greatnonprofits.org/”
Here is my code to do so (taken from igvita.com’s related blogpost):
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url = "http://www.greatnonprofits.org/"
@response = ''

begin
  # open-uri RDoc:
  # http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
       "From" => "[email protected]",
       "Referer" => "Ilya Grigorik") { |f|
    puts "Fetched document: #{f.base_uri}"
    puts "\t Content Type: #{f.content_type}\n"
    puts "\t Charset: #{f.charset}\n"
    puts "\t Content-Encoding: #{f.content_encoding}\n"
    puts "\t Last Modified: #{f.last_modified}\n\n"
    # Save the response body
    @response = f.read
  }

  # Hpricot RDoc: http://code.whytheluckystiff.net/hpricot/
  doc = Hpricot(@response)

  # Retrieve content
  puts (doc/"/html/body/div/div[5]/div/div[2]/div[2]/div[2]").to_html()
rescue Exception => e
  print e, "\n"
end
In my irb terminal, I get this:
irb(main):031:0> load 'greatnonprofitsscraper.rb'
Fetched document: http://www.greatnonprofits.org/
Content Type: text/html
Charset: utf-8
Content-Encoding:
Last Modified: Tue Mar 31 23:43:52 -0700 2009
=> true
Anyone know why nothing gets printed for the XPath? The headers come
back fine, but the matched HTML is empty. The code works with other
URLs/XPaths. Can anyone tell me why www.greatnonprofits.org is
different?
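One rough way to narrow this down is to test the path one step at a time and see where it stops matching. The sketch below shows the idea using REXML from the standard library instead of Hpricot, on a made-up document. The sample markup and the helper name first_failing_step are hypothetical, not taken from the site above, and real-world HTML may not parse as strict XML, so treat this as an illustration of the approach only.

```ruby
require 'rexml/document'

# Hypothetical sample markup, standing in for a fetched page.
SAMPLE = <<-XML
<html>
  <body>
    <div>
      <div>first</div>
      <div>second</div>
    </div>
  </body>
</html>
XML

# Hypothetical helper: returns the shortest prefix of an absolute XPath
# that matches nothing, or nil if every step of the path matches.
def first_failing_step(doc, xpath)
  steps = xpath.split('/').reject { |s| s.empty? }
  prefix = ''
  steps.each do |step|
    prefix = "#{prefix}/#{step}"
    return prefix if REXML::XPath.first(doc, prefix).nil?
  end
  nil
end

doc = REXML::Document.new(SAMPLE)

# Every step of this path exists in the sample document.
puts first_failing_step(doc, '/html/body/div/div[2]').inspect

# The sample has no fifth inner div, so the walk stops at that step.
puts first_failing_step(doc, '/html/body/div/div[5]/div').inspect
```

Running the same kind of step-by-step check against the real page (with whatever parser is in use) would show which segment of the long path is the first one that does not exist in the fetched HTML.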
Thanks a million! I am quite frustrated, and I appreciate any help!!!