Yahoo API and Ruby

I’m working on a couple of large sites that aren’t sending the correct
response codes for missing pages. I want to use the Yahoo API to
search Yahoo’s cache for clues about which pages are sending bad
responses. To do that, I need to get all 1000 results from the Yahoo
cache and write them to a spreadsheet, where I can sort the URL data
and response codes.

I can only request 100 results at a time and can set a different
“start” number for each request. The first request would be start=1,
the second, start=101 and so on.
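
The paging arithmetic described here can be sketched as follows. This is only an illustration: `start_offsets` is a hypothetical helper name, and it assumes the API counts results from 1, as the "start" values above suggest.

```ruby
# Hypothetical helper: list the 1-based "start" value for each request
# needed to page through `total` results, `page_size` at a time.
def start_offsets(total, page_size)
  1.step(total, page_size).to_a
end

# One request per offset: start=1, start=101, and so on.
start_offsets(1000, 100).each do |start|
  puts "start=#{start}"
end
```

Computing the offsets up front keeps the request loop free of counter bookkeeping.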

The other problem is that it won’t get the response codes. I just get
this unhelpful error message:
c:/ruby/lib/ruby/1.8/net/http.rb:1467:in `initialize': HTTP request path is empty (ArgumentError)
        from c:/ruby/lib/ruby/1.8/net/http.rb:1585:in `initialize'
        from hpricot_test.rb:32:in `new'
        from hpricot_test.rb:32:in `get_headers'
        from hpricot_test.rb:80:in `generate_workbook'
        from hpricot_test.rb:70:in `each'
        from hpricot_test.rb:70:in `generate_workbook'
        from hpricot_test.rb:94
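
One possible reading of this trace, offered as a guess rather than a confirmed diagnosis: the ArgumentError comes from Net::HTTP::Get.new, which rejects an empty path, and URI.parse returns an empty path for a URL that is just a host, such as http://www.example.com. The sketch below shows one way to guard against that. `request_path` is a hypothetical helper, and the `strip` call assumes the URL text taken from the XML might carry stray whitespace.

```ruby
require 'uri'

# Hypothetical helper: build a request path that Net::HTTP::Get.new accepts.
def request_path(url)
  page = URI.parse(url.strip)           # strip stray whitespace before parsing
  path = page.path.to_s
  path = '/' if path.empty?             # a bare host parses to an empty path
  path += "?#{page.query}" if page.query
  path
end

puts request_path('http://www.example.com')
puts request_path('http://www.example.com/a?b=1')
```

If this is the cause, passing `request_path(url)` instead of `page.path` to Net::HTTP::Get.new should avoid the error for host-only URLs.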

Here is the code:

#!/usr/bin/ruby -w

require 'net/http'
require 'uri'
require 'hpricot'
require 'spreadsheet/excel'
include Spreadsheet

def get_cache
  # set variables for POST request
  appid = 'yahooAPI-key' # a Yahoo API key goes here
  query = 'http://www.example.com' # a Web site to check goes here

  # this gets the first 100 results, but I want to loop through
  # it 10 times with a different "start" number to get all 1000
  # available results
  results = 100
  start = 1

  post_args = {
    'appid' => appid,
    'query' => query,
    'results' => results,
    'start' => start
  }
  url = URI.parse('http://search.yahooapis.com/SiteExplorerService/V1/pageData')

  # send post request
  @resp, @data = Net::HTTP.post_form(url, post_args)

  # read XML
  @doc = Hpricot(@data)
end

def get_headers(url)
  # This gets the response code for the page to see if it exists
  # (200, 301, 404, etc.)
  page = URI.parse(url)
  req = Net::HTTP::Get.new(page.path)
  res = Net::HTTP.start(page.host, page.port) { |http|
    http.request(req)
  }
  return res.code
end

def generate_workbook
  # create new workbook and worksheet
  workbook = Spreadsheet::Excel.new("yahoo_cache.xls")
  worksheet = workbook.add_worksheet('Yahoo Cache')

  # set variables
  current_row = 2
  format_nil = nil
  format_header = Format.new(
    :color => 'white',
    :bg_color => 'gray',
    :bold  => true
  )
  workbook.add_format(format_header)
  workbook.add_format(format_nil)

  # Add header row
  worksheet.write(0,0,"Yahoo's Cache for Site", format_nil)
  worksheet.write(1,0,"TITLE",format_header)
  worksheet.write(1,1,"URL", format_header)
  # worksheet.write(1,2,"CODE", format_header)
  # worksheet.write(1,3,"LOCATION", format_header) # coming soon

  # Add xml_data to worksheet
  (@doc/"result").each do |el|
    result_title = (el/"title").text
    result_url   = (el/"url").text
    worksheet.write(current_row, 0, result_title, format_nil)
    worksheet.write(current_row, 1, result_url, format_nil)

    # get response codes -- this is causing an error with "result_url";
    # maybe it isn't a URL in a string?
    # see error message at top of this post
    # response_code ||= 0
    # response_code = get_headers(result_url) # this works if I put a URL here, but not with the result_url variable
    # worksheet.write(current_row, 2, response_code, format_nil)

    # move to the next row in the spreadsheet before going to the next XML item
    current_row += 1
  end

  # finished, close the workbook
  workbook.close
end

====
The above code works (except the part that gets response codes). The
following code is from a previous version where I tried to loop through
all 1000 results. (It was using xmlsimple.) I couldn’t figure out how to
store each set of XML, since each request returns an entire XML file. I
tried @pass[count], but it wasn’t working. Any ideas about a good way to
store each request?

# prepare to loop through 100 results
count = 1
start = 1

# pass[] = each of the 10 requests to Yahoo
@pass = []

# perform the loop
while count < 11 do
  post_args = {
    'appid' => appid,
    'query' => query,
    'results' => results,
    'start' => start
  }

  # send post request
  @resp, @data = Net::HTTP.post_form(url, post_args)

  # read XML
  xml_data = XmlSimple.xml_in(@data)
  @pass[count] = xml_data
  # puts "Count: #{count}"
  # print @pass[count]

  # puts "Start: #{start}"
  # puts
  count += 1
  start += 100
end
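
One way to store each request, sketched under some assumptions: append every parsed response to an array instead of assigning by counter. In the loop above `count` starts at 1, so `@pass[0]` is never assigned and stays nil, which may be part of what went wrong when reading the array back. `collect_pages` is a hypothetical helper, and the block passed to it stands in for the real request and parse step, so the sketch runs without a network connection or an API key.

```ruby
# Hypothetical helper: collect one parsed document per request.
# The block receives the "start" value and returns the parsed response
# (in the real script: Net::HTTP.post_form followed by XmlSimple.xml_in).
def collect_pages(pages, page_size)
  passes = []
  pages.times do |i|
    start = 1 + i * page_size
    passes << yield(start)   # append, so indexes run 0..pages-1 with no gaps
  end
  passes
end

# Stub in place of the Yahoo request, so the sketch runs offline.
passes = collect_pages(10, 100) { |start| { 'start' => start } }
puts passes.length
puts passes.first.inspect
```

Each element of `passes` is then one full response, and iterating with `passes.each` visits them in request order.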