I have written Ruby code (with Mechanize and Nokogiri) to do the
following:
1. Retrieve the search webpage
2. Enter search criteria into the form
3. Submit the form and retrieve the first webpage, which is a list of
book titles embedded in the page
4. For each title in the retrieved webpage, extract 5 fields
5. Retrieve the next webpage of titles
6. Repeat 4 & 5 until all titles are retrieved
The Mechanize code below works up to the point of submitting the form,
but the first webpage returned is missing at least 2 of the fields for
each title.
Now if I grab the URL generated by mech.submit and use it in Firefox, it
displays all the titles and information normally, BUT
the URL has been changed slightly before the titles are displayed.
THIS IS THE URL RETURNED BY MECHANIZE.SUBMIT
#<URI::HTTP:0x17706d8 URL:http://www.xyz.com/…>}
Now if I take the URL from the submit and use it in the Nokogiri code
below, it fails to open with BAD URI.
Also, if I take the URL from Firefox and use it in the Nokogiri code
below, it also fails to open with BAD URI.
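One possible reading of the BAD URI error, offered as a guess rather than a diagnosis: the text copied from the submit output is the inspect form of a URI object (#<URI::HTTP:… URL:…>), so the closing ">}" travels along with the URL. The stdlib-only sketch below uses a hypothetical URL (the real path is elided in this post) to show the difference between inspect and to_s. It also assumes that the Ruby in use applied RFC 2396 rules, which URI::RFC2396_Parser still implements, and checks that this parser rejects the trailing ">}".

```ruby
require 'uri'

# Hypothetical URL: only the "&sortby=17&sts=t" tail comes from the post.
clean = 'http://www.xyz.com/search?an=Asimov&sortby=17&sts=t'
uri = URI.parse(clean)

# inspect wraps the URL in "#<...>" for debugging; to_s gives the bare URL.
inspected = uri.inspect
bare = uri.to_s
puts inspected
puts bare

# The same URL with the stray ">}" left on the end, as in the pasted text.
pasted = clean + '>}'

# Assumption: older Rubies parsed URLs with RFC 2396 rules, under which
# ">" and "}" are not legal URI characters.
strict = URI::RFC2396_Parser.new
rejected = begin
  strict.parse(pasted)
  false
rescue URI::InvalidURIError
  true
end
puts "pasted URL rejected: #{rejected}"
```

If this guess is right, asking the Mechanize page for its address as a string (title_pg.uri.to_s) rather than copying the inspect output should give a URL without the trailing characters.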
Now if I start off in Firefox at the search page, enter the same data
into the form and submit it manually, I wind up with the
same screen displayed as when I cut and pasted in the URL from the
mechanize.submit code.
If I now copy the URL from Firefox and use it in the Nokogiri code below,
it works fine, and the "puts node.text" shows that
all 5 of the fields I require are there (plus others not present in the
Mechanize object).
Now the URLs from the 3 steps above differ in only one way: the last
variable (sts) on the URL line.
&sortby=17&sts=t>}
  (from mechanize.submit)
&sortby=17&sts=t%3E}
  (copied from Firefox after the submit URL was used and the webpage
  displayed; changed URL)
&sortby=17&sts=t&x=84&y=10
  (search entered manually; this is the URL upon display of the first
  page)
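To make the comparison concrete, here is a small stdlib-only sketch that decodes the three query-string tails quoted above and lists what differs. The hash labels are made up for illustration. It suggests that the first two tails carry the same sts value ("t>}", with ">" percent-encoded by the browser), while the manual submit has a plain sts=t plus two extra parameters. My assumption, not something confirmed here, is that x and y are the click coordinates a browser adds when the form is submitted through an image button.

```ruby
require 'uri'

# Query-string tails as quoted in the post (leading "&" dropped).
tails = {
  'mechanize.submit' => 'sortby=17&sts=t>}',
  'firefox, pasted'  => 'sortby=17&sts=t%3E}',
  'firefox, manual'  => 'sortby=17&sts=t&x=84&y=10'
}

# Decode each tail into a name => value hash.
params = tails.transform_values { |query| URI.decode_www_form(query).to_h }

params.each do |source, fields|
  puts "#{source}: #{fields.inspect}"
end

# Names present only in the manually submitted form.
extra = params['firefox, manual'].keys - params['mechanize.submit'].keys
puts "extra parameters after manual submit: #{extra.inspect}"
```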
The attached file shows what the source (from web page) for the last
title looks like and what the mechanize content for that same title
looks like.
Can anyone shed light on what is happening? It would be greatly
appreciated.
Thanks, Don
#MECHANIZE CODE
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
url = "…" # url of search form
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}
search_page = a.get(url)
search_form = search_page.form_with(:name => 'form-advancedSearch')
search_form.an = 'Asimov'
search_form.kn = 'science fiction'
# capture submitted url and title_pg contents
title_pg = search_form.submit
title_pg.links.each do |link|
  puts link.text # not all the data is there
end
#NOKOGIRI CODE
require 'open-uri'
require 'nokogiri'
url = "http://www.xyz.com/…"
# URI.open comes from open-uri; on Ruby 3.0+ a bare open(url) no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
doc.xpath('//tr').each do |node|
  puts node.text
end