URL parameter sts - mechanize & nokogiri differences

rustysam · October 9, 2010, 4:11pm

I have written Ruby code (with Mechanize and Nokogiri) to do the
following:

1. Retrieve the search webpage
2. Enter search criteria into the form
3. Submit the form and retrieve the first webpage, which is a list of
   book titles embedded in the page
4. For each title in the retrieved webpage, extract 5 fields
5. Retrieve the next webpage of titles
6. Repeat 4 & 5 until all titles are retrieved

The Mechanize code below works up to the point of submitting the form,
but the first webpage returned is missing at least 2 of the fields for
each title.

Now if I grab the URL generated by mech.submit and use it in Firefox, it
displays all the titles and information normally, BUT the URL is changed
slightly before the titles are displayed.

THIS IS THE URL RETURNED BY MECHANIZE.SUBMIT
#<URI::HTTP:0x17706d8
URL:.xyz Domain Names | Join Generation XYZ>}

Now if I take the URL from the submit and use it in the Nokogiri code
below, it fails to open with BAD URI.
Also, if I take the changed URL from Firefox and use it in the Nokogiri
code below, it also fails to open with BAD URI.
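
One possible reading of the trailing `>}` (an assumption, not something confirmed in this thread) is that it belongs to Ruby's `inspect` output, which wraps the address in `#<...>`, rather than to the URL itself. The sketch below uses a made-up example.com address to show the difference between `inspect` and `to_s` on a URI object.

```ruby
require 'uri'

# Hypothetical URL standing in for the real search-results address.
uri = URI.parse('http://www.example.com/search?sortby=17&sts=t')

inspected = uri.inspect # debugging form, wrapped in "#<...>"
plain     = uri.to_s    # the address itself, suitable for another HTTP client

puts inspected
puts plain
```

If that reading applies here, copying the address from `to_s` instead of from printed debugging output would leave `sts=t` without the extra characters.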

Now if I start off in Firefox at the search page, enter the same data
into the form and submit it manually, I wind up with the same screen
displayed as when I cut and pasted in the URL from the
mechanize.submit code.

If I now copy the URL from Firefox and use it in the Nokogiri code below,
it works fine, and the “puts node.text” shows that all 5 of the fields I
require are there (plus others not present in the Mechanize object).

Now the URLs from the 3 steps above differ in only one way: the last
variable (sts) on the URL line.

&sortby=17&sts=t>}" from mechanize.submit
&sortby=17&sts=t%3E}" copied from Firefox after the submit URL was used
and the webpage displayed (changed URL)
&sortby=17&sts=t&x=84&y=10" after manually entering the search; this is
the URL upon display of the first page
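
To make the comparison concrete, here is a small sketch that decodes the three endings quoted above with Ruby's standard `uri` library. The hash labels are made up for illustration. It shows that `%3E` is simply the percent-encoded form of `>`, so the first two variants carry the same `sts` value. The extra `x` and `y` parameters in the third look like the click coordinates a browser sends for an image submit button, but that is an assumption about this particular form.

```ruby
require 'uri'

# The three query-string endings quoted above, minus the leading "&"
# and the trailing quote character.
endings = {
  'mechanize submit' => 'sortby=17&sts=t>}',
  'firefox, pasted'  => 'sortby=17&sts=t%3E}',
  'firefox, manual'  => 'sortby=17&sts=t&x=84&y=10'
}

# Decode each ending into a name => value hash.
decoded = endings.transform_values { |query| URI.decode_www_form(query).to_h }

decoded.each { |label, params| puts "#{label}: #{params.inspect}" }
```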

The attached file shows what the source (from web page) for the last
title looks like and what the mechanize content for that same title
looks like.

THE CONTENTS OF BOTH
AND

are missing in the mechanize object

Can anyone shed light on what is happening? It would be greatly
appreciated.
Thanks, Don

#MECHANIZE CODE
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'

url = "…" # url of search form
a = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}
search_page = a.get(url)
search_form = search_page.form_with(:name => 'form-advancedSearch')
search_form.an = 'Asimov'
search_form.kn = 'science fiction'
# capture submitted url and title_pg contents
title_pg = search_form.submit
title_pg.links.each do |link|
  puts link.text # not all the data is there
end

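#NOKOGIRI CODE
require 'open-uri'
require 'nokogiri'
url = "http://www.xyz.com/…"

# URI.open is the open-uri entry point on current Rubies
# (plain open(url) on the Ruby versions of 2010)
doc = Nokogiri::HTML(URI.open(url))
doc.xpath('//tr').each do |node|
  puts node.text
end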

rustysam · October 10, 2010, 1:04am

I still have not resolved (or do not understand) my problem, but the
following is a workaround that allows me to continue with development:

title_pg = search_form.submit # get first title page - last line of orig code

# initialize a Nokogiri::HTML object with 'title_pg.body', the returned web page
doc = Nokogiri::HTML(title_pg.body)

# can now use Nokogiri to process the title page HTML
doc.xpath('//tr').each do |node|
  puts node.text
end

This prints out the fields that are missing in the Mechanize object.
I am not sure whether this really is a problem, or whether I simply do
not understand the Mechanize object properly and the data is there but
requires a different selector.