WWW::Mechanize with frames

Hi,

I’m trying to do some screen scraping from a site using frames. Using
WWW::Mechanize gives back an ‘error’ page from the site rather than the
data I wanted:

BRENDA: Entry of glyceraldehyde-3-phosphate dehydrogenase (phosphorylating)(EC-Number 1.2.1.12 )

Sorry, but your browser doesn't support frames. Please use another browser!

Does anyone have any experience of screen scraping from sites that use
frames? Am I missing something obvious?

Alex G.

AlexG wrote:

Hi,

I’m trying to do some screen scraping from a site using frames. Using
WWW::Mechanize gives back an ‘error’ page from the site rather than the
data I wanted:

This is the content of the frame page. It, in turn, fetches other pages
and loads them into its frames. Browsers that do not support frames see
the content in the noframes element.

If you want to snarf a framed page, you’ll need to treat each framed
item as the separate HTML page that it is.

Here it appears to be the pages flat_navigation.php4?ecno=1.2.1.12 ,
flat_head.php4?ecno=1.2.1.12&organism= and
flat_result.php4?ecno=1.2.1.12&organism%5B%5D= .

You’ll need to supply the complete URL of course.

I do not think that Mechanize handles frames by default, but you could
teach it to grab the frame elements and parse the src attribute, then
construct the full URL.
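
A minimal sketch of that idea in plain Ruby, under some loud assumptions: `frame_urls` is a hypothetical helper, not part of Mechanize; the sample markup is invented for illustration (only the frame name "flat" and the src values come from this thread); the `http://` scheme is assumed; and a regex is a rough stand-in for a real HTML parser.

```ruby
require "uri"

# Hypothetical helper: return the absolute URL of every <frame>/<iframe>
# found in the markup, resolved against the URL of the frameset page.
def frame_urls(html, base_url)
  html.scan(/<i?frame\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i).map do |(src)|
    # Undo the most common HTML escaping before resolving the reference.
    URI.join(base_url, src.gsub("&amp;", "&")).to_s
  end
end

# Invented sample markup; the real frameset page may look different.
html = <<~HTML
  <frameset rows="20%,80%">
    <frame name="navigation" src="flat_navigation.php4?ecno=1.2.1.12">
    <frame name="flat" src="flat_result.php4?ecno=1.2.1.12&organism%5B%5D=">
  </frameset>
HTML

base = "http://www.brenda.uni-koeln.de/php/result_flat.php4?ecno=1.2.1.12"
puts frame_urls(html, base)
```

Each URL that comes back can then be fetched as an ordinary page.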

James

http://www.ruby-doc.org - Ruby Help & Documentation
Ruby Code & Style - Ruby Code & Style: Writers wanted
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys
http://www.30secondrule.com - Building Better Tools

On Wed, 2005-11-30 at 11:14 +0900, James B. wrote:

I do not think that Mechanize handles frames by default, but you could
teach it to grab the frame elements and parse the src attribute, then
construct the full URL.

James

Having done a bit of this in the past, there may be quite a bit more to
it. You’ll notice that the original page claims that your client
doesn’t support frames. Often, it’ll base that decision off what
user-agent header you’re sending in the request.
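
If the server does look at the user-agent, one way to test that guess with Net::HTTP might look like the sketch below. The helper names and the user-agent string are illustrative assumptions, and nothing in this thread confirms that the site actually checks the header.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: build a GET request that sends a chosen User-Agent.
def browser_like_get(url, user_agent)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  [uri, request]
end

# Sends the request. It needs network access, so it is only defined here,
# not called.
def fetch_with_user_agent(url, user_agent)
  uri, request = browser_like_get(url, user_agent)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Example value only; substitute whatever string you want to try.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/0.1)"
```

Comparing the body returned with and without the header would show whether the server treats the two clients differently.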

Yeah, that sucks. It’s often guess and check with web pages these days.
There are still a lot of sites back in the “Are you Navigator 4?” era.
Very nice. You’ll see JavaScript that makes you want to stop whatever
you’re working on.

Bring on web services.

I don’t have the relevant codebase handy. Also, Mechanize looks great,
but Net::HTTP has still worked better in practice for me: it handles
edge cases like these with slightly more panache.

Cheers,

Jim

Thanks for the reply. I tried to follow the links to the separate
frames but I end up back where I started, so I’m thinking there is
something more complicated going on at the server end, perhaps to do
with the PHPSESSID? I’m using Net::HTTP rather than WWW::Mechanize to
try to see what’s going on more clearly. I first ask for this page:

www.brenda.uni-koeln.de/php/result_flat.php4?ecno=1.1.1.1

If I load this in the browser it returns a page with three frames, one
of which contains the info I want. In the script it returns this HTML:

BRENDA: Entry of alcohol dehydrogenase(EC-Number 1.1.1.1 )

Sorry, but your browser doesn't support frames. Please use another browser!

</noframes>

The frame with name=“flat” is the one I want so I next make a query
using the same src as given for that frame:
‘/php/flat_result.php4?ecno=1.1.1.1&organism%5B%5D=&PHPSESSID=9643cd58a9774b07c768c97eeb4ef257’
in this case. I.e. the same as before but adding the PHPSESSID and
organism data.

Unfortunately the HTML returned from that query is identical to the
first so I’m back where I started. Interestingly, if I try to open the
single frame I want as a new tab in my browser it takes me back to the
three frame version (rather than that one frame as a single page as I
would expect). So it seems like the site is designed to make viewing
individual frames difficult.

The source for the frame I want includes this Javascript at the top:

I don’t know enough Javascript to know if this reveals anything, but it
seemed like it might mess with the way the browser loads the frame in
some way.

Any help greatly appreciated.

Jim Van F. wrote:

Having done a bit of this in the past, there may be quite a bit more to
it. You’ll notice that the original page claims that your client
doesn’t support frames. Often, it’ll base that decision off what
user-agent header you’re sending in the request.

Doubtful. I suppose that the content might have been dynamically
generated on the server, but my experience with creating frames pages is
that it is pretty standard to always include a noframes section for
browsers that do not know what to do with frame elements; the noframes
section is simply rendered by default by older browsers, while
frames-enabled browsers know to ignore it.

http://www.w3.org/TR/REC-html40/present/frames.html#h-16.4.1

Fetch the frames page in Firefox and do View Source and look at the
markup.

For example, this is a frames page:

http://www.adobe.com/svg/viewer/install/main.html

It renders fine in FF, which clearly knows how to handle frames, but
you’ll notice that the source HTML looks like this:

Web Center Features - SVG - Manual Download

Your Web browser does not support frames. Please click here.

James

AlexG wrote:

seemed like it might mess with the way the browser loads the frame in
some way.

Your browser is executing that code and reloading the fully-framed set.
Disable JavaScript in your browser and see what happens.

I was able to fetch the single ‘flat’ page using wget, and Mechanize or
open-uri should be able to do so also, since they will not be executing
any JavaScript.

James

Thanks James,

I finally worked out what was going wrong. The original page returned
is ‘result_flat.php4’ while the frame I wanted is ‘flat_result.php4’. I
hadn’t noticed the swap when I was writing the script, but it’s working
now. Sorry for wasting your time.
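
In case it helps anyone else, here is a rough sketch of how the two URLs could be kept apart, so the similar names are never typed by hand. The helper and constant names are made up for illustration, the `http://` scheme is assumed, and the PHPSESSID parameter is left out on the assumption that the frame can be fetched without it.

```ruby
require "uri"

# Host and path are taken from this thread; the scheme is an assumption.
BRENDA_BASE = "http://www.brenda.uni-koeln.de/php/"

# Hypothetical helper: URL of the outer frameset page (result_flat).
def frameset_url(ec_number)
  query = URI.encode_www_form("ecno" => ec_number)
  URI.join(BRENDA_BASE, "result_flat.php4?" + query).to_s
end

# Hypothetical helper: URL of the frame holding the data (flat_result).
def flat_frame_url(ec_number)
  query = URI.encode_www_form("ecno" => ec_number, "organism[]" => "")
  URI.join(BRENDA_BASE, "flat_result.php4?" + query).to_s
end

puts frameset_url("1.1.1.1")
puts flat_frame_url("1.1.1.1")
```
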

Alex