Forum: Ruby WWW::Mechanize with frames

alexg (Guest)
on 2005-11-30 02:16
(Received via mailing list)
Hi,

I'm trying to do some screen scraping from a site using frames. Using
WWW::Mechanize gives back an 'error' page from the site rather than the
data I wanted:

<html>
<head>
<title>BRENDA: Entry of glyceraldehyde-3-phosphate dehydrogenase
(phosphorylating)(EC-Number 1.2.1.12 )</title>
</head>
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Frameset//EN\"
\"http://www.w3.org/TR/html4/frameset.dtd\">
<frameset cols="190,*" border="0">
<frame name="navigation" src="flat_navigation.php4?ecno=1.2.1.12"
frameborder="no">
<frameset rows="110,*" border="0">
<frame name="header" src="flat_head.php4?ecno=1.2.1.12&organism="
frameborder="no">
<frame name="flat" src="flat_result.php4?ecno=1.2.1.12&organism%5B%5D="
frameborder="no">
</frameset>
</frameset>
<noframes><h2>Sorry, but your browser doesn't support frames. Please
use another browser!</h2>
</noframes>
</html>

Does anyone have any experience of screen scraping from sites that use
frames? Am I missing something obvious?

Alex Gutteridge
james_b (Guest)
on 2005-11-30 03:16
(Received via mailing list)
AlexG wrote:
> Hi,
>
> I'm trying to do some screen scraping from a site using frames. Using
> WWW::Mechanize gives back an 'error' page from the site rather than the
> data I wanted:
>

This is the content of the frame page.  It, in turn, fetches other pages
and loads them into its frames.  Browsers that do not support frames see
the content in the noframes element.

If you want to snarf a framed page, you'll need to treat each framed
item as the separate HTML page that it is.

Here it appears to be the pages flat_navigation.php4?ecno=1.2.1.12 ,
flat_head.php4?ecno=1.2.1.12&organism=  and
flat_result.php4?ecno=1.2.1.12&organism%5B%5D= .

You'll  need to supply the complete URL of course.

I do not think that Mechanize handles frames by default, but you could
teach it to grab the frame elements and parse the src attribute, then
construct the full  URL.
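
A minimal sketch of that idea, using only Ruby's standard library. The
helper name frame_urls is made up for illustration, the regex is a rough
stand-in for a real HTML parser, and the base URL has to be whatever
address the frameset page was actually fetched from:

```ruby
require 'uri'

# Hypothetical helper: collect the src attribute of every <frame> tag
# and resolve it against the URL the frameset page came from.
def frame_urls(html, base_url)
  # \b after "frame" keeps <frameset ...> tags from matching.
  srcs = html.scan(/<frame\b[^>]*?\bsrc\s*=\s*["']([^"']*)["']/i).flatten
  # Strip whitespace in case a long src was wrapped across lines.
  srcs.map { |src| URI.join(base_url, src.gsub(/\s+/, '')).to_s }
end

# Illustrative input only; not fetched from any real server.
sample = '<frameset cols="190,*"><frame name="flat" src="flat.html"></frameset>'
puts frame_urls(sample, 'http://example.com/php/index.html')
```

Each URL it returns can then be requested as an ordinary page.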

James
--

http://www.ruby-doc.org       - Ruby Help & Documentation
http://www.artima.com/rubycs/ - Ruby Code & Style: Writers wanted
http://www.rubystuff.com      - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com     - Playing with Better Toys
http://www.30secondrule.com   - Building Better Tools
jim (Guest)
on 2005-11-30 04:25
(Received via mailing list)
On Wed, 2005-11-30 at 11:14 +0900, James Britt wrote:
> AlexG wrote:
> I do not think that Mechanize handles frames by default, but you could
> teach it to grab the frame elements and parse the src attribute, then
> construct the full  URL.
>
> James

Having done a bit of this in the past, there may be quite a bit more to
it.  You'll notice that the original page claims that your client
doesn't support frames.  Often, it'll base that decision off what
user-agent header you're sending in the request.
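
If a server really does vary its output on the user-agent, one way to
test that is to send a browser-like header yourself. The following is a
rough sketch with Net::HTTP; the helper names and the user-agent string
are placeholders chosen for illustration, not anything this site is
known to require:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request carrying a custom User-Agent.
def build_request(url, user_agent)
  req = Net::HTTP::Get.new(URI(url))
  req['User-Agent'] = user_agent
  req
end

# Hypothetical helper: send the request and return the response object.
def fetch_with_agent(url, user_agent)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url, user_agent))
  end
end
```

Comparing the body returned with and without the header shows whether
the server cares about it at all.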

Yeah, that sucks.  It's often guess and check with web pages these days.
There are still a lot of sites back in the "Are you Navigator 4?" era.
Very nice.  You'll see JavaScript that makes you want to stop whatever
you're working on.

Bring on web services.

I don't have the relevant codebase handy.  Also, Mechanize looks great,
but Net::HTTP has still worked better in practice for me-- it handles
edge cases like these with slightly more panache.



Cheers,

Jim
james_b (Guest)
on 2005-11-30 05:49
(Received via mailing list)
Jim Van Fleet wrote:
> Having done a bit of this in the past, there may be quite a bit more to
> it.  You'll notice that the original page claims that your client
> doesn't support frames.  Often, it'll base that decision off what
> user-agent header you're sending in the request.

Doubtful.  I suppose that the content might have been dynamically
generated on the server, but my experience with creating frames pages is
that it is pretty standard to always include a noframes section for
browsers that do not know what to do with frame elements; the noframes
section is simply rendered by default by older browsers, while
frames-enabled browsers know to ignore it.

http://www.w3.org/TR/REC-html40/present/frames.html#h-16.4.1

Fetch the frames page in Firefox and do View Source and look at the
markup.

For example, this is a frames page:

http://www.adobe.com/svg/viewer/install/main.html

It renders fine in FF, which clearly knows how to handle frames, but
you'll notice that the source HTML looks like this:

<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>Web Center Features - SVG - Manual Download</TITLE>
<!-- Use of frames in this way is required to work around a bug on   -->
<!-- Netscape's "SmartUpdate" feature on the Mac, where by default   -->
<!-- the pluginspage attribute on the embed tag causes that browser  -->
<!-- to open a non-scrollable non-resizable window that's too small. -->
</HEAD>
<FRAMESET cols="1,*">
	<FRAME SRC="blank.html" NAME="blank" SCROLLING="NO" FRAMEBORDER="NO">
	<FRAME SRC="mainframed.html" NAME="main" SCROLLING="YES"
FRAMEBORDER="NO">
</FRAMESET>
<NOFRAMES>

Your Web browser does not support frames. Please <A
HREF="mainframed.html">click here</A>.
</NOFRAMES>
</HTML>




James

alexg (Guest)
on 2005-12-01 01:49
(Received via mailing list)
Thanks for the reply. I tried to follow the links to the separate
frames but I end up back where I started, so I'm thinking there is
something more complicated going on at the server end, perhaps to do with
the PHPSESSID? I'm using Net::HTTP rather than WWW::Mechanize to try to
see what's going on more clearly. I first ask for this page:

www.brenda.uni-koeln.de/php/result_flat.php4?ecno=1.1.1.1

If I load this in the browser it returns a page with three frames, one
of which contains the info I want. In the script it returns this HTML:

<html>
<head>
<title>
BRENDA: Entry of alcohol dehydrogenase(EC-Number 1.1.1.1 )
</title>
</head>
<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Frameset//EN\"
\"http://www.w3.org/TR/html4/frameset.dtd\">
<frameset cols="190,*" border="0">
<frame name="navigation"
src="flat_navigation.php4?ecno=1.1.1.1&PHPSESSID=9643cd58a9774b07c768c97eeb4ef257"
frameborder="no">
<frameset rows="110,*" border="0">
<frame name="header"
src="flat_head.php4?ecno=1.1.1.1&organism=&PHPSESSID=9643cd58a9774b07c768c97eeb4ef257"
frameborder="no">
<frame name="flat"
src="flat_result.php4?ecno=1.1.1.1&organism%5B%5D=&PHPSESSID=9643cd58a9774b07c768c97eeb4ef257"
frameborder="no">
</frameset>
</frameset>
<noframes><h2>Sorry, but your browser doesn't support frames. Please
use another browser!</h2></noframes>
</html>

The frame with name="flat" is the one I want so I next make a query
using the same src as given for that frame:
'/php/flat_result.php4?ecno=1.1.1.1&organism%5B%5D=&PHPSESSID=9643cd58a9774b07c768c97eeb4ef257'
in this case. I.e. the same as before but adding the PHPSESSID and
organism data.

Unfortunately the HTML returned from that query is identical to the
first so I'm back where I started. Interestingly, if I try to open the
single frame I want as a new tab in my browser it takes me back to the
three frame version (rather than that one frame as a single page as I
would expect). So it seems like the site is designed to make viewing
individual frames difficult.

The source for the frame I want includes this JavaScript at the top:

<script language="JavaScript">
<!--
if (parent.location.href == self.location.href)
    window.location.href = 'result_flat.php4?ecno=1.1.1.1';
//-->
</script>

I don't know enough JavaScript to know if this reveals anything, but it
seemed like it might mess with the way the browser loads the frame in
some way.

Any help greatly appreciated.
james_b (Guest)
on 2005-12-01 02:13
(Received via mailing list)
AlexG wrote:
...

> seemed like it might mess with the way the browser loads the frame in
> some way.

Your browser is executing that code and reloading the fully-framed set.
Disable JavaScript in your browser and see what happens.

I was able to fetch the single 'flat' page using wget, and Mechanize or
open-uri should be able to do so also, since they will not be executing
any JavaScript.
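
To a client that does not run scripts, that block is just text in the
response body. As a rough illustration, a script can even read where the
browser would have been sent. The helper name below is made up, and the
regexes only cover the exact pattern quoted above:

```ruby
# Hypothetical helper: if the page contains the frame-busting check
# quoted above, return the URL it would redirect to; otherwise nil.
def frame_buster_target(html)
  return nil unless html =~ /parent\.location\.href\s*==\s*self\.location\.href/
  html[/window\.location\.href\s*=\s*'([^']*)'/, 1]
end
```

A non-nil result means the frame content was fetched successfully and
only a JavaScript-enabled browser would bounce back to the frameset.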

James

alexg (Guest)
on 2005-12-01 04:10
(Received via mailing list)
Thanks James,

I finally worked out what was going wrong. The original page returned
is 'result_flat.php4' while the frame I wanted is 'flat_result.php4'. I
hadn't noticed the swap when I was writing the script, but it's working
now. Sorry for wasting your time.
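
For anyone hitting the same thing, one way to avoid retyping a
near-identical file name is to pull the URL out of the frameset by frame
name. This is only a sketch: frame_url_by_name is a made-up helper, the
regexes are a rough substitute for a proper HTML parser, and base_url is
assumed to be the address the frameset was requested from:

```ruby
require 'uri'

# Hypothetical helper: find the <frame> with the given name and return
# its src resolved against base_url, or nil if there is no such frame.
def frame_url_by_name(html, name, base_url)
  # \b after "frame" skips <frameset ...> tags.
  html.scan(/<frame\b[^>]*>/i).each do |tag|
    next unless tag[/\bname\s*=\s*["']([^"']*)["']/i, 1] == name
    src = tag[/\bsrc\s*=\s*["']([^"']*)["']/i, 1]
    # Remove whitespace in case the attribute was wrapped across lines.
    return URI.join(base_url, src.gsub(/\s+/, '')).to_s if src
  end
  nil
end
```

Calling it with the name 'flat' and the URL of the frameset page gives
the address to request next, taken straight from the server's own markup.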

Alex