Scrape javascript content

I’m trying to scrape a page that hides some data behind a javascript
function. Is there any way to get this data? I’ve been using
Mechanize, but I’m not sure it can do this. Is there a better library
to use for this type of thing?

The following is the interesting part of the page:

 

On Thu, May 20, 2010 at 12:48 AM, Phil Mcdonnell wrote:


You might check out Harmony:

http://www.rubyinside.com/harmony-javascript-and-a-dom-environment-in-ruby-3001.html
http://rubygems.org/gems/harmony
http://github.com/mynyml/harmony

The other trick here is that this page is behind a login. Mechanize
allows me to fill out the login form and holds onto the login
credentials for me. Can harmony/celerity/watir do this?

The really interesting part is what the JavaScript actually does. With
(potentially large) effort you may be able to “reverse-engineer” the
JavaScript and emulate it manually in Mechanize. That is, if the JavaScript
builds a simple HTTP request, you may be able to send the same request
from Mechanize without much effort.
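
A minimal sketch of that idea, assuming the script only issues a plain GET request once you are logged in. The URLs, the form field names, and the `id` parameter are hypothetical placeholders; the real values would come from watching the request in a tool such as Firebug's Net panel. The Mechanize calls (`get`, `forms`, `submit`) are the standard ones, but this has not been run against the page in question.

```ruby
require 'uri'

# Hypothetical endpoints: replace with what the browser actually requests.
LOGIN_URL = 'https://example.com/login'
DATA_URL  = 'https://example.com/statement/data'

# Build the URL the script would request from a base URL and a hash of
# query parameters. The parameter names are placeholders.
def data_url(base, params)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

# Log in with Mechanize, then fetch the same URL the JavaScript would.
# Mechanize keeps the session cookies from the login and sends them along.
def fetch_hidden_data(user, pass)
  require 'mechanize' # loaded lazily so data_url works without the gem
  agent = Mechanize.new
  form = agent.get(LOGIN_URL).forms.first # assumes the first form is the login form
  form['username'] = user                 # hypothetical field names
  form['password'] = pass
  agent.submit(form)
  agent.get(data_url(DATA_URL, { 'id' => 'iroc_0' })).body
end
```

If the script sends a POST instead, `agent.post(url, params)` would take the place of the final `get`.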

How would one do this? I’m somewhat new to javascript as I usually
don’t do front end engineering. I see the below definition of this
function in the HTML page. Any way I can sniff out what it’s actually
doing? I’m looking to figure out what the fireClick method displays.

<script type="text/javascript">
  var d = document.domain.split(".");
  document.domain = d[d.length - 2] + "." + d[d.length - 1];
  var start = (new Date()).getTime();
  var fireClick = function(){};
  var omn_hierarchy="US|AMEX|Ser|eStatement";
  var omn_pagename="MainPage";
  var omn_language="en";
  var omn_newpagename="yes";
</script>

… way down below…
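
Since the snippet only assigns an empty function to `fireClick`, the real behaviour is presumably assigned somewhere else, either further down the page or in an external script file. One rough way to find out is to search the saved page source for every mention of the name and to list the external scripts worth opening. The sketch below uses plain Ruby string matching; the sample HTML and the `/js/statement.js` path are made up for illustration.

```ruby
# Return [line_number, text] for each line of the source that mentions
# the given name. Line numbers are 1-based.
def find_mentions(html, name = 'fireClick')
  html.each_line.with_index(1)
      .select { |line, _number| line.include?(name) }
      .map { |line, number| [number, line.strip] }
end

# Return the src of every external <script> tag, since the real
# definition may live in a separate file.
def external_scripts(html)
  html.scan(/<script[^>]*\bsrc=["']([^"']+)["']/i).flatten
end

# Made-up sample standing in for the saved page source.
SAMPLE = <<~HTML
  <script type="text/javascript">
    var fireClick = function(){};
  </script>
  <script src="/js/statement.js"></script>
HTML

p find_mentions(SAMPLE)    # lines that mention fireClick
p external_scripts(SAMPLE) # script files to inspect next
```

With Mechanize, `agent.get(url).body` would supply the `html` string for a live page.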

 

On Thu, May 20, 2010 at 1:48 AM, Phil Mcdonnell wrote:

I’m trying to scrape a page that hides some data behind a javascript
function. Is there any way to get this data? I’ve been using
Mechanize, but I’m not sure it can do this. Is there a better library
to use for this type of thing?

http://celerity.rubyforge.org/
http://watir.com/


On Fri, May 21, 2010 at 1:14 AM, Phil Mcdonnell wrote:

The other trick here is that this page is behind a login. Mechanize
allows me to fill out the login form and holds onto the login
credentials for me. Can harmony/celerity/watir do this?

Watir definitely does that since it simply controls your browser and
therefore behaves exactly like one.

Mechanize cannot execute javascript but watir/celerity can. (I’ve never
used harmony)

# in watir (could also use firewatir and/or the safari equivalent)
require 'watir'
require 'watir/ie'

# should work identically with celerity
# require 'celerity'
# @browser = Celerity::IE.new

@login_page = 'http://example.com/'
# @user and @pass must be set to your credentials before this point

@browser = Watir::IE.new
@browser.goto @login_page
@browser.text_field(:name, 'username').set(@user)
@browser.text_field(:name, 'password').set(@pass)
@browser.button(:value, "LogIn").click

# go to page where the javascript link is
@browser.link(:text, "Link Name").click

# click it
# this assumes the fireClick event is 'just' an ajax call which returns content
@browser.link(:id, "iroc_0").click
@browser.wait # wait for ajax to return

# show page's displaying text (not view source)
puts @browser.text

# if above fires a pop up window more code is needed to retrieve the content

Mechanize cannot execute javascript but watir/celerity can. (I’ve never
used harmony)

Harmony uses envjs to execute JavaScript. There’s also capybara which
can either use a browser or envjs.

This is extremely helpful!

With Watir I’m running into a problem finding the image button for login
on the following page:
https://online.americanexpress.com/myca/logon/us/action?request_type=LogonHandler&Face=en_US&DestPage=https%3A%2F%2Fwww99.americanexpress.com%2Fmyca%2Facctsumm%2Fus%2Faction%3Frequest_type%3Dauthreg_acctAccountSummary%26us_nu%3Dlogincontrol

It looks like the login button is just a clickable image and I should be
able to find it via:
browser.button(:alt, "Login").click

Any idea why that doesn’t find the button?

On May 24, 10:32 am, [email protected] wrote:

Any idea why that doesn’t find the button?

Sorry, don’t have time to look at the page right now, but if it “is
just a clickable image” and not an actual “button” watir’s button
helper may not find it (even though it looks like a button) so try
browser.image().click?

To click on this with Watir, you can use:

@browser.button(:src, 'https://online.americanexpress.com/myca/logon/us/shared/images/btn_login.gif').click

This was captured using the Webmetrics script recorder:
http://www.webmetrics.com/products/script_recorder.html
It has a Watir-compatible mode. You won’t get a working
script out of it, but it is good for identifying objects.

Firebug’s Inspect Element is another nice helper tool for identifying
page objects such as this one.

Good luck,
Darryl

Darryl! You just made my day! This does work. I’ve been banging my
head on the wall for a while here. I had tried matching on the src
attribute too, but not with the full path (only the relative path in the
html).

Thank you!
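
A possible explanation, offered as an assumption rather than something confirmed in this thread: the browser reports the resolved absolute URL for src, so a relative path copied from the HTML never matches exactly. Watir also accepts a regular expression as the value to match, which avoids depending on the full path. A small sketch, where `click_login` is a hypothetical helper and `browser` is assumed to be a Watir browser already on the login page:

```ruby
# Matches the login image whether src is reported as a relative or an
# absolute URL. The file name comes from the working example in this thread.
LOGIN_BUTTON_SRC = /btn_login\.gif/

# Click the image button whose src matches the pattern.
# `browser` is expected to respond to Watir's button(how, what) locator.
def click_login(browser)
  browser.button(:src, LOGIN_BUTTON_SRC).click
end
```

Usage would then be `click_login(@browser)` after `@browser.goto` has loaded the login page.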
