jruby/htmlUnit sample code

bruce · January 13, 2009, 6:02am

Hi.

Trying to find out if anyone has a sample jruby/HtmlUnit app that scrapes
some data from an ajax webpage. I’d like to take a look at a live app/script
to get a feel for how this kind of app works, what’s involved in setting it
up, etc…

I haven’t been able to find a good sample on the 'net as of yet…

Thanks

-bruce

To unsubscribe from this list, please visit:

http://xircles.codehaus.org/manage_email

bruce · January 13, 2009, 7:38am


Could you give an example of what you mean by “scraping … an ajax
webpage”?

HtmlUnit is parsing the code that is sent to the browser. For Ajax
stuff you are depending on Rhino to interpret your javascript and
handle the XHRs the same way a browser would. This can work, but it
can also be messy. Rhino is well maintained and constantly improving.
Still, it isn’t “scraping” anything. It is emulating a browser, but no
actual browser is involved and there is no screen to scrape. If what
you are looking for is browser automation that allows you to actually
see what is happening you will have to use something like Selenium or
Watir.
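For readers who want a feel for what that browser emulation looks like from JRuby, here is a minimal sketch. It is not from anyone on this thread. It assumes JRuby with the HtmlUnit 2.x jars on the classpath (package `com.gargoylesoftware.htmlunit`), and the method name `htmlunit_texts`, the default wait time, and the example arguments are made up for illustration.

```ruby
# Runs under JRuby only, with the HtmlUnit 2.x jars on the classpath.
# htmlunit_texts is a hypothetical helper name, not part of HtmlUnit.
def htmlunit_texts(url, xpath, wait_ms = 10_000)
  require 'java'
  client = Java::ComGargoylesoftwareHtmlunit::WebClient.new
  page = client.getPage(url)                   # load the page; scripts run via Rhino
  client.waitForBackgroundJavaScript(wait_ms)  # give pending XHRs time to finish
  # Collect the text of every node the XPath expression selects.
  page.getByXPath(xpath).map { |node| node.getTextContent.strip }
end

# Example call (placeholder URL and XPath):
#   puts htmlunit_texts('http://example.com/', '//div[@id="content"]')
```

Whether this gets the data depends on how well Rhino copes with the page's scripts, which is the messy part mentioned above. Cleanup of the client is left out to keep the sketch short.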

If you are looking for examples of HtmlUnit, there are plenty of them
out there, mostly in Java with JUnit. I am not aware of specific
examples in JRuby, and I don’t have any myself. Generally, the Ruby
test tools are as good as or better than the Java ones. So, once you are
using JRuby there is no reason not to use them.


bruce · January 13, 2009, 8:28am

Hi Adam.

Thanks for the reply. By screen scraping, I mean the ability to write an
app that pulls data from a site, based on the XPath/DOM that I specify.
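As a rough illustration of XPath-driven extraction, here is a small sketch using REXML from Ruby's standard distribution. The markup, the `extract_texts` helper, and the department names are invented for the example. REXML only handles well-formed XML, so real-world HTML would likely need a proper HTML parser instead.

```ruby
require 'rexml/document'

# Return the stripped text of every element matching +xpath+ in
# well-formed markup. extract_texts is a hypothetical helper name.
def extract_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |node| node.text.to_s.strip }
end

# Invented sample markup standing in for an already-rendered page.
SAMPLE = <<~XML
  <html><body>
    <div id="departments">
      <span class="dept">Anthropology</span>
      <span class="dept">Biology</span>
    </div>
  </body></html>
XML

puts extract_texts(SAMPLE, '//div[@id="departments"]/span')
```

This only works on markup that already contains the data. On an ajax page the "div" is filled in by script after the page loads, which is why a plain fetch-and-parse is not enough there.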

In the case of using HtmlUnit/jruby, I’m under the impression, based on
various sites, that I can somehow create an app/script that allows me to
pull data as well. The information that I’ve seen appears to imply that
HtmlUnit is based on Rhino and has the ability to act as a headless
browser, allowing an app to more or less parse a page that uses
javascript/ajax. This would allow me to then extract the resulting data
based on the ajax function being run that drives the page content.

As an example, real world pages that I’m looking at are:
-USC Schedule of Classes
-302 Found

Each of these pages uses javascript/ajax to dynamically generate content
for the page in the given “div” sections… My goal is to be able to extract
this data…

thanks

-bruce

bruce · January 13, 2009, 4:36pm

If I were you I would just cut out the middle man (the browser) and just
make the request directly to the URL that the browser would. From Firebug I
gather that the Ajax request goes against:
http://web-app.usc.edu/ws/soc/api/departments/20083

If you just call that URL from a Ruby program you will get your JSON data
back directly. No need for screen scraping or HtmlUnit.

Joe
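
A minimal sketch of that direct-request approach, using only Ruby's standard library. The `fetch_json` helper and its injectable `fetcher` argument are assumptions made for illustration (the lambda lets the parsing step be tried without network access). The shape of the JSON that the endpoint returns is not shown in this thread, so nothing here assumes it.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Fetch a URL and parse the response body as JSON.
# fetch_json is a hypothetical helper; by default it does a plain HTTP GET.
def fetch_json(url, fetcher = ->(u) { Net::HTTP.get(URI(u)) })
  JSON.parse(fetcher.call(url))
end

# Usage against the endpoint mentioned above (it may no longer be live):
#   data = fetch_json('http://web-app.usc.edu/ws/soc/api/departments/20083')
#   puts data.inspect
```

Printing `data.inspect` first is a quick way to learn the structure before writing code that digs into specific keys.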