Forum: JRuby jruby/HtmlUnit sample code

bruce (Guest)
on 2009-01-13 06:02
(Received via mailing list)
Hi.

Trying to find out if anyone has a sample jruby/HtmlUnit app that scrapes
some data from an ajax webpage. I'd like to take a look at a live app/script
to get a feel for how this kind of app works, what's involved in setting it
up, etc...

I haven't been able to find a good sample on the 'net as of yet...

Thanks

-bruce


---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email
Adam Sroka (Guest)
on 2009-01-13 07:38
(Received via mailing list)
On Mon, Jan 12, 2009 at 9:01 PM, bruce <bedouglas@earthlink.net> wrote:
> Hi.
>
> Trying to find out if anyone has a sample jruby/HtmlUnit app that scrapes
> some data from an ajax webpage. I'd like to take a look at a live app/script
> to get a feel for how this kind of app works, what's involved in setting it
> up, etc...
>
> I haven't been able to find a good sample on the 'net as of yet...
>

Could you give an example of what you mean by "scraping ... an ajax
webpage?"

HtmlUnit parses the code that is sent to the browser. For Ajax
stuff you are depending on Rhino to interpret your JavaScript and
handle the XHRs the same way a browser would. This can work, but it
can also be messy. Rhino is well maintained and constantly improving.
Still, it isn't "scraping" anything. It is emulating a browser, but no
actual browser is involved and there is no screen to scrape. If what
you are looking for is browser automation that allows you to actually
see what is happening, you will have to use something like Selenium or
Watir.

If you are looking for examples of HtmlUnit, there are plenty of them
out there, mostly in Java and JUnit. I am not aware of specific
examples in JRuby, and I don't have any myself. Generally, the Ruby
test tools are as good as or better than the Java ones. So, once you are
using JRuby there is no reason not to use them.

bruce (Guest)
on 2009-01-13 08:28
(Received via mailing list)
Hi Adam.

Thanks for the reply. By screen scraping, I mean the ability to write an
app that pulls data from a site, based on the XPath/DOM that I specify.

In the case of HtmlUnit/jruby, I'm under the impression, based on
various sites, that I can create an app/script that allows me to pull
data this way as well. The information that I've seen implies that
HtmlUnit is based on Rhino and can act as a headless browser, allowing
an app to more or less parse a page that uses javascript/ajax. This
would allow me to then extract the resulting data produced by the ajax
function that drives the page content.

As an example, real world pages that I'm looking at are:
 -http://web-app.usc.edu/soc
 -http://web-app.usc.edu/soc/term_20083.html

Each of these pages uses javascript/ajax to dynamically generate content
for the page in the given "div" sections... My goal is to be able to
extract this data...
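
For reference, the kind of script described above could look roughly like the sketch below. It is only a sketch under assumptions: it has to run under JRuby with the HtmlUnit 2.x jars already on the classpath (3.x moved the classes to the org.htmlunit package), the helper names load_page and extract_texts are made up for illustration, and the 10 second wait is an arbitrary guess at how long the page's scripts need.

```ruby
# Sketch for JRuby; assumes the HtmlUnit 2.x jars are on the classpath.
# load_page and extract_texts are illustrative names, not part of HtmlUnit.

# Load a page in HtmlUnit's headless browser and give background
# JavaScript (including XHRs) up to wait_ms milliseconds to finish.
def load_page(url, wait_ms = 10_000)
  require 'java' # only available under JRuby
  client = Java::ComGargoylesoftwareHtmlunit::WebClient.new
  page = client.getPage(url)
  client.waitForBackgroundJavaScript(wait_ms)
  page
end

# Return the trimmed text content of every node matching the XPath.
# Expects the XPath to select nodes (elements), not plain values.
def extract_texts(page, xpath)
  page.getByXPath(xpath).map { |node| node.getTextContent.strip }
end

# Example call (the '//div' XPath is a placeholder, not taken from the page):
#   page = load_page('http://web-app.usc.edu/soc/term_20083.html')
#   puts extract_texts(page, '//div')
```

Keeping the XPath extraction separate from the page load means the extraction step can be tried against any object that answers getByXPath, without starting HtmlUnit.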

thanks

-bruce
bedouglas@earthlink.net
925-249-1844
Joseph Athman (Guest)
on 2009-01-13 16:36
(Received via mailing list)
If I were you I would just cut out the middle man (the browser) and
make the request directly to the URL that the browser would. From Firebug
I gather that the Ajax request goes to:
http://web-app.usc.edu/ws/soc/api/departments/20083

If you just call that URL from a Ruby program you will get your JSON data
back directly. No need for screen scraping or HtmlUnit.
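
A minimal sketch of that direct request, using only Ruby's standard library. The helper names are made up for illustration, and treating the last path segment (20083) as a term code that can be swapped out is an assumption based on the two URLs in this thread, not something the site documents. The shape of the returned JSON is not known here, so the sketch just parses it and hands it back.

```ruby
require 'net/http'
require 'json'
require 'uri'

# Build the URI for the departments endpoint seen in Firebug.
# Assumption: the last path segment is the term code (e.g. 20083).
def departments_uri(term, base = 'http://web-app.usc.edu/ws/soc/api/departments')
  URI.parse("#{base}/#{term}")
end

# Fetch the endpoint and parse the body as JSON.
# Raises if the server does not answer with a 2xx status.
def fetch_departments(term)
  response = Net::HTTP.get_response(departments_uri(term))
  raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  JSON.parse(response.body)
end

# Example call:
#   data = fetch_departments(20083)
#   puts data.class
```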

Joe