Trying to find out if anyone has a sample jruby/HtmlUnit app that scrapes
some data from an ajax webpage. I’d like to take a look at a live app/script
to get a feel for how this kind of app works, what’s involved in setting it
up, etc…
I haven’t been able to find a good sample on the 'net as of yet…
Could you give an example of what you mean by “scraping … an ajax
webpage?”
HtmlUnit is parsing the code that is sent to the browser. For Ajax
stuff you are depending on Rhino to interpret your javascript and
handle the XHRs the same way a browser would. This can work, but it
can also be messy. Rhino is well maintained and constantly improving.
Still, it isn’t “scraping” anything. It is emulating a browser, but no
actual browser is involved and there is no screen to scrape. If what
you are looking for is browser automation that allows you to actually
see what is happening you will have to use something like Selenium or
Watir.
If you are looking for examples of HtmlUnit, there are plenty of them
out there, mostly in Java and JUnit. I am not aware of specific
examples in JRuby, and I don’t have any myself. Generally, the Ruby
test tools are as good as or better than the Java ones, so once you
are using JRuby there is no reason not to use them.
Thanks for the reply. By screen scraping, I mean the ability to write an
app that pulls data out of a site, based on the XPath/DOM that I specify.
In the case of using HtmlUnit/JRuby, I’m under the impression, based on
various sites, that I can somehow create an app/script that allows me to
pull off data as well. The information that I’ve seen appears to imply that
HtmlUnit is based on Rhino, and has the ability to act as a headless
browser, allowing an app to more or less parse a page that uses
JavaScript/Ajax. This would allow me to then extract the resulting data
based on the Ajax function being run that drives the page content.
Each of these pages uses JavaScript/Ajax to dynamically generate content for
the page in the given “div” sections… My goal is to be able to extract
this data…
If I were you I would cut out the middle man (the browser) and just
make the request directly to the URL that the browser would call. From
Firebug I gather that the Ajax request goes to:
http://web-app.usc.edu/ws/soc/api/departments/20083
If you just call that URL from a Ruby program you will get your JSON
data back directly. No need for screen scraping or HtmlUnit.
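A minimal sketch of that direct request, using only Ruby's standard
library (net/http and json). The helper names fetch_json and
parse_json_body and the constant DEPARTMENTS_URL are made up for
illustration, the URL may no longer be live, and nothing is assumed
about the shape of the JSON that comes back:

```ruby
require 'net/http'
require 'json'
require 'uri'

# URL taken from the post above; it may no longer be live.
DEPARTMENTS_URL = 'http://web-app.usc.edu/ws/soc/api/departments/20083'

# Parse a JSON response body into plain Ruby hashes and arrays.
def parse_json_body(body)
  JSON.parse(body)
end

# Issue a plain GET, as the browser's XHR would, and return the parsed JSON.
# Raises if the server does not answer with a 2xx status.
def fetch_json(url)
  response = Net::HTTP.get_response(URI.parse(url))
  unless response.is_a?(Net::HTTPSuccess)
    raise "Request failed: #{response.code} #{response.message}"
  end
  parse_json_body(response.body)
end

# Usage (hits the network, so it is left commented out):
#   data = fetch_json(DEPARTMENTS_URL)
#   puts JSON.pretty_generate(data)
```

Once the data is parsed it is ordinary Ruby hashes and arrays, so you can
pick out whatever fields you need without any XPath or DOM handling.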
Joe