Emulating a web browser

I am looking for a library to help me emulate a web browser, at least at the
network level. By this I mean I would like to run a program that, from the
point of view of a web server, behaves just like, say, Firefox, but I don’t
care about actually displaying text or images or anything like that. What I
would like it to do is speak HTTP, store and send cookies, automatically
fetch embedded content like images and style sheets, and so forth.

I thought Mechanize was what I wanted, but it doesn’t fetch embedded
content; it doesn’t even recognize it. I could perhaps tell Nokogiri to find
all the images and have Mechanize fetch them, but I’ve never used Nokogiri
before, I don’t know an exhaustive list of the types of embedded content
Firefox loads automatically (images, JavaScript, Flash, anything else?), and
it seems like getting Mechanize to emulate Firefox’s HTTP requests for these
objects is difficult.
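
In case it clarifies what I mean, here is roughly the kind of thing I have
in mind. It’s untested, and since I haven’t used Nokogiri the calls here may
well be wrong:

require 'rubygems'
require 'mechanize'
require 'nokogiri'

agent = WWW::Mechanize.new
page  = agent.get("http://www.example.com/")

# Find the things a browser would fetch automatically.
doc = Nokogiri::HTML(page.body)
srcs = doc.css("img").map { |img| img["src"] } +
       doc.css("link[rel=stylesheet]").map { |link| link["href"] } +
       doc.css("script[src]").map { |script| script["src"] }

# Fetch each one through the same agent so cookies get reused
# (assuming Mechanize resolves relative URLs against the current page).
srcs.compact.each { |src| agent.get(src) }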

Are there libraries that are meant for this type of interaction with
websites? Perhaps I’m better off abandoning Ruby and making a Firefox
extension.

Thanks,

Adam

On Thu, 2009-04-30 at 16:54 +0900, Adam B. wrote:

Are there libraries that are meant for this type of interaction with
websites? Perhaps I’m better off abandoning Ruby and making a Firefox
extension.

I’m not sure what you want to do, but have you looked at Watir?
http://wtr.rubyforge.org/
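
Watir drives a real browser, so embedded content (images, CSS, JS) and
cookies are handled by the browser itself. A rough sketch, assuming the
Watir::Browser entry point from the newer releases:

require 'rubygems'
require 'watir'

# Watir automates a real browser window, so it fetches embedded content
# and handles cookies exactly the way the browser does.
browser = Watir::Browser.new
browser.goto("http://www.example.com/")
puts browser.title
browser.close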

Also, I’ve found Nokogiri to fail on a few things, most notably
maps.google.com. If Nokogiri is able to work on your site, though, then it
is definitely a lot, lot faster than Hpricot.

Jayanth

Adam B. wrote:

I am looking for a library to help me emulate a web browser, at least at
the network level. [...] What I would like it to do is speak HTTP, store and
send cookies, automatically fetch embedded content like images and style
sheets, and so forth.

For static content, like images, stylesheets, JS files, etc., all you
need is an HTML parser. Hpricot is an HTML parser with good docs (I
can’t find many examples for Nokogiri, but it uses the same syntax as
Hpricot for searching a document):

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://blog.segment7.net/"))

# images:
imgs = doc.search("img")
puts imgs[0][:src]

# stylesheets:
css = doc.search('//link[@type="text/css"]')
puts css[0][:href]

# javascript:
js = doc.search('//script[@type="text/javascript"]')
puts js[0][:src]

--output:--
/images/spinner-blue.gif?1140249801
http://segment7.net/styles/s7.css
/javascripts/cookies.js?1142467953

http://wiki.github.com/why/hpricot

Check out both Hpricot Basics and Hpricot Challenge for lots of
examples.
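
If you also want to download what you find, the way a browser would,
something like this should be close (a rough sketch; relative URLs need to
be resolved against the page URL first):

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'uri'

base = "http://blog.segment7.net/"
doc  = Hpricot(open(base))

doc.search("img").each do |img|
  next unless img[:src]
  # Resolve relative paths like /images/spinner-blue.gif against the page URL.
  url = URI.join(base, img[:src]).to_s
  data = open(url).read
  puts "fetched #{url}: #{data.length} bytes"
end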

I don’t think there are programs yet that can produce the page the user
sees after JavaScript executes in a browser and does dynamic HTML
replacements. I know people are trying to write them.

As for cookies, dealing with them usually goes hand in hand with filling
out forms, so you could use Mechanize for that. Also, Mechanize
incorporates Nokogiri, so you can use Mechanize as an HTML parser to
search for the same things I did with Hpricot:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get("http://blog.segment7.net/")
css = page.search('//link[@type="text/css"]')
puts css[0][:href]

--output:--
http://segment7.net/styles/s7.css
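
And for the cookie/form side, a rough sketch; the form and field names
below are made up, so you’d have to match whatever the real page uses:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page  = agent.get("http://www.example.com/login")

# Fill in and submit a form (field names here are hypothetical).
form = page.forms.first
form["username"] = "adam"
form["password"] = "secret"
page = agent.submit(form)

# Cookies the server sets are stored automatically and sent back on later
# requests; you can also save them between runs.
agent.cookie_jar.save_as("cookies.yml")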

On Thu, 2009-04-30 at 16:54 +0900, Adam B. wrote:

I am looking for a library to help me emulate a web browser, at least at
the network level. [...] What I would like it to do is speak HTTP, store and
send cookies, automatically fetch embedded content like images and style
sheets, and so forth.

Take a look at Celerity.
http://celerity.rubyforge.org/
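
Celerity runs on JRuby and wraps HtmlUnit, a headless Java browser, so it
speaks HTTP, keeps cookies, fetches embedded content, and even runs
JavaScript without opening a window. A minimal sketch (it needs JRuby to
run):

require 'rubygems'
require 'celerity'

# Celerity::Browser wraps HtmlUnit: no GUI, but it behaves like a browser
# on the wire (cookies, embedded content, JavaScript).
browser = Celerity::Browser.new
browser.goto("http://www.example.com/")
puts browser.title
browser.close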

Bret


Bret P.
CTO, WatirCraft LLC, www.watircraft.com
Lead Developer, Watir, www.watir.com

Blog, www.io.com/~wazmo/blog
Twitter, www.twitter.com/bpettichord
GTalk: [email protected]

Ask Me About Watir Training
www.watircraft.com/training