Mechanize for BIG website scraping

I am using Mechanize for several projects that require me to download
large numbers of HTML pages from a web site. Since I am working with
about 1,000 pages, the limitations of Mechanize have started to appear…

Try this code:

################################################
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

prev = 0
curr = 0
prev_pages = 0
curr_pages = 0

1000.times do
  page = agent.get("http://yourfavoritepage.com")
  curr = 0
  curr_pages = 0
  # Count the total number of objects and the number of
  # WWW::Mechanize::Page objects.
  ObjectSpace.each_object { |o|
    curr += 1
    curr_pages += 1 if o.class == WWW::Mechanize::Page
  }
  puts "There are #{curr} (#{curr - prev}) objects"
  puts "There are #{curr_pages} (#{curr_pages - prev_pages}) Page objects"
  prev = curr
  prev_pages = curr_pages
  GC.enable
  GC.start
  sleep 1.0 # This keeps the script from taking 100% CPU
end

############################################

The output of this script reveals that at each iteration a
WWW::Mechanize::Page object gets created (along with a lot of other
objects) and they never get garbage collected. So you can watch your RAM
fly away on each iteration and never come back.

Now this can be solved by putting the agent = WWW::Mechanize.new inside
the block, like:

############################################

1000.times do
  agent = WWW::Mechanize.new   # <-- CHANGE IS HERE
  page = agent.get("http://yourfavoritepage.com")
  curr = 0
  curr_pages = 0
  # Count the total number of objects and the number of
  # WWW::Mechanize::Page objects.

..... the rest is the same

#############################################

With this change we see that the number of WWW::Mechanize::Page objects
never increases beyond three, and the other objects increase and decrease
on the order of 60 per iteration.

Does this mean that the WWW::Mechanize object keeps references to all the
pages downloaded, and that those pages are not going to be garbage
collected as long as the WWW::Mechanize object is alive?

In my script I cannot throw away the WWW::Mechanize object, since this
page in particular is a form and requires cookie state information to be
able to access the pages I need to download. Is there a way to tell the
Mechanize object to delete the pages already downloaded?

regards,
Horacio

On Sep 20, 2006, at 9:24pm, Horacio S. wrote:

In my script I cannot throw away the WWW::Mechanize object, since this
page in particular is a form and requires cookie state information to be
able to access the pages I need to download.

What if you save the cookies out to a file?
WWW::Mechanize::CookieJar has #save_as and #load methods to save and
restore cookies.
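
Roughly like this, off the top of my head (untested sketch; the
cookies.yml file name and the URL are just placeholders):

############################################
require 'rubygems'
require 'mechanize'

COOKIE_FILE = 'cookies.yml' # hypothetical file name

agent = WWW::Mechanize.new
if File.exist?(COOKIE_FILE)
  # Restore the cookie state that a previous agent saved.
  agent.cookie_jar.load(COOKIE_FILE)
end

page = agent.get("http://yourfavoritepage.com")

# Persist the cookies so a brand-new agent can pick up the session later.
agent.cookie_jar.save_as(COOKIE_FILE)
############################################

That way you could create a fresh agent per batch of pages without losing
the session.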

Is there a way to tell the Mechanize Object to delete the pages
alreade downloaded??

I actually ran into a similar issue recently; your diagnosis explains
why my program used too much memory.

You might try the following (assuming "browser" is your
WWW::Mechanize object):

browser.page.content.replace ""   # that's an empty string
browser.page.root.children = []

That should clear both the original text and the parsed HTML. I'm
not sure whether this would get rid of all the references, but at
least it should help.
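
In a loop like yours it might look something like this (just a sketch,
reusing your URL placeholder; I haven't verified that it frees
everything):

############################################
1000.times do
  page = browser.get("http://yourfavoritepage.com")
  # ... extract whatever you need from page here ...

  # Drop the raw HTML text and the parsed document tree so the pages
  # Mechanize keeps hold of take up almost no memory.
  page.content.replace ""
  page.root.children = []
end
############################################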

–John

Thanks for your answer, but I found out how to fix this problem.

A look at the Mechanize code reveals that each page loaded is stored in a
history kept inside the Mechanize object. This means that as long as the
Mechanize object exists, the pages will never go away.

Solution? Simply set the history_max value to something more sensible
than infinite.

############################
agent = WWW::Mechanize.new
agent.history_max = 10
############################

and that's it… no more memory-hungry Mechanize.

I noticed that setting this value to zero causes some problems when
submitting forms, so don't set it to zero. Even 1 seems to work OK.
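
So the downloading loop ends up looking roughly like this (just a sketch;
the URL is a placeholder):

############################
require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
agent.history_max = 10   # keep only the last few pages in the history

1000.times do
  page = agent.get("http://yourfavoritepage.com")
  # ... process the page ...
  sleep 1.0 # be nice to the server and to your CPU
end
############################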

Hope this helps,

Horacio

On Thursday, 21 September 2006 at 14:03, John L. wrote: