Webcrawler that becomes enormous

Hi,
I've written a script in Ruby 1.8 that uses mechanize, nokogiri and
open-uri and runs under Linux.

The script is a web crawler that scans a large site containing a big
amount of data about international firms.

I'm interested in building a database with only some of the data, not the
full data set.
My script runs perfectly and grabs all the data in the correct order, but
after it has been running for 6-8 hours the amount of memory it uses is
enormous (1 GB).

I save the scraping results to a file and empty the data buffer
after every 10 firms collected.

I followed this post to get these results, because before that the script
used this amount of memory after just 4 hours.

Can someone help me reduce this problem and optimize the script?
Is there an IDE that offers efficient debugging for Ruby?
I think there is something I've missed.

Thanks
Luca

On Thu, Nov 17, 2011 at 5:14 PM, Lucas P. wrote:

after it has been running for 6-8 hours the amount of memory it uses is
enormous (1 GB).

I save the scraping results to a file and empty the data buffer
after every 10 firms collected.

It seems either you do not free the memory (and thus have created a
leak yourself) or you suffer from the mentioned bug.
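
Since the script itself was not posted, the following is only a sketch of what "save and empty the buffer" could look like when the memory really is released. The names flush_batch and BATCH_SIZE and the file name are hypothetical, and it assumes each buffered record is a plain String rather than a parsed page or node object, which can keep far more data alive than the one field you wanted.

```ruby
BATCH_SIZE = 10 # hypothetical constant, matching "every 10 firms" above

# Append the buffered records to the file, then empty the buffer in place
# so nothing keeps referencing the already written records.
def flush_batch(buffer, path)
  File.open(path, "a") do |f|
    buffer.each { |record| f.puts(record) }
  end
  buffer.clear
end

# Hypothetical usage inside the crawl loop:
#   buffer << firm_line   # a plain String, not a page or node object
#   flush_batch(buffer, "firms.txt") if buffer.size >= BATCH_SIZE
```

If memory still grows with a loop shaped like this, the references are probably held somewhere other than the buffer.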

I followed this post to get these results, because before that the script
used this amount of memory after just 4 hours.
Ruby Memory Management - Stack Overflow

Can someone help me reduce this problem and optimize the script?
Is there an IDE that offers efficient debugging for Ruby?
I think there is something I've missed.

The first thing I'd do is update Ruby to a more recent 1.9.*
version. That will also be faster and likely has a fix for the
leak mentioned on the Stack Overflow page. If your problem
persists, you need to look into your code.

A simple test would be to write out statistics per class on a regular
basis, e.g.

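# Count live objects per class and print the counts sorted by class name.
# Object is used instead of BasicObject so this also runs on Ruby 1.8.
cnt = Hash.new 0
ObjectSpace.each_object(Object) {|o| cnt[o.class] += 1}
cnt.sort_by {|cl,c| cl.to_s}.each {|cl,c| printf "%10d %s\n", c, cl}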

Then compare counts per class. Of course, you can get a bit more
fancy and calculate deltas etc. But then there are better tools
around, I guess.
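
A rough sketch of the delta idea, in case it helps. The helper names class_counts, count_deltas and report_growth are made up for this example, and the numbers are only a hint at where to look, not a precise memory measurement.

```ruby
# Snapshot of live object counts per class.
def class_counts
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |o| counts[o.class] += 1 }
  counts
end

# Per-class difference between two snapshots, largest growth first.
# Classes whose count did not change are left out.
def count_deltas(before, after)
  deltas = {}
  (before.keys | after.keys).each do |klass|
    diff = after.fetch(klass, 0) - before.fetch(klass, 0)
    deltas[klass] = diff unless diff.zero?
  end
  deltas.sort_by { |klass, diff| -diff }
end

# Print the classes that grew the most between two snapshots.
def report_growth(before, after, limit = 10)
  count_deltas(before, after).first(limit).each do |klass, diff|
    printf "%+10d %s\n", diff, klass
  end
end

# Hypothetical usage around a chunk of crawling:
#   before = class_counts
#   ... crawl some pages ...
#   GC.start   # collect garbage first so mostly reachable objects are counted
#   report_growth(before, class_counts)
```

Classes whose count keeps climbing from one report to the next are the first candidates to check for references that are never dropped.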

Kind regards

robert