Memory leak

Hi all,
I’m writing a web crawler with thread support, and after it has worked
through some number of links the memory usage keeps increasing. When the
program has just started it uses about 20 MB. After crawling 150-200 links,
memory usage is around 100 MB. When 1000 links have been crawled, my program
may use up to 1 GB. Can you please help me find out why?

require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'yaml'
require 'net/http'
require 'uri'
require 'modules/common'

Thread.abort_on_exception = true
$config = YAML.load_file "config.yml"
links = IO.read("bases/3+4.txt").split("\n")
threads = []

links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }

What's in the "Common" module:

  1. A crawler (net/http or mechanize - I tried both, results are the same)
  2. An HTML parser (Hpricot or Nokogiri - I tried both again, with the same
     bad results)

So I extract some data from the page and save it to a file. Nothing special,
as you can see.

When I start this program without threads I get the same results :(

Please help - is this my fault, or is something wrong in the libraries?

Rob D. wrote:


When I start this program without threads I get the same results :(

Please help - is this my fault, or is something wrong in the libraries?

One culprit could be mechanize. Mechanize keeps a history of the pages
you’ve visited, and if you don’t limit that length it will grow
infinitely.

Try setting the max history:

http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000173
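For example, something like this (just a sketch, assuming the max_history=
accessor described on the doc page above):

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
# Keep only the most recent page instead of every page ever visited.
agent.max_history = 1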

One culprit could be mechanize. Mechanize keeps a history of the pages
you’ve visited, and if you don’t limit that length it will grow
infinitely.

Try setting the max history:

http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000173

Thanks, but I've already tried this… the memory leak happens even when I use
"net/http" with Hpricot.

Rob D. wrote:

One culprit could be mechanize. Mechanize keeps a history of the pages
you’ve visited, and if you don’t limit that length it will grow
infinitely.

Try setting the max history:

http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000173

Thanks, but I've already tried this… the memory leak happens even when I use
"net/http" with Hpricot.

You’ll have to share more of your implementation then. You deduced that
the cause wasn’t threads, but you’ve only shared the threaded part of
your code.

Also, have you run your program with valgrind to verify that it is in
fact a memory leak? It could just be objects sitting around waiting for
GC.
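As a rough check (just a sketch; note that MRI rarely hands freed heap pages
back to the OS, so ObjectSpace counts are usually more telling than RSS):

def rss_kb
  # Resident set size of this process in KB, via ps (Linux/Mac).
  `ps -o rss= -p #{Process.pid}`.to_i
end

puts "before GC: #{rss_kb} KB"
GC.start
puts "after GC:  #{rss_kb} KB"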

On Oct 19, 2009, at 18:10 , Rob D. wrote:

    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }

I can’t speak for your supposed “leaks” (pure ruby almost never has
leaks, but it does have many sneaky ways to make valid object
references that you can overlook), since I can’t figure out what your
code does or how it is supposed to work. threads is an array of thread
objects that you only ever push on and you never clear or shift off
of. Also, why oh why oh why are you doing a redo??? Overly complicated
code may not be the source of your leak (and in this case it may very
well be), but it certainly isn’t doing you any favors.

I almost always structure my threads as a pool of workers pulling from
a queue of tasks:
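Something along these lines (a minimal sketch of the pattern, not your exact
code; it reuses the file name and Common.post_it from your post):

require 'thread'

WORKERS = 50
queue = Queue.new

# Each worker blocks on the queue and exits when it pops a nil.
workers = (1..WORKERS).map do
  Thread.new do
    while (link = queue.pop)
      Common.post_it(link)
    end
  end
end

IO.foreach("bases/3+4.txt") { |line| queue << line.chomp }
WORKERS.times { queue << nil }   # one stop marker per worker
workers.each { |t| t.join }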

You’ll have to share more of your implementation then. You deduced that
the cause wasn’t threads, but you’ve only shared the threaded part of
your code.

Also, have you run your program with valgrind to verify that it is in
fact a memory leak? It could just be objects sitting around waiting for
GC.

Here is the rest of the program. I didn't think there could be any leaks
here, which is why I posted the code without this part.

module Common
  def Common.post_it(link)
    begin
      puts link
      agent = Net::HTTP
      html = agent.get(URI.parse(link))
      doc = Hpricot(html)
      forms = doc.search "//form"
      forms.each do |form|
        puts form.search("input").length
      end
    rescue => err
      puts err
      file_write_safe("logs/errorlog.txt", "#{err.to_s}|#{link}\n")
      return false
    rescue Timeout::Error
      file_write_safe("logs/errorlog.txt", "Request Timeout|#{link}\n")
    end
  end

  private

  def self.file_write_safe(file, message)
    File.open(file, 'a') { |f|
      f.flock(File::LOCK_EX)
      f.write(message)
      f.flock(File::LOCK_UN)
    }
  end
end

On 20.10.2009 03:10, Rob D. wrote:

So I extract some data from the page and save it to a file. Nothing special,
as you can see.

When I start this program without threads I get the same results :(

Please help - is this my fault, or is something wrong in the libraries?

As far as I can see you're never cleaning your threads Array. You are just
appending to the Array, so even terminated threads will stay in there,
because you also do not reuse threads.

Design criticism: usually you create threads up front and pass work to them
via a Queue; then you don't need the sleep crutch and can rely on blocking
instead. Also, you can make your code handle large volumes of URLs better by
using a more streamed approach along the lines Ryan presented, by limiting
all resources (not only the threads but also the size of the queue).

require 'thread'

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new q do |qq|
    until qq.equal?(url = qq.deq)
      Common...
    end
  end
end

File.foreach("bases/3+4.txt") do |line|
  line.chomp!
  q.enq(URI.parse(line))
end

threads.size.times { q.enq q }   # enqueue the queue itself as a stop marker

Kind regards

robert

On Oct 20, 2009, at 07:29 , Rob D. wrote:

I almost always structure my threads as a pool of workers pulling from
a queue of tasks:

actually I had this kind of code at first:

yeah. I don’t care. it is still really obfuscated. Switch to a simple
pool of workers pulling tasks out of a queue. Don’t do redo for no
reason. Don’t do crazy stuff.

Ryan D. wrote:

On Oct 19, 2009, at 18:10 , Rob D. wrote:

    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    puts "total threads: " + Thread.list.size.to_s
    redo
  end
end

threads.each { |t| t.join() }

I can’t speak for your supposed “leaks” (pure ruby almost never has
leaks, but it does have many sneaky ways to make valid object
references that you can overlook), since I can’t figure out what your
code does or how it is supposed to work. threads is an array of thread
objects that you only ever push on and you never clear or shift off
of. Also, why oh why oh why are you doing a redo??? Overly complicated
code may not be the source of your leak (and in this case it may very
well be), but it certainly isn’t doing you any favors.

I almost always structure my threads as a pool of workers pulling from
a queue of tasks:

Actually, I had this kind of code at first:

links.each do |link|
  if Thread.list.size < 50 then
    threads << Thread.new(link) { |myLink|
      Common.post_it(myLink, doorway)
    }
  else
    sleep(1)
    threads.each { |t|
      unless t.status then
        t.join
      end
    }
    threads = threads.delete_if { |t| t.status == false }
    GC.start
    redo
  end
end

but it did not help, so I reduced the code.

On 21.10.2009 00:47, Rob D. wrote:


But this does not solve the memory leak :(

Any ideas or suggestions?

You could print out object statistics to get an idea about the source of
the leak. Something along the lines of:

require 'pp'

def print_class_counts
  c = Hash.new 0
  ObjectSpace.each_object(Object) do |o|
    c[o.class] += 1
  end
  pp c
end

Cheers

robert

As far as I can see you're never cleaning your threads Array. You are just
appending to the Array, so even terminated threads will stay in there,
because you also do not reuse threads.

Design criticism: usually you create threads up front and pass work to them
via a Queue; then you don't need the sleep crutch and can rely on blocking
instead. Also, you can make your code handle large volumes of URLs better by
using a more streamed approach along the lines Ryan presented, by limiting
all resources (not only the threads but also the size of the queue).

require 'thread'

THREADS = 50
q = SizedQueue.new(THREADS * 2)

threads = (1..THREADS).map do
  Thread.new q do |qq|
    until qq.equal?(url = qq.deq)
      Common...
    end
  end
end

File.foreach("bases/3+4.txt") do |line|
  line.chomp!
  q.enq(URI.parse(line))
end

threads.size.times { q.enq q }   # enqueue the queue itself as a stop marker

Thank you very much, your solution is great. The code has become much
clearer and looks good. :)

But this does not solve the memory leak :(

Any ideas or suggestions?

You could print out object statistics to get an idea about the source of
the leak. Something along the lines of:

require 'pp'

def print_class_counts
  c = Hash.new 0
  ObjectSpace.each_object(Object) do |o|
    c[o.class] += 1
  end
  pp c
end

Cheers

robert

I retested my script on a 10k+ link database and it seems to stop growing
when it reaches about 550 MB… I'm running 50 threads, so roughly 10 MB per
thread. As I understand it, this is still not good?

Robert, I did what you proposed and here is some data. I printed all classes
that have more than 200 instances.

Here are the objects after 100 URLs:
"Class: 1039"
"Thread: 395"
"Gem::Version::Part: 343"
"Regexp: 1463"
"MatchData: 301"
"Hpricot::Text: 31848"
"Gem::Version: 424"
"Array: 24977"
"Nokogiri::XML::Element: 1636"
"WWW::Mechanize::Chain: 203"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 627"
"String: 199615"
"Hpricot::Elem: 21211"
"Nokogiri::XML::SyntaxError: 1562"
"Hash: 15884"
"Proc: 949"
"WWW::Mechanize::Form::Option: 458"

Here after 400 URLs:
"Class: 890"
"Gem::Version::Part: 343"
"Regexp: 1089"
"MatchData: 264"
"Hpricot::BogusETag: 1573"
"Hpricot::Text: 82593"
"Gem::Version: 424"
"Array: 53259"
"Nokogiri::XML::Element: 2982"
"WWW::Mechanize::Form::Button: 259"
"WWW::Mechanize::Chain: 242"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 1179"
"String: 426729"
"WWW::Mechanize::Form::RadioButton: 351"
"Nokogiri::XML::SyntaxError: 6464"
"Hash: 37025"
"Hpricot::Elem: 55426"
"Proc: 774"
"WWW::Mechanize::Form: 338"
"WWW::Mechanize::Form::Option: 601"

And here after 500 URLs:
"Class: 991"
"Thread: 354"
"Gem::Version::Part: 343"
"Regexp: 847"
"MatchData: 409"
"Hpricot::BogusETag: 915"
"Hpricot::Text: 31707"
"Gem::Version: 424"
"Array: 24143"
"Nokogiri::XML::Element: 2076"
"WWW::Mechanize::Form::Button: 217"
"WWW::Mechanize::Chain: 218"
"Gem::Requirement: 380"
"WWW::Mechanize::Form::Field: 874"
"String: 197836"
"Nokogiri::XML::SyntaxError: 2608"
"Hash: 14719"
"Hpricot::Elem: 21530"
"Proc: 925"
"WWW::Mechanize::Form: 266"
"WWW::Mechanize::Form::Option: 416"

As you can see, there is no steadily increasing count for any particular class.

What more can I do?

Evening Rob,

May I ask why you need threads and that level of complication? Are you
really so sensitive to speed that it matters, compared to simply forking
processes that die after 100 downloads and then starting another worker? It
would seem easier to use a nice simple independent message queue and just
fire up workers that grab messages and download away.

I know it's not the sexy option, but what exactly are you gaining by using
threads here? It's much easier to get a nice single-threaded worker tuned up
and tight than what you appear to be going through here. In fact, I might
argue that you would be better off with a downloader and separate parser
processes that work independently via messages.
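To sketch the forked-worker idea (batch size and worker count are just
placeholders, and it assumes Common.post_it is self-contained per process):

require 'enumerator' # for each_slice on Ruby 1.8

MAX_WORKERS = 4
BATCH_SIZE  = 100
running     = 0

IO.read("bases/3+4.txt").split("\n").each_slice(BATCH_SIZE) do |batch|
  if running >= MAX_WORKERS
    Process.wait   # block until one worker exits
    running -= 1
  end
  fork do
    # Short-lived worker: crawls its batch, exits, and gives all of its
    # memory back to the OS.
    batch.each { |link| Common.post_it(link) }
  end
  running += 1
end
Process.waitall    # reap the remaining workers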

Just a thought…

John

On 10/22/2009 04:52 AM, Rob D. wrote:


As you can see, there is no steadily increasing count for any particular class.

What more can I do?

Your number of threads is suspiciously high compared to the 50 threads that
you intend to run. That may be an indication that your code still suffers
from the issue that you stuff Thread instances into the Array and never
remove them.

Other than that I cannot see anything obvious at first sight. Maybe it makes
more sense to extend the memory dumping to report only positive deltas. An
alternative is to decrease the dumping interval (e.g. dump right after
fetching a single URL).
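Something like this (a sketch building on print_class_counts above):

require 'pp'

$last_counts = Hash.new(0)

def print_class_deltas
  counts = Hash.new(0)
  ObjectSpace.each_object(Object) { |o| counts[o.class] += 1 }

  # Report only classes whose instance count grew since the last snapshot.
  grown = {}
  counts.each do |klass, n|
    grown[klass] = n - $last_counts[klass] if n > $last_counts[klass]
  end
  pp grown

  $last_counts = counts
end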

If you want to throw more on this you could use set_trace_func to
capture invocations of methods named “new” along with the stack trace so
you get the allocation sites.
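A rough sketch of that (it will slow the crawler down considerably, so only
enable it for a short run):

allocation_sites = Hash.new(0)

set_trace_func proc { |event, file, line, id, binding, klass|
  # Class#new is implemented in C, so it shows up as a "c-call" event;
  # file and line point at the Ruby code that made the call.
  if event == "c-call" && id == :new
    allocation_sites["#{file}:#{line}"] += 1
  end
}

# ... crawl some URLs, then:
# set_trace_func(nil)
# allocation_sites.sort_by { |site, n| -n }.first(20).each do |site, n|
#   puts "#{n}\t#{site}"
# end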

Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

Kind regards

robert

John W Higgins wrote:

Evening Rob,

May I ask why you need threads and that level of complication? Are you
really so sensitive to speed that it matters, compared to simply forking
processes that die after 100 downloads and then starting another worker? It
would seem easier to use a nice simple independent message queue and just
fire up workers that grab messages and download away.
I know it's not the sexy option, but what exactly are you gaining by using
threads here? It's much easier to get a nice single-threaded worker tuned up
and tight than what you appear to be going through here.
Your solution is good and simple, but there is one problem… I'll have 1-2
million links in my database to crawl. With only one process it will take
months. Besides, the URLs in my list are sometimes very slow, and the
response time may be up to 10-30 seconds.
I know the best practice would be multiple threads with asynchronous
sockets, but that is too complicated for me right now - maybe in the next
versions.

In fact, I might argue that you would be better off with a downloader and
separate parser processes that work independently via messages.

I'm using mechanize, and as far as I know it parses the page with Nokogiri
right after download, so separating downloading and parsing would make sense
once I start to use net/http or sockets. Unfortunately, right now I have to
use mechanize.

Your number of threads is suspiciously high compared to the 50 threads that
you intend to run. That may be an indication that your code still suffers
from the issue that you stuff Thread instances into the Array and never
remove them.
Yeah, the number of threads confuses me too, but I did as you proposed:

THREADS = 50
q = SizedQueue.new(THREADS * 2)
threads = []
threads = (1..THREADS).map do
  Thread.new q do |qq|
    until qq.equal?(myLink = qq.deq)
      mutex.synchronize do
        puts $n += 1 # show number of link
        if ($n % 100) == 0 then
          print_class_counts # show classes information
        end
      end
      Common2.post_it(myLink)
    end
  end
end

So I just don't understand how the number of threads can be more than 50.

Other than that I cannot see anything obvious at first sight. Maybe it makes
more sense to extend the memory dumping to report only positive deltas. An
alternative is to decrease the dumping interval (e.g. dump right after
fetching a single URL).

If you want to throw more on this you could use set_trace_func to
capture invocations of methods named “new” along with the stack trace so
you get the allocation sites.

Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

ruby 1.8.6 (2008-08-11 patchlevel 287) - it's strange, because I updated it
a week ago :)

On 10/22/2009 05:33 AM, John W Higgins wrote:

that work independently via messages.
With DRb you can even work with those multiple processes as if they were a
single process.
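For instance (just a sketch; the port is arbitrary, and you would add a stop
marker per worker, as with the SizedQueue above, for a clean shutdown):

# queue server - one process that owns the list of URLs
require 'drb/drb'
require 'thread'

queue = Queue.new
File.foreach("bases/3+4.txt") { |line| queue << line.chomp }

DRb.start_service("druby://localhost:9000", queue)
DRb.thread.join

# worker - run as many of these processes (or machines) as you like
require 'drb/drb'

DRb.start_service
queue = DRbObject.new_with_uri("druby://localhost:9000")
while (link = queue.pop)   # blocks on the server when the queue is empty
  Common.post_it(link)
end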

Cheers

robert

From: “Rob D.” [email protected]

Your solution is good and simple, but there is one problem… I'll have 1-2
million links in my database to crawl. With only one process it will take
months. Besides, the URLs in my list are sometimes very slow, and the
response time may be up to 10-30 seconds.
I know the best practice would be multiple threads with asynchronous
sockets, but that is too complicated for me right now - maybe in the next
versions.

Might want to look at EventMachine: http://rubyeventmachine.com/

example (from lib/em/protocols/httpclient2.rb)

# === Usage
#
#  EM.run{
#    conn = EM::Protocols::HttpClient2.connect 'google.com', 80
#
#    req = conn.get('/')
#    req.callback{ |response|
#      p(response.status)
#      p(response.headers)
#      p(response.content)
#    }
#  }

You could create multiple such concurrent connections. And if your OS
supports it, EventMachine will make use of kqueue or epoll behind the
scenes for efficient handling of large numbers of I/O handles.

This is a simplistic example, but should work in principle:

sites = %w(aaa.com bbb.com ccc.com ddd.com eee.com fff.com) # etc...
conns = sites.map { |dn| EM::Protocols::HttpClient2.connect(dn, 80) }
conns.each do |conn|
  req = conn.get('/')
  req.callback do |response|
    p(response.status)
    p(response.headers)
    p(response.content)
  end
end

This all takes place in a single thread.

Regards,

Bill

Btw, what version of Ruby is this? IIRC there was a bug with
Array#shift up to 1.8.6 which could also cause these effects.

ruby 1.8.6 (2008-08-11 patchlevel 287) - it's strange, because I updated it
a week ago :)

Right now I'm testing on ruby 1.8.7 … the issue still exists :(

On Thu, Oct 22, 2009 at 1:12 PM, Rob D. [email protected] wrote:

I’m sure he didn’t mean (only) 1 process but several processes each
running a subset of the input space. It could also have the benefit
of utilizing multiple processing cores on one machine or across
multiple machines:

http://raa.ruby-lang.org/project/rq/
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/298739