Your thoughts/philosophies on manual garbage collection

dkmd_nielsen · March 8, 2007, 11:15pm

The process that initiated my message earlier (about deleting array
elements) is a rather long-running process of rebuilding and
reconfiguring parameter files. There are hundreds of files, each with as
many as 22,000 parameters to be processed. For example, four small test
files ran in about two minutes. There is a ton of string manipulation
going on, which probably translates into lots of trailing string parts
and pointers lying around RAM…clogging it up. I was thinking of
manually initiating garbage collection after every five or ten files
processed. Is that a smart thing?

What are your thoughts on manually initiated garbage collection?
What kinds of practices result in bits and pieces of objects and
pointers being left laying around in the ether of RAM? Are there
tools that help see what happens to RAM while a process runs, like a
debugger does with variables?

Thanks for everything
dvn

dkmd_nielsen · March 8, 2007, 11:19pm

On Fri, 9 Mar 2007, dkmd_nielsen wrote:

What are your thoughts on manually initiated garbage collection?
What kinds of practices result in bits and pieces of objects and
pointers being left laying around in the ether of RAM? Are there
tools that help see what happens to RAM while a process runs, like a
debugger does with variables?

Thanks for everything
dvn

if you can fork - that’s the best - then you just let each child’s death
clean up that sub-segment of work’s memory.

-a

dkmd_nielsen · March 9, 2007, 11:10am

On 08.03.2007 23:18, [email protected] wrote:

processed. Is that a smart thing?
To OP: generally “manual” GC is considered bad since it interferes with
the automatic mechanism.

clean up that sub-segment of work’s memory.

Also, forking has the added advantage of making better use of multi-core
CPUs.

If you do encounter excessive memory usage then you should

a) make sure you do not hold onto stuff longer than needed

b) check your algorithms for inefficient handling of objects; since you
mention string processing, this is a typical gotcha:

s += "foo" # creates a new string
s << "foo" # just appends to s

Another one:

a = []
a += ["foo", "bar"] # creates another array
a << "foo" << "bar" # just appends
a.concat ["foo", "bar"] # just appends

c) If the files you are processing are large, then you might also try
some kind of stream processing where you do not have to keep the whole
file’s content in memory (if that’s applicable to your problem domain).
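
To make points b) and c) concrete, here is a minimal sketch. The method names, the `path` argument, and the "key=value" line format are assumptions made for illustration only; they are not taken from the original poster's code.

```ruby
# Point b): `<<` mutates the receiver, so the variable keeps pointing
# at the same object.
def shovel_keeps_object?
  s = String.new("params")   # String.new gives an unfrozen string
  id = s.object_id
  s << "foo"
  s.object_id == id
end

# Point b): `+=` builds a new String and rebinds the variable; the old
# string becomes garbage.
def plus_equals_keeps_object?
  s = String.new("params")
  id = s.object_id
  s += "foo"
  s.object_id == id
end

# Point c): read one line at a time instead of loading the whole file.
# The "key=value" format is a made-up stand-in for a parameter file.
def count_parameters(path)
  count = 0
  File.foreach(path) { |line| count += 1 if line.include?("=") }
  count
end
```

With 22,000 parameters per file, each `+=` inside a loop leaves one more dead string for the collector, whereas `<<` reuses the buffer it already has.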

Kind regards

robert

dkmd_nielsen · March 13, 2007, 12:53pm

Joel VanderWerf wrote:

parent has large heap, and

child lifespan and allocation rate are such that it does not need to GC

Some benchmarks:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/186561

I looked for some extra information on this topic and found:
http://blog.beaver.net/2005/03/ruby_gc_and_copyonwrite.html

That’s pretty disheartening news to me. I had plans to make an fcgi-like
process manager that would take advantage of copy-on-write to reduce the
memory footprint of a webapp by pre-loading all libraries in the parent
process. But if Ruby’s GC renders COW useless… there’s not much point
anymore.

Are there any plans to optimize ruby to make it fork-friendly?

Daniel

dkmd_nielsen · March 13, 2007, 4:34pm

On Mar 11, 2007, at 3:26 PM, Joel VanderWerf wrote:

[email protected] wrote:

if you can fork - that’s the best - then you just let each child’s
death clean up that sub-segment of work’s memory.

One caution: mark-and-sweep GC and fork don’t always play well
together, in terms of sharing memory pages. The mark algorithm
needs to touch all live objects in the heap. The child inherits the
parent’s heap, with copy on write.

I think you are describing a different situation than the OP and Ara.

If you’ve got hundreds of files to process and the processing is
sufficiently complex to justify forking for each file, then the parent
just iterates over the file list, forking and waiting for each child to
process each file. The parent’s address space won’t have all the stale
objects generated by the child’s processing, so each new child starts
with a reasonable memory footprint.

One fork per file is the easiest to program, but if that is
problematic for some reason you could batch things up pretty easily.
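
The one-fork-per-file loop described above might look roughly like the sketch below. `process_file` is a hypothetical placeholder for the real parameter rebuilding, and `process_each_in_child` is a name invented here; the sketch also assumes a platform where `fork` is available (not Windows).

```ruby
# Hypothetical stand-in for the real per-file work.
def process_file(path)
  File.foreach(path) { |line| line.strip.upcase }
end

# Forks one child per file and waits for it before starting the next.
# Everything the child allocates is released when the child exits, so
# the parent's heap stays small. Returns one success flag per file.
def process_each_in_child(paths)
  paths.map do |path|
    pid = fork do
      process_file(path)
      exit!(0)                      # leave without running at_exit hooks
    end
    _, status = Process.wait2(pid)  # block until this child is done
    status.success?
  end
end
```

Batching would mean passing each child a slice of the list (for example via `each_slice`) instead of a single path.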

Gary W.

dkmd_nielsen · March 11, 2007, 8:26pm

[email protected] wrote:

processed. Is that a smart thing?
if you can fork - that’s the best - then you just let each child’s death
clean
up that sub-segment of work’s memory.

One caution: mark-and-sweep GC and fork don’t always play well together,
in terms of sharing memory pages. The mark algorithm needs to touch all
live objects in the heap. The child inherits the parent’s heap, with
copy on write. If the parent has a large heap, and the child does a GC,
all those pages are copied into the child’s address space. Memory
usage will scale badly as the number of child processes grows. (Perhaps
you factor your process into one child for each of the hundreds of
files?)

It can be a good idea to GC.disable in the child, in some cases:

the parent has a large heap, and
the child’s lifespan and allocation rate are such that it does not need
to GC
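
A minimal sketch of that idea, assuming a short-lived child and a platform with `fork`; the helper name `run_without_gc_in_child` is made up for this example:

```ruby
# Hypothetical helper: runs the given block in a forked child with the
# collector switched off, so the child never marks (and therefore never
# copies) the pages it inherited from the parent.
def run_without_gc_in_child
  pid = fork do
    GC.disable   # only affects the child process
    yield
    exit!(0)     # leave without running at_exit hooks
  end
  _, status = Process.wait2(pid)
  status.success?
end
```

This only makes sense under the two conditions above: with GC disabled, a child that allocates heavily or runs for a long time will simply keep growing.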

Some benchmarks:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/186561