Why don't Ruby libraries share memory?

This paragraph is motivation. While my question is not Rails-specific, I am asking it because of Rails. I’ve been investigating the memory footprint of my Mongrels. It is nice that they share the .so libraries from ImageMagick as well as other C libraries. However, each one still has about 20MB in [heap]. My theory is that a lot of this comes from ActiveRecord and friends getting loaded again and again for each Mongrel, which seems to me entirely unnecessary. My “marginal cost of Apache” is 1376kB. My “marginal cost of Mongrel” is 27528kB with the code I wrote. It seems that the latter could be reduced a lot by sharing some Ruby libraries.

The question is as follows: if I require ‘library’ in one instance of Ruby and then require ‘library’ again in another instance of Ruby, then do I get duplicate copies of library’s code in two chunks of my RAM? (I’m thinking I do.) Why?

For further details and perhaps clarification, consider the following script:

require 'smaps_parser'

smaps = SmapsParser.new(Process.pid)
puts smaps.sums.inspect

%w{rubygems active_record action_controller action_view RMagick}.each do |l|
  puts "\nRequiring #{l}."
  require l
  smaps.refresh
  puts smaps.sums.inspect
end
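(The smaps_parser library itself isn’t shown here. For reference, a minimal sketch of what such a parser might look like, assuming Linux’s /proc/<pid>/smaps format; the class and method names match the script above, but the implementation is my reconstruction, not the original.)

class SmapsParser
  attr_reader :sums

  def initialize(pid)
    @pid = pid
    refresh
  end

  # Re-read /proc/<pid>/smaps and total each field across all mappings.
  # Values are in kB, matching the output below.
  def refresh
    @sums = Hash.new(0)
    File.foreach("/proc/#{@pid}/smaps") do |line|
      if line =~ /^(\w+):\s+(\d+) kB/
        @sums[$1.downcase.to_sym] += $2.to_i
      end
    end
    @sums
  end
end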

Though my Mongrel processes have already (each?) loaded copies of each l, and though there is nothing “private” about the code in each l, I get the following output, in which one should pay particular attention to the increase of [:private_dirty]:

{:rss=>1520, :shared_clean=>964, :shared_dirty=>0, :private_clean=>12, :size=>2968, :private_dirty=>544}

Requiring rubygems.
{:rss=>5032, :shared_clean=>1676, :shared_dirty=>0, :private_clean=>224, :size=>7476, :private_dirty=>3132}

Requiring active_record.
{:rss=>12920, :shared_clean=>1816, :shared_dirty=>0, :private_clean=>224, :size=>15452, :private_dirty=>10880}

Requiring action_controller.
{:rss=>18680, :shared_clean=>1828, :shared_dirty=>0, :private_clean=>228, :size=>21152, :private_dirty=>16624}

Requiring action_view.
{:rss=>21088, :shared_clean=>1828, :shared_dirty=>0, :private_clean=>228, :size=>23524, :private_dirty=>19032}

Requiring RMagick.
{:rss=>22512, :shared_clean=>2660, :shared_dirty=>0, :private_clean=>228, :size=>29792, :private_dirty=>19624}

On Aug 13, 2007, at 14:25, Matt H. wrote:

The question is as follows: if I require ‘library’ in one instance
of Ruby and then require ‘library’ again in another instance of
Ruby, then do I get duplicate copies of library’s code in two
chunks of my RAM? (I’m thinking I do.) Why?

Matt,

Every time you load a Mongrel instance, it loads a completely new
Ruby runtime environment, which can be modified in any way.

If you think about it, if each Mongrel instance did share the
libraries, then one running application on the system modifying some
of the library code at runtime would affect all other running
instances. This could be… bad, at best.

So to answer what I think is your question: yes, you get duplicate
copies in two “chunks” of your RAM.

Hope this helps,

~Wayne

Wayne E. Seguin
Sr. Systems Architect & Systems Administrator

On 8/13/07, Matt H. [email protected] wrote:

The question is as follows: if I require ‘library’ in one instance of Ruby
and then require ‘library’ again in another instance of Ruby, then do I get
duplicate copies of library’s code in two chunks of my RAM? (I’m thinking I
do.) Why?

I suppose the main problem is that Rails (or ActiveRecord, I don’t
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several mongrels
instead of one multi-threaded mongrel.

I don’t know exactly where the problem with rails/ar is, though, nor
whether it is at least theoretically solvable.

On 8/13/07, Jano S. [email protected] wrote:

I don’t know exactly where the problem with rails/ar is, though, nor
whether it is at least theoretically solvable.

And one more note: you can save a bit of memory if you put the
thread-safe code into one DRb server, although it’s most probably not
worth the effort.
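As a rough sketch of that idea (the SharedWorker class and its transform method are made-up placeholders for whatever memory-heavy, thread-safe code you would centralize), the server process would look something like:

require 'drb'

class SharedWorker
  def transform(text)
    text.upcase   # stand-in for the real, memory-heavy work
  end
end

# Only this one process pays the library's memory cost.
DRb.start_service('druby://localhost:9999', SharedWorker.new)
DRb.thread.join

And each mongrel would talk to it over the socket:

require 'drb'

DRb.start_service
worker = DRbObject.new_with_uri('druby://localhost:9999')
puts worker.transform('hello')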

On 8/13/07, Jano S. [email protected] wrote:

The question is as follows: if I require ‘library’ in one instance of Ruby
and then require ‘library’ again in another instance of Ruby, then do I get
duplicate copies of library’s code in two chunks of my RAM? (I’m thinking I
do.) Why?

I suppose the main problem is that Rails (or ActiveRecord, I don’t
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several mongrels
instead of one multi-threaded mongrel.

It’s actually what Wayne mentioned. Since all Ruby classes can be
modified at runtime, it would be very scary to share them across
separate process instances unless you explicitly wanted that behavior.

As a naive example, consider this:

require "set"   # => true

class Set
  def icanhasset
    puts "Oh hai, I is an instance method"
  end
end

Set[].icanhasset
# prints: Oh hai, I is an instance method

Imagine this shared across separate processes running different types
of code. Any modifications would be shared, which means you couldn’t
meaningfully modify any classes without expecting problems or weird
bugs. That takes away half of the fun (and utility) of Ruby right
there. :)

-greg

Thanks for your reply; but I am still wondering about a more general question. It is my understanding that require ‘file.rb’ will execute the code in file.rb, so if I require the same file in two different Ruby processes, then I have duplicates of its classes in memory. If I do require ‘c_library.so’ then c_library.so will be loaded as a shared library. I understand that one process might want to override some methods in file.rb, while the other one might depend on the original versions being intact. This is a good reason to load the library twice.

In some instances, though, you might know that there are not going to be overrides, perhaps child classes at most. In this case the classes (not the objects) could be shared. Is there a way to do “shared Ruby libraries” and have them act like shared C libraries? (This question reveals my ignorance of how shared C libraries and OS kernels interact, but I suspect that C (or any compiled?) libraries are special.) DRb is not really what I am asking about; I mean to share only classes, not objects.

I have plenty of RAM to run my three Mongrel processes, which are already overkill for serving a whopping 30 visits and 900 hits per day. (Shameless plug of http://www.teamdawg.org if you want to help me out with some more load.) Therefore, I am not really trying to do anything, just theorizing.

I have seen widespread criticism of Rails as a poorly-scalable memory hog, to which there are replies ranging from “Optimize your code” (often due to ActiveRecord::Base.find generating lots of SELECT * queries), to “Buy more RAM and servers until you bring down your database” (which will happen pretty quickly with egregious SELECT *), to “Check your logs; your database is already the problem.” I think Rails is great and Ruby is even greater; in fact, I want to see them take over the world. It could happen a lot faster if we could address criticisms like the above, and when a library is as large as ActiveRecord, loading it even one time too many is already cause for criticism.

Sorry, I started talking about Rails again. The question is not about Rails. The questions are: Is there any way we can have shared Ruby libraries without turning the relevant code into a C extension? Is it necessary that code be compiled to be put into shared memory by the OS? (Feel free to tell me I’m being really stupid.) For instance, for all the GTK applications you run, your system needs to load GTK only once. It would be really nice if this could be true of Ruby libraries. I have a feeling that this may just be a limitation of interpreted languages. Please explain.

“Jano S.” [email protected] wrote in message
news:[email protected]

Sorry, I started talking about Rails again. The question is not about
Rails. The questions are: Is there any way we can have shared Ruby
libraries without turning the relevant code into a C extension? Is it
necessary that code be compiled to be put into shared memory by the
OS?

C libraries have two nice features: they are read-only, and they are ready
to use in a disk file. So a modern OS can just map that file into memory
and use the same physical memory for every process accessing the file.

Ruby classes aren’t read-only. But you could just write the changes into
memory and keep the constant part in a file. So the real problem is that
the source isn’t ready to use for a modern interpreter. The data
structures the compiler works on are nowhere on disk; they are
dynamically created when the source files are evaluated.

The source files themselves don’t hog memory; they can be freed after
being parsed, or just memory-mapped. It’s the parser’s result, the
structures the interpreter works on, that needs the memory.

Ruby would have to use “precompiled” source files to be able to use
memory mapping. It could use copy-on-write to dynamically change the
code. There is still much work happening on Ruby, so there may be a
chance for that.

mfg, simon … l

On Aug 13, 2007, at 11:25, Matt H. wrote:

The question is as follows: if I require ‘library’ in one instance
of Ruby and then require ‘library’ again in another instance of
Ruby, then do I get duplicate copies of library’s code in two
chunks of my RAM? (I’m thinking I do.) Why?

You’ll get closer to the behavior you expect if you use Kernel#fork
to spawn new instances rather than starting up from the shell.

This is how Apache costs only 1376kB.
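A minimal sketch of the fork approach (serve_requests is a hypothetical stand-in for whatever worker loop each child should run; note Daniel’s caveat just below about 1.8’s garbage collector dirtying these shared pages):

require 'rubygems'
require 'active_record'   # load the heavy libraries once, in the parent

pids = (1..3).map do
  fork do
    # Each child starts with the parent's pages; under copy-on-write,
    # physical memory stays shared until a page is written to.
    serve_requests
  end
end

pids.each { |pid| Process.wait(pid) }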

Daniel DeLorme wrote:

The problem goes further than that. Even if you were to load your libs
in one process and then fork off worker processes (using copy-on-write
to share loaded code), the garbage collector writes to every page in
memory when doing a garbage collecting run, thus negating the benefits
of COW. It’s fixed in 1.9, thankfully, but 1.8 is going to be a memory
hog no matter which way you look at it.

What does 1.9 do differently?

Matt H. wrote:

Sorry, I started talking about Rails again. The question is not about
Rails. The questions are: Is there any way we can have shared Ruby
libraries without turning the relevant code into a C extension? Is it
necessary that code be compiled to be put into shared memory by the OS?

The problem goes further than that. Even if you were to load your libs
in one process and then fork off worker processes (using copy-on-write
to share loaded code), the garbage collector writes to every page in
memory when doing a garbage collecting run, thus negating the benefits
of COW. It’s fixed in 1.9, thankfully, but 1.8 is going to be a memory
hog no matter which way you look at it.

Daniel

On Aug 13, 2007, at 20:14, Daniel DeLorme wrote:

every page in memory when doing a garbage collecting run, thus
negating the benefits of COW. It’s fixed in 1.9, thankfully, but
1.8 is going to be a memory hog no matter which way you look at it.

For .so files, no, for .rb files, yes.

On Tue, 14 Aug 2007, Jano S. wrote:

I suppose the main problem is that Rails (or ActiveRecord, I don’t
know exactly) is not thread-safe. That means you cannot share most of
its data. That is the reason why you have to run several mongrels
instead of one multi-threaded mongrel.

This is not the issue, really, when talking about issues of throughput with Ruby. Only in uncommon cases will a multithreaded Ruby program actually deliver greater processing throughput than a single-threaded Ruby program. Basically, you have to have external latencies that can be captured without your code waiting on those latencies inside of an extension.

If you run Rails (or some other web framework app) in a multithreaded mode, then yes, more than one request can be inside of the code, being handled at the same time. The handling of each of those requests will be substantially slower, though, than if a process handles a single request at a time. It may be a win with regard to app behavior, if there are fast actions in the same app alongside very slow actions, and one only wants to run a single, or a very small number of, processing nodes (mongrels), because it lets the fast actions run to completion without requiring them to wait on the slow actions. But from the POV of overall throughput, it is not a win.
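As a rough illustration of the kind of external latency that can be captured (the URLs are placeholders):

require 'net/http'
require 'uri'

urls = %w{http://example.com/a http://example.com/b http://example.com/c}

# While one thread waits on its socket, the others can run, so total
# wall time approaches that of the slowest request rather than the sum
# of all three. CPU-bound work sees no such benefit.
threads = urls.map { |u| Thread.new { Net::HTTP.get(URI.parse(u)) } }
bodies  = threads.map { |t| t.value }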

Kirk H.