Re: JRuby performance questions answered

znmeb / cesmail.net wrote:

Quoting C. Oliver N. <charles.nutter / sun.com>:

Many people believed we’d never be faster than the C implementation,
and many still think we’re slower. Now that I’ve set that record
straight, any questions?

  1. How long will it be before Alioth has some reasonable numbers
     for JRuby? As of yesterday, they still have you significantly
     slower than MRI. So I need to take JRuby out of my slides for
     RubyConf :) … I

The current published Alioth numbers are based on JRuby 1.0(ish), which
was generally 2-3x slower than MRI. I’m hoping the numbers will be
updated soon after the 1.1 releases…but it probably won’t happen
until 1.1 final comes out in December. If someone else wants to re-run
them for us, it would make us very happy :)

An “Update Programming Language” “Feature Request” will usually get our
attention.

Coincidentally, I did grab 1.1b1 so the benchmarks game has new
measurements

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=jruby

Wow, it would appear that JRuby is indeed faster, and indeed uses a lot
more memory :) (or maybe that’s just startup overhead). Thanks for a
good program!

Roger P. wrote:

Coincidentally, I did grab 1.1b1 so the benchmarks game has new
measurements

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=jruby

Wow, it would appear that JRuby is indeed faster, and indeed uses a lot
more memory :) (or maybe that’s just startup overhead). Thanks for a
good program!

It’s another well-known fact about running on the JVM that we have to
suck it up and accept there’s an initial memory chunk eaten up by every
JVM process. If one excludes that initial cost, most measurements have
us using less memory than C Ruby…so for very large apps we end up
coming out ahead. But for small, short apps, the initial slow startup
and high memory usage is going to be a battle we fight for a long time.

  • Charlie

Roger P. wrote:

If you run multiple threads I assume there isn’t an extra memory cost
for that–is that right?

Every thread is going to need its own stack, but that’ll be small
compared to the startup overhead. I’m sure Charles will elaborate.

I’m guessing that JRuby still doesn’t support continuations…? I
think that would require the “spaghetti stack” model, which would
remove most of the per-thread initial stack overhead.

Clifford H…

Roger P. wrote:

If you run multiple threads I assume there isn’t an extra memory cost
for that–is that right?

Yes, generally. They won’t be as light as Ruby’s green threads, but then
Ruby’s threads can’t actually run in parallel anyway.

  • Charlie

Clifford H. wrote:

Roger P. wrote:

If you run multiple threads I assume there isn’t an extra memory cost
for that–is that right?

Every thread is going to need its own stack, but that’ll be small
compared to the startup overhead. I’m sure Charles will elaborate.

Our threads will be a lot more expensive than Ruby’s, but a lot cheaper
than a separate process in either world.

I’m guessing that JRuby still doesn’t support continuations…? I
think that would require the “spaghetti stack” model, which would
remove most of the per-thread initial stack overhead.

Our official stance is that JRuby won’t support continuations until the
JVM does. We could emulate them by forcing a stackless implementation,
but it would be drastically slower than what we have now.

  • Charlie

Roger P. wrote:

Coincidentally, I did grab 1.1b1 so the benchmarks game has new
measurements

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=all&lang=jruby

Wow, it would appear that JRuby is indeed faster, and indeed uses a lot
more memory :) (or maybe that’s just startup overhead). Thanks for a
good program!

I just committed an addition to JRuby that allows you to spin up a
“server” JRuby instance (using “Nailgun”) in the background and feed it
commands. See the startup difference using this:

normal:

~/NetBeansProjects/jruby $ time jruby -e "puts 'hello'"
hello

real 0m1.944s
user 0m1.511s
sys 0m0.138s

nailgun:

~/NetBeansProjects/jruby $ time jruby-ng -e "puts 'hello'"
hello

real 0m0.103s
user 0m0.006s
sys 0m0.009s

Here’s a post from the JRuby list describing how to use this, for those
of you that are interested. Also, this allows you to avoid the startup
memory cost for every command you run since you can just issue commands
to that running server and it will re-use memory. After running a bunch
of commands on my system, that server process was still happily under
60M, and never went any higher.

I’ve got Nailgun working with JRuby just great now.

bin/jruby-ng-server
bin/jruby-ng

If you want to use the server, say if you’re going to be running a lot
of command-line tools, just spin it up in the background somewhere.

jruby-ng-server > /dev/null 2> /dev/null &

And then use the jruby-ng command instead, or alias it to “jruby”

alias jruby=jruby-ng

You’ll need to make the ng client command on your platform, by running
‘make’ under bin/nailgun, but then everything should function correctly.

jruby-ng -e "puts 'here'"

The idea is that users will have a new option to try. For JRuby, where
we have no global variables, no dependencies on static fields, and
already depend on our ability to spin up many JRuby instances in a
single JVM, this ends up working very well. It’s building off features
we already provide, and giving users the benefit of a fast,
pre-initialized JVM without the startup hit.

I think we’re probably going to ship with this for JRuby 1.1 now. It’s
working really well. I’ve managed to resolve the CWD issue by defining
my own “nailMain” next to our existing “main”, and ENV vars are being
passed along as well. The one big remaining complication I don’t have an
answer for just yet is OS signals; they get registered only in the
server process, so signals from the client don’t propagate through. It’s
fixable of course, by having the client register and the server just
listen for client signal events, but that isn’t supported in the current
NG. So there’s some work to do.

All the NG stuff is in JRuby trunk right now. Give it a shot. I’m
interested in hearing opinions on it.

  • Charlie

On 11/9/07, Roger P. [email protected] wrote:

Wow, it would appear that JRuby is indeed faster, and indeed uses a lot
more memory :) (or maybe that’s just startup overhead). Thanks for a
good program!

I wonder if JRuby uses reference counting for its Ruby objects (or if it
even matters), and if not maybe someday it would :) I’m just in a pro
reference counting mood these days :)

I very much doubt it.

Roger, you REALLY need to read the literature on GC which has been
accumulating for the past 50 years.

Reference counting is pretty much an obsolete approach to GC. It was
probably the first approach taken for Lisp back in the 1950s. Other
language implementations usually started with reference counting (e.g.
the first Smalltalk).

Its main advantage is that it’s easy to understand. On the other hand,
it incurs a large overhead, since counts need to be
incremented/decremented on every assignment. It can’t detect circular
lists of dead objects: in early Smalltalk programs, when reference
counting was used, you needed to explicitly nil out references to
break such chains. There’s also the issue of the overhead of storing
the reference count, and how many bits to allocate. Most reference
counting implementations punt when the reference count overflows: they
treat a ‘full’ count as an infinite count and no longer decrement it,
leading to more uncollectable objects.
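
In a rough, invented sketch (this is not MRI or any particular
implementation’s code; all the names are made up), naive reference
counting looks something like this:

/* Minimal, hypothetical reference-counting sketch. */
#include <limits.h>
#include <stdlib.h>

typedef struct Object {
    unsigned int rc;        /* the per-object count discussed above */
    struct Object *other;   /* a reference this object holds */
} Object;

static void rc_inc(Object *o) {
    /* "Punt" on overflow: a saturated count becomes effectively infinite. */
    if (o && o->rc != UINT_MAX) o->rc++;
}

static void rc_dec(Object *o) {
    if (!o || o->rc == UINT_MAX) return;  /* saturated objects are never freed */
    if (--o->rc == 0) {
        rc_dec(o->other);   /* release what we reference; note that a cycle of
                               dead objects keeps every count above zero forever */
        free(o);
    }
}

/* The per-assignment overhead: every store pays for two count updates. */
static void assign(Object **slot, Object *value) {
    rc_inc(value);
    rc_dec(*slot);
    *slot = value;
}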

Mark and sweep, such as is used in the Ruby 1.8 implementation, quickly
replaced reference counting as the simplest GC considered for real
use.

More modern GCs tend to be copying collectors, which move live objects
to new heap blocks, leaving the dead ones behind. And most use
generational scavenging, which takes advantage of the observation that
most objects either die quite young or live a long time. This approach
was pioneered by David Ungar in the Berkeley implementation of
Smalltalk-80, and it is the kind of GC typically used in JVMs today.

Which particular GC approach is best for Ruby is subject to some study.

Many of the usages of Ruby aren’t quite like those of Java or
Smalltalk. I had dinner with a former colleague, who happens to be
the lead developer of the IBM J9 Java virtual machine, and he made the
observation that Java, and Smalltalk before it, have a long history of
having their VMs tuned for long-running processes. On the other hand,
many Ruby usages are get in and get out. These use cases mean that
it’s more valuable to have rapid startup than a “perfect” GC that
reclaims all dead objects quickly (not that any of the current GCs
guarantee the latter).

So the best GC for Ruby might not be the same as would be used for a
JVM or Smalltalk VM, but I’m almost certain it would be a reference
counter.


Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Rick DeNatale wrote:

Reference counting is pretty much an obsolete approach to GC. It was
probably the first approach taken for lisp back in the 1950s. Other
language implementations usually started with reference counting (e.g.
the first Smalltalk).

Its main advantage is that it’s easy to understand.

I don’t think reference counting is any easier to understand than pure
mark-and-sweep or pure stop-and-copy. The main advantage of reference
counting, in my opinion, is that its restrictions force you to kick some
features out of your language design if you want to use it. :)

Mark and sweep, such as is used in the Ruby 1.8 implementation, quickly
replaced reference counting as the simplest GC considered for real
use.

My recollection is that mark-and-sweep was the original, and that
reference counting came later.

More modern GCs tend to be copying collectors, which move live objects
to new heap blocks, leaving the dead ones behind. And most use
generational scavenging, which takes advantage of the observation that
most objects either die quite young or live a long time. This approach
was pioneered by David Ungar in the Berkeley implementation of
Smalltalk-80, and it is the kind of GC typically used in JVMs today.

Bah … I actually found a reference a couple of days ago on this
(http://portal.acm.org/citation.cfm?id=91597). If you’re not signed up
for the ACM library it will cost you money to read it. But essentially
“pure” mark-and-sweep was replaced by stop-and-copy, which compacts the
heap. Then generational mark-and-sweep came along and “rehabilitated”
mark-and-sweep. Note the publication date – 1990. The abstract is free
– it reads:

“Stop-and-copy garbage collection has been preferred to mark-and-sweep
collection in the last decade because its collection time is
proportional to the size of reachable data and not to the memory size.
This paper compares the CPU overhead and the memory requirements of the
two collection algorithms extended with generations, and finds that
mark-and-sweep collection requires at most a small amount of additional
CPU overhead (3-6%) but requires an average of 20% (and up to 40%) less
memory to achieve the same page fault rate. The comparison is based on
results obtained using trace-driven simulation with large Common Lisp
programs.”

Which particular GC approach is best for Ruby is subject to some study.

I think at least for Rails on Linux, someone (assuming funding) could
collect and analyze plenty of data. I’d actually be surprised if someone
isn’t doing it, although I know I’m not. ;)

Many of the usages of Ruby aren’t quite like those of Java or
Smalltalk. I had dinner with a former colleague, who happens to be
the lead developer of the IBM J9 Java virtual machine, and he made the
observation that Java, and Smalltalk before it, have a long history of
having their VMs tuned for long-running processes. On the other hand,
many Ruby usages are get in and get out. These use cases mean that
it’s more valuable to have rapid startup than a “perfect” GC that
reclaims all dead objects quickly (not that any of the current GCs
guarantee the latter).

Well … OK. If you want to distinguish between long running (server)
and rapid startup (client), that’s fine. But look at the marketplace. We
have servers, we have laptop clients, we have desktop clients, we have
mobile clients, and we have bazillions of non-user-programmable
computers like DVD players, iPods, in-vehicle navigation systems, etc.

Now while the hard-core hackers like me wouldn’t buy an iPod or a DVD
player, preferring instead to add hard drive space to a real computer,
Apple isn’t exactly going broke making iPods and iPhones that are (for
the moment, anyhow) closed to “outsiders”. And I’m guessing that, while
you can run Ruby on, say, an embedded ARM/Linux platform, most of the
software in those gizmos is written in C and heavily optimized.

I’ve got a couple of embedded toolkits, and I’ve actually built Ruby for
them, but when you only have 32 MB of RAM, you don’t want to collect
garbage – you don’t even want to generate garbage! So I wouldn’t
personally spend much time thinking about garbage collection for rapid
startup. If you want rapid startup, you’re going to have as much binding
as possible done at compile time – you aren’t even going to compile a
Ruby script to an AST when you start a process up.

So the best GC for Ruby might not be the same as would be used for a
JVM or Smalltalk VM, but I’m almost certain it would be a reference
counter.

Did you mean to say, “not be a reference counter”?

Those features being “finalizers”? IMHO its main advantage is being
prompt, so you don’t have to worry about resources hanging around after
they’re no longer needed.

I agree–it seems that the promptness would allow it to take advantage
of the CPU caches and still be fast.
The disadvantage, as some people above have pointed out, is that you may
lose compactness of the heap space.
Also, it requires extensions’ “container” objects (those that include
references to other objects that might somehow create cycles) to provide
a “traverse” function which yields a list of accessible pointers, so that
you can traverse containers and stomp cycles every so often. Very
similar to today’s gc_mark function that they already provide.
Today’s extensions would also have to be slightly rewritten to use
“dec” and “inc” functions for the reference counts of contained objects
(similar to their gc_mark function, again).
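
As a purely hypothetical sketch of those hooks (rc_inc, rc_dec, MyList,
and the traverse signature are all invented, by analogy with gc_mark;
only VALUE is real MRI):

/* Invented illustration of what a container extension might provide
 * under a reference-counted Ruby. Not real API. */
#include "ruby.h"

typedef struct MyList {
    VALUE *items;
    long len;
} MyList;

/* Invented stand-ins for the "inc"/"dec" functions discussed above. */
static void rc_inc(VALUE v) { (void)v; /* bump v's count */ }
static void rc_dec(VALUE v) { (void)v; /* drop v's count */ }

/* Analogous to today's gc_mark: yield every contained reference so
 * the runtime can walk containers and break cycles every so often. */
static void mylist_traverse(MyList *list, void (*visit)(VALUE)) {
    for (long i = 0; i < list->len; i++)
        visit(list->items[i]);
}

/* Stores into the container pay for the two count updates explicitly. */
static void mylist_store(MyList *list, long i, VALUE v) {
    rc_inc(v);
    rc_dec(list->items[i]);
    list->items[i] = v;
}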

So anyway, I agree–promptness is good.
I don’t know too much on the subject, though, having never read a paper
on it :)
-Roger

Roger P. wrote:

I agree–it seems that the promptness would allow it to take advantage
of the CPU caches and still be fast.
The disadvantage, as some people above have pointed out, is that you may
lose compactness of the heap space.

I’m not sure on this one. Given that a compacting collector needs
several times as much RAM available as in use to be efficient, and that
a reference-counting collector probably gives no more fragmentation than
malloc, it’s hard to say which way locality would go.

The traditional objection to reference counting is that you spend a lot
of time adjusting reference counts. But with CPUs so much faster
than RAM nowadays, that may matter less. Anyway, for more than you
ever wanted to know about GC, here’s a slightly-dated but still
excellent survey paper:

ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps

I’m not sure on this one. Given that a compacting collector needs
several times as much RAM available as in use to be efficient, and that
a reference-counting collector probably gives no more fragmentation than
malloc, it’s hard to say which way locality would go.

No joke, sometimes I agree and malloc is just ‘good enough’ :)

The traditional objection to reference counting is that you spend a lot
of time adjusting reference counts. But with CPUs so much faster
than RAM nowadays, that may matter less. Anyway, for more than you
ever wanted to know about GC, here’s a slightly-dated but still
excellent survey paper:

ftp://ftp.cs.utexas.edu/pub/garbage/bigsurv.ps

Thank you. I’ve wondered about this myself, as, to my limited
knowledge, a generational GC would need to ‘alias’ everything that’s
allocated (so it could move them to different generations), which would
involve a memory redirection. I could be wrong. If so, then that’s a
drawback to it. Whereas for RC, like you said, the objects themselves
are already in cache, so the CPU can inc them quickly, and, IMO, in the
lifetime of an object, how many times is it going to be inc’ed? Maybe a
few times, plus once per scope change where it is assigned? Seems not
too often, as typically few objects are within a given scope,
AFAIK–maybe class variables and local variables. I would imagine that
the counts aren’t changed all that much, and, if they are, at least it’s
not touching every object in memory (the way mark and sweep does), and
it spreads the GC work over time instead of into huge show-stoppers.

Just my latest $.02, letting off steam.
Have a good evening.
-Roger

M. Edward (Ed) Borasky wrote:

Rick DeNatale wrote:

Its main advantage is that it’s easy to understand.

I don’t think reference counting is any easier to understand than pure
mark-and-sweep or pure stop-and-copy. The main advantage of reference
counting, in my opinion, is that its restrictions force you to kick some
features out of your language design if you want to use it. :)

Those features being “finalizers”? IMHO its main advantage is being
prompt, so you don’t have to worry about resources hanging around after
they’re no longer needed.

On Nov 10, 9:45 pm, Roger P. [email protected] wrote:

I’m not sure on this one. Given that a compacting collector needs
several times as much RAM available as in use to be efficient, and that
a reference-counting collector probably gives no more fragmentation than
malloc, it’s hard to say which way locality would go.

No joke, sometimes I agree and malloc is just ‘good enough’ :)

Heap fragmentation is quite a big problem with malloc; you can see
that just by the number of malloc replacements and other memory
allocation frameworks that have been written over the years.

…to my limited knowledge, a generational GC would need to ‘alias’
everything that’s allocated (so it could move them to different
generations), which would involve a memory redirection. I could be wrong.

You are. Generational GCs (I wrote one for Rubinius) do not need double
the memory, as I assume you’re implying. They use what’s called a write
barrier (a small chunk of code) that runs whenever an object reference
is stored in another object. This code is very small and simply updates
a small table. That table is used by the GC to make sure that it runs
properly and can update object references as objects move around.
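
A very simplified sketch of that idea (the structure is invented for
illustration; this is not Rubinius source):

/* Invented sketch of a generational write barrier. A store into an
 * object runs this small piece of code, which records mature objects
 * that now point into the young generation; the collector scans just
 * this table and can fix up pointers when it moves objects. */
#include <stddef.h>

typedef struct Object {
    int generation;          /* 0 = young, 1+ = mature */
    int remembered;          /* already recorded in the table? */
    struct Object *field;    /* a reference this object holds */
} Object;

#define TABLE_SIZE 4096
static Object *remembered_set[TABLE_SIZE];
static size_t remembered_count = 0;

static void write_barrier(Object *holder, Object **slot, Object *value) {
    *slot = value;
    /* Only a mature object acquiring a reference to a young object
     * matters; everything else is found by tracing anyway. */
    if (value && holder->generation > 0 && value->generation == 0 &&
        !holder->remembered && remembered_count < TABLE_SIZE) {
        holder->remembered = 1;
        remembered_set[remembered_count++] = holder;
    }
}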

If so, then that’s a
drawback to it. Whereas for RC, like you said, the objects themselves
are already in cache, so the CPU can inc them quickly, and, IMO, in the
lifetime of an object, how many times is it going to be inc’ed? Maybe a
few times, plus once per scope change where it is assigned? Seems not
too often, as typically few objects are within a given scope,
AFAIK–maybe class variables and local variables. I would imagine that
the counts aren’t changed all that much, and, if they are, at least it’s
not touching every object in memory (the way mark and sweep does), and
it spreads the GC work over time instead of into huge show-stoppers.

I suggest you look at all the research done on reference counting
algorithms versus sweep ones. Most if not all research shows that
reference counting is slower and more prone to bugs than modern
techniques.

On Nov 11, 6:11 am, Laurent S. [email protected]
wrote:

On a completely unrelated note, I was wondering… how did you manage
to keep compatibility with existing C extensions without requiring the
developer to explicitly set write barriers, in some cases?

The key is that C extensions don’t have direct access to object
references. A C extension accesses all objects via a handle table. A
handle is what a C extension sees as an object. This lets the GC
mutate objects (which are also in the handle table) but keep the
handles at constant addresses (so they can be stored on the C stack).
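
Roughly, the indirection looks like this (a sketch with invented names,
not Rubinius source):

/* Invented sketch of handle-table indirection. The extension holds a
 * stable Handle*, while the GC is free to move the underlying object
 * and rewrite the slot behind it. */
typedef struct RObject RObject;   /* opaque to extensions */

typedef struct Handle {
    RObject *obj;                 /* rewritten by the GC on moves */
} Handle;

/* What the extension sees as "an object": a handle, never a raw
 * object pointer, so it can safely be stored on the C stack. */
static RObject *handle_deref(Handle *h) {
    return h->obj;
}

/* When the collector relocates an object, it patches the slot; every
 * Handle* the extension stored remains valid. */
static void gc_object_moved(Handle *h, RObject *new_location) {
    h->obj = new_location;
}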

The big problem with this approach is the RARRAY(), RSTRING(), etc.
macros, which access an object directly as a C data structure. That’s
the main reason for trying to move MRI away from using these macros and
toward something that looks like a function call, which we in Rubinius
can implement differently.
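
For instance (a sketch against the 1.8-era C API; RARRAY, rb_ary_new3,
and rb_ary_entry are existing MRI calls, shown here only to contrast
the two styles):

#include "ruby.h"

static void example(void) {
    VALUE ary = rb_ary_new3(2, INT2FIX(1), INT2FIX(2));

    /* The problem case: macros that hand out raw pointers into the
     * interpreter's own heap layout. */
    VALUE *raw = RARRAY(ary)->ptr;
    long len   = RARRAY(ary)->len;

    /* The function-call style: whatever sits behind rb_ary_entry is
     * free to be a handle lookup, a copy, or anything else. */
    VALUE first = rb_ary_entry(ary, 0);

    (void)raw; (void)len; (void)first;
}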

  • Evan

[email protected] wrote:

The big problem with this approach is the RARRAY(), RSTRING(), etc.
macros, which access an object directly as a C data structure. That’s
the main reason for trying to move MRI away from using these macros and
toward something that looks like a function call, which we in Rubinius
can implement differently.

Must the extension code also lock the handle, and unlock it in rb_ensure?

[email protected] wrote:

The big problem with this approach is the RARRAY(), RSTRING(), etc.
macros, which access an object directly as a C data structure. That’s
the main reason for trying to move MRI away from using these macros and
toward something that looks like a function call, which we in Rubinius
can implement differently.

This is also, incidentally, why JRuby doesn’t support extensions yet.
The same techniques in Rubinius would apply equally well to JRuby
through a JNI-level Ruby API. But so long as extensions abuse their
direct memory access privileges, neither Rubinius nor JRuby can run
them.

  • Charlie

On Nov 11, 2007 9:35 AM, [email protected] [email protected] wrote:

On a completely unrelated note, I was wondering… how did you manage
to keep compatibility with existing C extensions without requiring the
developer to explicitly set write barriers, in some cases?

(Sorry for being off-topic.)

Laurent