Perf optimization using profile results

I need some help with optimizing a set of libraries that I use. They are
ffi-rzmq, zmqmachine and rzmq_brokers (all up on github).

I ran a ‘latency’ test using the highest-level library (rzmq_brokers)
and noticed that, as I add each network client to the test, each
iteration takes another 300 microseconds to complete. From my testing,
JRuby is far and away the fastest runtime executing my code so I have
used it to gather some profile statistics. Right now I have some local
changes in my zmqmachine and rzmq_brokers repositories that haven’t been
pushed to github, so another party can’t currently reproduce what I am
seeing. I am working right now to clean up those commits so that I can
push them out.

I did two runs with JRuby using the --server option. The first run was
with --profile to produce the flat profile results. The second run was
with --profile.graph to produce the graph profile results.

I would be grateful if another set (or several sets) of eyes could take
a look at the profile results and suggest places I should focus on for
reducing that 300 usec latency per client. I have already reduced it
from 700 usec to 300 usec by replacing a regular Array with a
SortedArray (where data is stored sorted so that lookups are fast). I’m
hoping to get another 50% reduction.
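The SortedArray idea, roughly (this is a simplified sketch with made-up method names, not the actual zmqmachine class): keep elements sorted on insert so lookups can use binary search instead of a linear scan.

```ruby
# Simplified sketch of the SortedArray idea (not the actual zmqmachine
# class): insert in sorted position so membership tests are O(log n).
class SortedArray
  def initialize
    @ary = []
  end

  def add(value)
    # find the first index whose element is >= value and insert there
    idx = @ary.bsearch_index { |e| e >= value } || @ary.size
    @ary.insert(idx, value)
    self
  end

  def include?(value)
    # O(log n) membership test via find-any bsearch
    !@ary.bsearch { |e| value <=> e }.nil?
  end

  def to_a
    @ary.dup
  end
end
```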

Thanks!

cr

On Dec 13, 2011, at 9:57 AM, Chuck R. wrote:

Thanks!
In case someone wants to try to duplicate what I am doing, all of the
code has been pushed up to github. I’m really hoping someone can look at
the profile results and suggest 2-3 places that are ripe for
optimization, but in the event you are feeling frisky, you can now
reproduce the whole shebang on your own machine.

The aforementioned code repositories are:

git://github.com/chuckremes/ffi-rzmq.git

git://github.com/chuckremes/zmqmachine.git

git://github.com/chuckremes/rzmq_brokers.git

In rzmq_brokers/test, run:

ruby test_consensus.rb <any open port number, e.g. 5556>

The test adds a new network client for each 10_000 iterations and prints
the average round-trip latency of those 10k messages. As each client is
added, latency goes up about 300 usec (on my machine, YMMV). I would
like to reduce that number.

cr

On Dec 13, 2011, at 7:03 PM, Charles Oliver N. wrote:

A couple quick observations.

  • I assume it’s expected that the majority of time is spent in
    zmq_poll, which is an FFI Call to zmq. Excluding that from
    consideration…

Yes. The call to zmq_poll is essentially time spent blocked on I/O.

  • This is bad:

16.57 0.76 15.81 3079083 Class#new

Three million new classes created. I’m sure it’s singleton somewhere,
but that’s often a big perf sink…not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

Is there a way for JRuby to output more detail on those classes? For
example, can it tell me how many of each class type have been allocated?
If so, I could use that information to figure out the critical path and
try to reduce the amount of class creation there.

Also, I’m not certain what you mean by “I’m sure it’s singleton
somewhere.” I know what a singleton is, but I don’t understand your use
of it in this context. Elaborate, please.

  • Going down the list, I don’t see anything obvious to optimize in
    most of the hot methods. Usually when you have normalish-looking code
    that’s really hot, the next thing to optimize is reducing object
    allocation as much as possible. In fact, that more than anything often
    improves perf on JRuby…but I’m not sure the effect on other impls.

I’ll take a shot at reducing my object allocation. More information on
how many of each type of object is being allocated would be a wonderful
statistic to have. Also, if there were a way to trace that allocation
back to a particular file and line number (for those cases where the
same object is allocated in multiple places), that would be a killer
profiling feature. <hint, hint>

Thanks for the pointers. I appreciate the help.

cr


On Wed, Dec 14, 2011 at 9:04 AM, Charles Oliver N.
[email protected] wrote:

It will be very slow, but Sun/Oracle/OpenJDK has an object allocation
profiler built in. Examples:

jruby -J-Xrunhprof … # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 … # … call stacks 10 deep

That’s a full profiler. You can significantly reduce its performance
hit with the cpu=samples option. The inaccuracy this introduces is
probably negligible since hotspots will still be at the top. Results
can be analyzed with HPJMeter, a free tool from HP:
http://www.javaperformancetuning.com/tools/hpjmeter/index.shtml

There is a more lightweight alternative for just looking at instance
counts and sizes per class:

$ jmap -histo <pid>

The process ID can usually be found via jps, ps, or other means.

Kind regards

robert

On Tue, Dec 13, 2011 at 8:03 PM, Chuck R. [email protected]
wrote:

Three million new classes created. I’m sure it’s singleton somewhere,
but that’s often a big perf sink…not necessarily because it impacts
straight-line code a lot, but because a class object is a nontrivial
thing to be creating three million of.

Is there a way for JRuby to output more detail on those classes? For example,
can it tell me how many of each class type have been allocated? If so, I could use
that information to figure out the critical path and try to reduce the amount of
class creation there.

Also, I’m not certain what you mean by “I’m sure it’s singleton somewhere.” I
know what a singleton is, but I don’t understand your use of it in this context.
Elaborate, please.

Actually, it may not be a singleton object/class, now that I think
about it. Someone, somewhere is doing a Class.new { } (or similar) in
critical path code.

I’ll take a shot at reducing my object allocation. More information on how many
of each type of object is being allocated would be a wonderful statistic to have.
Also, if there were a way to trace that allocation back to a particular file and
line number (for those cases where the same object is allocated in multiple
places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.

It will be very slow, but Sun/Oracle/OpenJDK has an object allocation
profiler built in. Examples:

jruby -J-Xrunhprof … # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 … # … call stacks 10 deep

Running java -Xrunhprof:help will provide full help. As I say, it will
be slow but you don’t need to run a long benchmark; you’re looking
for aggregate numbers of objects for some number of iterations, which
won’t change drastically over a long run. (There’s also a full
instrumented performance profiler in “runhprof”, but it’s dead slow,
because it instruments every method in the system, including every JDK
class.)

The object allocation profiles will be dumped to a gigantic log file
called java.hprof.txt. The primary interesting bit of information is
at the end…a list of object types allocated plus numeric references
to the stack trace where they’re allocated. Another example:

Command:

$ jruby -J-Xrunhprof:depth=20 -e "1_000.times { 'foo' + 'bar' }"

A few elements of interest:

12  0.39% 14.11%  32000 1000  32000 1000  336225  org.jruby.RubyString
13  0.39% 14.51%  32000 1000  32000 1000  336226  org.jruby.RubyString
14  0.39% 14.90%  31968  999  31968  999  336239  org.jruby.RubyString

(Most of the top 20 allocated elements are byte[], which doesn’t tell
you much)

Note that almost 3000 RubyString objects have been allocated. 1000 of
those are ‘foo’, 1000 of them are ‘bar’, and 999 of them are the
combined string. It’s not exactly accurate because the script
terminates before the final allocation gets recorded.

The last number before the class name is the allocation trace. Looking
at 336225 (search in the same file):

TRACE 336225:
org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
org.jruby.RubyObject.<init>(RubyObject.java:94)
org.jruby.RubyString.<init>(RubyString.java:396)
org.jruby.RubyString.<init>(RubyString.java:424)
org.jruby.RubyString.newStringShared(RubyString.java:522)
org.jruby.ast.executable.RuntimeCache.getString(RuntimeCache.java:105)
org.jruby.ast.executable.AbstractScript.getString0(AbstractScript.java:165)
ruby.dash_e.block_0$RUBY$file(-e:1)
ruby$dash_e$block_0$RUBY$file.call(ruby$dash_e$block_0$RUBY$file:65535)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
org.jruby.runtime.Block.yield(Block.java:130)
org.jruby.RubyFixnum.times(RubyFixnum.java:261)
org.jruby.RubyFixnum$INVOKER$i$0$0$times.call(RubyFixnum$INVOKER$i$0$0$times.gen:65535)
org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:153)
ruby.dash_e.file(-e:1)
ruby.dash_e.load(-e:Unknown line)
org.jruby.Ruby.runScript(Ruby.java:731)

This takes us all the way back to runScript in org.jruby.Ruby. The
-e:1 lines are our inline script. RubyFixnum.times is the “times”
call. And the top seven lines are the parts of JRuby called to
allocate a new literal String. Looking at the 336239 trace…

TRACE 336239:
org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
org.jruby.RubyObject.<init>(RubyObject.java:94)
org.jruby.RubyString.<init>(RubyString.java:396)
org.jruby.RubyString.newString(RubyString.java:487)
org.jruby.RubyString.op_plus(RubyString.java:1039)
org.jruby.RubyString$INVOKER$i$1$0$op_plus.call(RubyString$INVOKER$i$1$0$op_plus.gen:65535)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:167)
ruby.dash_e.block_0$RUBY$file(-e:1)
ruby$dash_e$block_0$RUBY$file.call(ruby$dash_e$block_0$RUBY$file:65535)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
org.jruby.runtime.Block.yield(Block.java:130)
org.jruby.RubyFixnum.times(RubyFixnum.java:261)
org.jruby.RubyFixnum$INVOKER$i$0$0$times.call(RubyFixnum$INVOKER$i$0$0$times.gen:65535)
org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:302)
org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:144)
org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:153)
ruby.dash_e.file(-e:1)
ruby.dash_e.load(-e:Unknown line)
org.jruby.Ruby.runScript(Ruby.java:731)

This is the String#plus call, org.jruby.RubyString.op_plus.

Incidentally, this is also where “class reification” comes in handy:

$ jruby -J-Xrunhprof:depth=20 -Xreify.classes=true -e "class Foo; end;
1_000.times { Foo.new }"

17 0.29% 15.50% 23616 984 23616 984 337103 rubyobj.Foo

TRACE 337103:
org.jruby.RubyBasicObject.<init>(RubyBasicObject.java:219)
org.jruby.RubyObject.<init>(RubyObject.java:94)
rubyobj.Foo.<init>(:Unknown line)
sun.reflect.GeneratedConstructorAccessor4.newInstance(:Unknown line)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
java.lang.reflect.Constructor.newInstance(Constructor.java:513)
java.lang.Class.newInstance0(Class.java:355)
java.lang.Class.newInstance(Class.java:308)
org.jruby.RubyClass$3.allocate(RubyClass.java:142)
org.jruby.RubyClass.allocate(RubyClass.java:223)
org.jruby.RubyClass.newInstance(RubyClass.java:809)
org.jruby.RubyClass$INVOKER$i$newInstance.call(RubyClass$INVOKER$i$newInstance.gen:65535)
org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroOrNBlock.call(JavaMethod.java:256)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:133)
ruby.dash_e.block_0$RUBY$file(-e:1)
ruby$dash_e$block_0$RUBY$file.call(ruby$dash_e$block_0$RUBY$file:65535)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:112)
org.jruby.runtime.CompiledBlock.yield(CompiledBlock.java:95)
org.jruby.runtime.Block.yield(Block.java:130)
org.jruby.RubyFixnum.times(RubyFixnum.java:261)

runhprof is a very simple way to get allocation info, and I’ve used it
many times to trim allocation overhead.

If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

  • Charlie

On Dec 14, 2011, at 2:04 AM, Charles Oliver N. wrote:

Actually, it may not be a singleton object/class, now that I think
about it. Someone, somewhere is doing a Class.new { } (or similar) in
critical path code.

Okay, so I should look for lines where I execute “Class.new {}” in my
code. Off the top of my head, it is rare that I call Class.new
directly; it’s almost always via an actual class constant name.

I’ll take a shot at reducing my object allocation. More information on how many
of each type of object is being allocated would be a wonderful statistic to have.
Also, if there were a way to trace that allocation back to a particular file and
line number (for those cases where the same object is allocated in multiple
places), that would be a killer profiling feature. <hint, hint>

Indeed it would. Luckily, the JVM already provides it.
[snip]
If you find this information useful, perhaps you can reformat it for a
JRuby wiki page :)

I find it very useful and I will add it to the wiki. Thanks for
taking the time to post this!

cr

On Wed, Dec 14, 2011 at 2:54 AM, Robert K.
[email protected] wrote:

jruby -J-Xrunhprof … # run with object allocation profiling,
tracking allocation call stacks 5 deep
jruby -J-Xrunhprof:depth=10 … # … call stacks 10 deep

That’s a full profiler. You can remedy the performance hit with option
cpu=samples significantly. The inaccuracy introduced by this is
probably negligible since hotspots will still be at the top. Results
can be analyzed with the free tool HPJMeter from HP.
http://www.javaperformancetuning.com/tools/hpjmeter/index.shtml

For CPU, yes. This is not CPU profiling, however, and runhprof does
not have a sampling mode for allocations.

VisualVM, on the other hand, does have such a mode. It’s usually my
second tool of choice for allocation analysis.

And of course you’re right…there’s many, many other tools for JVM
that can give some of the same and some different information. Most
work just dandy with JRuby.

There is a more lightweight alternative for just looking at instance
counts and sizes per class:

$ jmap -histo

Indeed. jmap is very useful for getting a “moment in time” snapshot of
in-memory objects.

I’ve covered several of these tools in my memory-related series of blog
posts:

And similar (possibly duplicate) posts on EY blog:

http://www.engineyard.com/blog/2010/monitoring-memory-with-jruby-part-1-jhat-and-visualvm/

http://www.engineyard.com/blog/2010/montoring-memory-with-jruby-part-2-eclipse-memory-analyzer/

Tip of the iceberg.

  • Charlie

For those of you following along, I was able to shave off another 50
microseconds per client. I learned a few things along the way.

  1. Most Ruby runtimes cannot optimize a call to super with implicit
    arguments very well.

e.g.

def foo(a, b, c)
  super # <- slow: the runtime must figure out how to pass (a, b, c)
  a + b + c
end

It is more efficient to pass the specific arguments that the
super-method should receive. If it should not receive any, use super().
This lets the runtime avoid the work of interrogating the current
method for its arguments and re-marshalling them for the super-method.
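To make the contrast concrete, here is a runnable sketch (class names invented for illustration):

```ruby
class Parent
  def handle(a, b, c)
    a + b + c
  end
end

class FastChild < Parent
  def handle(a, b, c)
    # explicit arguments: the runtime does not have to interrogate the
    # current frame and re-marshal (a, b, c) the way bare `super` does
    super(a, b, c)
  end
end

FastChild.new.handle(1, 2, 3)  # => 6
```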

  2. When using FFI and FFI::Struct, it is not efficient to call
    MyStruct.size on a regular basis. The FFI library has to recalculate the
    size of the structure each time and return that value. Stash the size in
    an ivar somewhere and use that instead of calling #size.
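The pattern is just compute-once/reuse; sketched here with a stand-in class so it runs without the ffi gem (MessageStruct and Decoder are hypothetical names, not rzmq_brokers code):

```ruby
# MessageStruct stands in for an FFI::Struct subclass; its .size is
# stubbed so the sketch runs without the ffi gem installed.
class MessageStruct
  def self.size
    64  # a real FFI::Struct recalculates its layout on every call
  end
end

class Decoder
  def initialize
    # stash the size once in an ivar; hot code reuses it
    @struct_size = MessageStruct.size
  end

  def frames_in(byte_count)
    byte_count / @struct_size
  end
end

Decoder.new.frames_in(640)  # => 10
```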

  3. When storing a few values, it’s sometimes cheap & easy to just use a
    Hash. But in hot code, the work to look up an element is sometimes not
    worth it. If you are only storing a few values, use an ivar.
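A before/after sketch of the same state held in a Hash versus plain ivars (the timer classes are made up for illustration):

```ruby
# Hash-backed: every call pays hash lookups in the hot path.
class SlowTimer
  def initialize
    @opts = { interval: 25, periodic: true }
  end

  def fire?(elapsed)
    elapsed >= @opts[:interval] && @opts[:periodic]
  end
end

# Ivar-backed: plain instance-variable reads, same behavior.
class FastTimer
  def initialize
    @interval = 25
    @periodic = true
  end

  def fire?(elapsed)
    elapsed >= @interval && @periodic
  end
end
```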

  4. Constants in Ruby are (unfortunately) not really constant. You can
    change the value of a constant, though the runtime will kick out a
    warning. Since constants aren’t constant, each access of a constant
    can incur a guarded lookup. The guard is there to see if the value
    referred to by the constant has changed since the last access. Again, in
    hot code this extra check can add up, particularly if you are using lots
    of constants. As before, just stash the value in an ivar where lookup is
    fast.
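A sketch of the constant-stashing trick (Codec and DELIMITER are made-up names):

```ruby
class Codec
  DELIMITER = "\r\n".freeze

  def initialize
    # pay the guarded constant lookup once, at construction time
    @delimiter = DELIMITER
  end

  def split(data)
    # the hot path reads only the ivar
    data.split(@delimiter)
  end
end

Codec.new.split("a\r\nb")  # => ["a", "b"]
```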

  5. As always, there is no faster code than “no code.” I found a few
    spots where I was recalculating the same value 3 times in a hot path.
    The calculation was fairly cheap, but when that hot path is called
    millions of times even the cheap calcs can be expensive.

@value ||= cheap_when_called_once_but_expensive_when_called_millions_of_times

That does the trick. Wow, you’ve got to love those ivars.

  6. In hot code when looking up a value in an array, it’s a tad faster to
    use #at instead of #[]. The reason is that #[] can take different
    arguments other than a simple index (like a Range). Those extra argument
    checks can add up.
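A quick illustration of the two entry points:

```ruby
ary = [10, 20, 30, 40]

ary.at(2)   # => 30, index-only fast path
ary[2]      # => 30, same result via the general entry point
ary[1, 2]   # => [20, 30], the (start, length) form #at never checks for
ary.at(-1)  # => 40, negative indices still work
```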

These are all micro-optimizations and may not even be valid on your
runtime of choice. However, if you are running time-critical code and
your lingua franca is Ruby, then some of these tricks can eke out a few
extra microseconds for you.

cr

Chuck R. [email protected] wrote:

  1. In hot code when looking up a value in an array, it’s a tad faster
    to use #at instead of #[]. The reason is that #[] can take different
    arguments other than a simple index (like a Range). Those extra
    argument checks can add up.

These are all micro-optimizations and may not even be valid on your
runtime of choice.

Exactly.

Array#at shouldn’t be faster than Array#[] on MRI 1.9+ with a single arg
(ref: insns.def).

On Wed, Dec 14, 2011 at 7:47 AM, Chuck R. [email protected]
wrote:

Okay, so I should look for lines where I execute “Class.new {}” in my code. Off
the top of my head, it is rare that I call Class.new directly; it’s almost
always via an actual class constant name.

They’ll look something like this:

4  1.57% 42.08%  159840  999  159840  999  325078  org.jruby.MetaClass
5  1.57% 43.64%  159840  999  159840  999  325074  org.jruby.RubyClass

The MetaClass is the class’s metaclass, and RubyClass is the class
itself.

  • Charlie

On Thu, Dec 15, 2011 at 5:30 PM, Eric W. [email protected]
wrote:

Array#at shouldn’t be faster than Array#[] on MRI 1.9+ with a single arg
(ref: insns.def).

Indeed. Array#[] is slower on JRuby because String#[] uses backref
($~), and since we can’t tell at compile time whether we’re calling
Array#[] or String#[], we always prepare space for backref. There’s no
#at method that uses backref, so we can elide the allocation of space
for backref.
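A tiny illustration of the backref behavior (plain Ruby, runnable anywhere):

```ruby
# String#[] with a Regexp populates the backref variable $~, so a
# polymorphic call site that might be String#[] has to prepare backref
# storage. Array#at takes only an Integer index and never touches it.
s = "hello world"
s[/w\w+/]   # => "world" (a match via String#[])
$~[0]       # => "world" (backref populated by the match)

[1, 2, 3].at(1)  # => 2, no backref involvement
```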

I plan to reduce the cost of backref allocation, but it’s not there
yet. In any case, JRuby’s quite fast either way.

  • Charlie

On Dec 14, 2011, at 7:47 AM, Chuck R. wrote:

JRuby wiki page :)

I find it very useful and I will add it to the wiki. Thanks for taking the
time to post this!

It took me a while to get around to it, but I added it to the wiki.
Check it out here:

cr