Benchmark with InvokeDynamic, help me figure out why it's slow

Hi,

I’ve benchmarked a pure Ruby MessagePack decoder with and without
InvokeDynamic, and with it enabled the code runs around 4x slower. Can
you guys help me figure out why, or at least have a look and see if
there’s anything useful for optimizing InvokeDynamic in the future?

With InvokeDynamic off the pure Ruby decoder is only 2.5x slower than
native Java, and on par with the same code running in MRI 2.0. It’s 4x
slower than the C extension, though.

yours,
Theo

Theo,

I don’t know about InvokeDynamic, but from my work trying to optimize
MDArray it seems that Procs (and probably lambdas) are very slow. I
substituted every Proc with a class that has a single apply method,
which gained over a 40% performance improvement. I don’t know if this
will add anything to MessagePack.
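
Roughly, the substitution looks like this (the handler classes and
dispatch table below are an invented illustration, not real MDArray or
MessagePack code):

```ruby
# Before: a dispatch table of Procs, invoked via Proc#call
PROC_HANDLERS = {
  0xc0 => proc { nil },   # nil marker
  0xc3 => proc { true },  # true marker
}

# After: one tiny class per handler, each with a single apply method.
# Call sites then go through ordinary method dispatch instead of Proc#call.
class NilHandler
  def apply; nil; end
end

class TrueHandler
  def apply; true; end
end

CLASS_HANDLERS = { 0xc0 => NilHandler.new, 0xc3 => TrueHandler.new }

PROC_HANDLERS[0xc3].call    # => true
CLASS_HANDLERS[0xc3].apply  # => true
```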

Cheers,

Rodrigo

Just a general question, but do you let the code “warm up” - e.g. run a
few iterations before benchmarking? That’s important for the JIT.
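
Something like this before the timed runs (the decode method here is
just a stand-in for the real decoder):

```ruby
require 'benchmark'

# Stand-in for the real MessagePack decoding work
def decode(byte)
  byte & 0x7f
end

# Warm-up: run the hot path before timing so the JIT has a chance to
# compile it; these iterations are discarded.
10_000.times { decode(0xc3) }

# Timed run
Benchmark.bm do |bm|
  bm.report('msgpack') { 1_000_000.times { decode(0xc3) } }
end
```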

Thanks Rodrigo, I guess I can try that as an optimization in itself,
but since the code already runs as fast as MRI and it’s only InDy
that’s slow, I think there’s something else going on here.

T#

On Wed, Aug 28, 2013 at 5:39 PM, Rodrigo B. wrote:

Tobias: yes, the tests ran a warmup sequence, that’s not the issue (and
with one million repetitions it wouldn’t explain a consistent 4x
slowdown).

T#

This could be a bug in indy logic. I’m going to look into it, but for
future reference…

Usually a severe perf degradation with indy on means that it’s
rebinding some call sites over and over again. You can see how it is
binding (or giving up on binding) indy call sites by passing
-Xinvokedynamic.log.binding=true. There are other properties for
monitoring variable bindings, constants, etc.

Also, a --sample run will often show that large numbers of method
handles (or lambda forms in u40+) are being created in this case.

I’ll use these with your benchmark and report back what I find out.

  • Charlie

So the good and bad news is that I was unable to reproduce what you
saw. Indy enabled on latest u40 build is slightly faster than indy
disabled. I’ve also confirmed it’s actually using indy:

system ~/projects/jruby $ jruby -Xcompile.invokedynamic=false benchmark.rb
user system total real
msgpack 7.180000 0.090000 7.270000 ( 5.947000)
user system total real
msgpack 5.170000 0.020000 5.190000 ( 5.118000)
user system total real
msgpack 5.060000 0.020000 5.080000 ( 5.018000)
user system total real
msgpack 5.260000 0.020000 5.280000 ( 5.211000)
user system total real
msgpack 5.290000 0.020000 5.310000 ( 5.190000)
user system total real
msgpack 5.170000 0.020000 5.190000 ( 5.097000)
user system total real
msgpack 5.140000 0.020000 5.160000 ( 5.073000)
user system total real
msgpack 5.070000 0.020000 5.090000 ( 5.034000)
user system total real
msgpack 5.120000 0.010000 5.130000 ( 5.072000)
user system total real
msgpack 5.080000 0.020000 5.100000 ( 5.049000)

system ~/projects/jruby $ jruby benchmark.rb
user system total real
msgpack 8.610000 0.120000 8.730000 ( 6.388000)
user system total real
msgpack 4.670000 0.020000 4.690000 ( 4.606000)
user system total real
msgpack 4.740000 0.030000 4.770000 ( 4.636000)
user system total real
msgpack 4.770000 0.010000 4.780000 ( 4.713000)
user system total real
msgpack 4.870000 0.030000 4.900000 ( 4.783000)
user system total real
msgpack 4.900000 0.010000 4.910000 ( 4.831000)
user system total real
msgpack 4.820000 0.010000 4.830000 ( 4.749000)
user system total real
msgpack 4.790000 0.020000 4.810000 ( 4.743000)
user system total real
msgpack 4.810000 0.010000 4.820000 ( 4.763000)
user system total real
msgpack 4.910000 0.020000 4.930000 ( 4.824000)

I get a small improvement by forcing everything to compile up front,
which sometimes picks up code that wouldn’t normally JIT (like bare
lambdas not in a method body):

system ~/projects/jruby $ jruby -X+C benchmark.rb
user system total real
msgpack 8.360000 0.090000 8.450000 ( 6.403000)
user system total real
msgpack 4.770000 0.020000 4.790000 ( 4.678000)
user system total real
msgpack 4.620000 0.020000 4.640000 ( 4.557000)
user system total real
msgpack 4.690000 0.030000 4.720000 ( 4.605000)
user system total real
msgpack 4.620000 0.010000 4.630000 ( 4.564000)
user system total real
msgpack 4.700000 0.010000 4.710000 ( 4.595000)
user system total real
msgpack 4.740000 0.020000 4.760000 ( 4.691000)
user system total real
msgpack 4.760000 0.020000 4.780000 ( 4.697000)
user system total real
msgpack 4.700000 0.010000 4.710000 ( 4.650000)
user system total real
msgpack 4.710000 0.010000 4.720000 ( 4.649000)
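
To illustrate what I mean by bare lambdas not in a method body (an
invented example, not Theo’s actual code): code at the script top level
isn’t inside any method, so the JIT may skip it unless -X+C forces
compilation up front.

```ruby
# Top-level lambda -- lives outside any method body, so it may not get
# JIT-compiled in a normal run:
mask_byte = ->(b) { b & 0x7f }

# Equivalent logic wrapped in a method, which the JIT handles normally:
def mask_byte_m(b)
  b & 0x7f
end

mask_byte.call(0xff)  # => 127
mask_byte_m(0xff)     # => 127
```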

Finally, I get a much larger improvement by enabling tiered
compilation at the JVM level, which is supposed to get turned on in
some upcoming build. Tiered compilation basically does additional
profiling after a first simple JIT to do a better job on the second
optimizing JIT.

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation benchmark.rb
user system total real
msgpack 6.620000 0.140000 6.760000 ( 4.211000)
user system total real
msgpack 3.570000 0.020000 3.590000 ( 3.404000)
user system total real
msgpack 3.390000 0.010000 3.400000 ( 3.341000)
user system total real
msgpack 3.520000 0.020000 3.540000 ( 3.417000)
user system total real
msgpack 3.500000 0.010000 3.510000 ( 3.399000)
user system total real
msgpack 3.410000 0.020000 3.430000 ( 3.384000)
user system total real
msgpack 3.450000 0.010000 3.460000 ( 3.388000)
user system total real
msgpack 3.430000 0.010000 3.440000 ( 3.382000)
user system total real
msgpack 3.380000 0.010000 3.390000 ( 3.329000)
user system total real
msgpack 3.440000 0.010000 3.450000 ( 3.379000)

Given the same flags, this is less than 2x slower than the JRuby ext
version:

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation benchmark.rb
user system total real
msgpack 5.030000 0.120000 5.150000 ( 2.475000)
user system total real
msgpack 1.950000 0.010000 1.960000 ( 1.886000)
user system total real
msgpack 1.920000 0.010000 1.930000 ( 1.845000)
user system total real
msgpack 1.930000 0.010000 1.940000 ( 1.846000)
user system total real
msgpack 1.900000 0.010000 1.910000 ( 1.844000)
user system total real
msgpack 1.930000 0.020000 1.950000 ( 1.864000)
user system total real
msgpack 1.890000 0.010000 1.900000 ( 1.833000)
user system total real
msgpack 1.890000 0.010000 1.900000 ( 1.838000)
user system total real
msgpack 1.910000 0.010000 1.920000 ( 1.846000)
user system total real
msgpack 1.920000 0.010000 1.930000 ( 1.867000)

My Java version:

java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b40)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b55, mixed mode)

And of course I’m running on master.

  • Charlie

On Sat, Aug 31, 2013 at 6:44 PM, Charles Oliver N. wrote:

Yes, I was doing something wrong. To help decrease startup time I had
-J-XX:TieredStopAtLevel=1 in my JRUBY_OPTS.

Really, really stupid of me to run benchmarks without clearing
JRUBY_OPTS.

The good news is that my pure Ruby MessagePack decoder runs less than
30% slower in JRuby than the native C library does in MRI. I’ve
updated the gist with the latest numbers.

Thanks for helping out, and sorry for wasting your time.

T#

Thanks for taking the time to look into this, Charlie. I must be doing
something wrong, because even after updating to u40-b40, running
jruby-head (from RVM, SHA 9422f547) with -X+C
-J-XX:+TieredCompilation, InDy is an order of magnitude slower for me.

When I add -Xinvokedynamic.log.binding=true it logs messages from
InvocationLinker for a second or so, then it’s quiet during the main
run of the benchmark and prints a few lines at the end. The output
says it’s binding some of the methods in my code.

I’ve updated the gist with the numbers (the “Pure Ruby MessagePack
decoding speed” gist on GitHub).

T#

On Sun, Sep 1, 2013 at 1:54 AM, Charles Oliver N. wrote:

Ahh well, good to know that tier 1 does not optimize invokedynamic
very well :)

  • Charlie