Benchmark with InvokeDynamic, help me figure out why it's slow

Hi,

I’ve benchmarked a pure Ruby MessagePack decoder with and without
InvokeDynamic, and with it enabled the code runs around 4x slower. Can
you guys help me figure out why, or at least have a look and see if
there’s anything useful for optimizing InvokeDynamic in the future?

With InvokeDynamic off the pure Ruby decoder is only 2.5x slower than
native Java, and on par with the same code running in MRI 2.0. It’s 4x
slower than the C extension, though.

yours,
Theo

Theo,

I don’t know about InvokeDynamic, but from my work trying to optimize
MDArray it seems that Procs (and probably lambdas) are very slow. I
substituted every Proc with a class that has a single apply method,
which gained over a 40% performance improvement. I don’t know if this
will add anything to MessagePack.
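
Roughly, the substitution looks like this (the handler classes and
dispatch table below are an invented illustration, not real MDArray or
MessagePack code):

```ruby
# Before: a dispatch table of Procs, invoked via Proc#call
PROC_HANDLERS = {
  0xc0 => proc { nil },   # nil marker
  0xc3 => proc { true },  # true marker
}

# After: one tiny class per handler, each with a single apply method.
# Call sites then go through ordinary method dispatch instead of Proc#call.
class NilHandler
  def apply; nil; end
end

class TrueHandler
  def apply; true; end
end

CLASS_HANDLERS = { 0xc0 => NilHandler.new, 0xc3 => TrueHandler.new }

PROC_HANDLERS[0xc3].call    # => true
CLASS_HANDLERS[0xc3].apply  # => true
```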

Cheers,

Rodrigo

Just a general question, but do you let the code “warm up” - e.g. run a
few iterations before benchmarking? That’s important for the JIT.
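
Something like this before the timed runs (the decode method here is
just a stand-in for the real decoder):

```ruby
require 'benchmark'

# Stand-in for the real MessagePack decoding work
def decode(byte)
  byte & 0x7f
end

# Warm-up: run the hot path before timing so the JIT has a chance to
# compile it; these iterations are discarded.
10_000.times { decode(0xc3) }

# Timed run
Benchmark.bm do |bm|
  bm.report('msgpack') { 1_000_000.times { decode(0xc3) } }
end
```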

Thanks Rodrigo, I guess I can try that as an optimization in itself,
but since the code already runs as fast as MRI and it’s only InDy
that’s slow, I think there’s something else going on here.

T#

On Wed, Aug 28, 2013 at 5:39 PM, Rodrigo B. wrote:

Tobias: yes, the tests ran a warmup sequence, that’s not the issue (and
with one million repetitions it wouldn’t explain a consistent 4x
slowdown).

T#

This could be a bug in indy logic. I’m going to look into it, but for
future reference…

Usually a severe perf degradation with indy on means that it’s
rebinding some call sites over and over again. You can see how it is
binding (or giving up on binding) indy call sites by passing
-Xinvokedynamic.log.binding=true. There are other properties for
monitoring variable bindings, constants, etc.

Also, a --sample run will often show that large numbers of method
handles (or lambda forms in u40+) are being created in this case.

I’ll use these with your benchmark and report back what I find out.

  • Charlie

So the good and bad news is that I was unable to reproduce what you
saw. Indy enabled on latest u40 build is slightly faster than indy
disabled. I’ve also confirmed it’s actually using indy:

system ~/projects/jruby $ jruby -Xcompile.invokedynamic=false benchmark.rb
user system total real
msgpack 7.180000 0.090000 7.270000 ( 5.947000)
user system total real
msgpack 5.170000 0.020000 5.190000 ( 5.118000)
user system total real
msgpack 5.060000 0.020000 5.080000 ( 5.018000)
user system total real
msgpack 5.260000 0.020000 5.280000 ( 5.211000)
user system total real
msgpack 5.290000 0.020000 5.310000 ( 5.190000)
user system total real
msgpack 5.170000 0.020000 5.190000 ( 5.097000)
user system total real
msgpack 5.140000 0.020000 5.160000 ( 5.073000)
user system total real
msgpack 5.070000 0.020000 5.090000 ( 5.034000)
user system total real
msgpack 5.120000 0.010000 5.130000 ( 5.072000)
user system total real
msgpack 5.080000 0.020000 5.100000 ( 5.049000)

system ~/projects/jruby $ jruby benchmark.rb
user system total real
msgpack 8.610000 0.120000 8.730000 ( 6.388000)
user system total real
msgpack 4.670000 0.020000 4.690000 ( 4.606000)
user system total real
msgpack 4.740000 0.030000 4.770000 ( 4.636000)
user system total real
msgpack 4.770000 0.010000 4.780000 ( 4.713000)
user system total real
msgpack 4.870000 0.030000 4.900000 ( 4.783000)
user system total real
msgpack 4.900000 0.010000 4.910000 ( 4.831000)
user system total real
msgpack 4.820000 0.010000 4.830000 ( 4.749000)
user system total real
msgpack 4.790000 0.020000 4.810000 ( 4.743000)
user system total real
msgpack 4.810000 0.010000 4.820000 ( 4.763000)
user system total real
msgpack 4.910000 0.020000 4.930000 ( 4.824000)

I get a small improvement by forcing everything to compile up front,
which sometimes picks up code that wouldn’t normally JIT (like bare
lambdas not in a method body):

system ~/projects/jruby $ jruby -X+C benchmark.rb
user system total real
msgpack 8.360000 0.090000 8.450000 ( 6.403000)
user system total real
msgpack 4.770000 0.020000 4.790000 ( 4.678000)
user system total real
msgpack 4.620000 0.020000 4.640000 ( 4.557000)
user system total real
msgpack 4.690000 0.030000 4.720000 ( 4.605000)
user system total real
msgpack 4.620000 0.010000 4.630000 ( 4.564000)
user system total real
msgpack 4.700000 0.010000 4.710000 ( 4.595000)
user system total real
msgpack 4.740000 0.020000 4.760000 ( 4.691000)
user system total real
msgpack 4.760000 0.020000 4.780000 ( 4.697000)
user system total real
msgpack 4.700000 0.010000 4.710000 ( 4.650000)
user system total real
msgpack 4.710000 0.010000 4.720000 ( 4.649000)
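
To illustrate what I mean by bare lambdas not in a method body (an
invented example, not Theo’s actual code): code at the script top level
isn’t inside any method, so the JIT may skip it unless -X+C forces
compilation up front.

```ruby
# Top-level lambda -- lives outside any method body, so it may not get
# JIT-compiled in a normal run:
mask_byte = ->(b) { b & 0x7f }

# Equivalent logic wrapped in a method, which the JIT handles normally:
def mask_byte_m(b)
  b & 0x7f
end

mask_byte.call(0xff)  # => 127
mask_byte_m(0xff)     # => 127
```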

Finally, I get a much larger improvement by enabling tiered
compilation at the JVM level, which is supposed to get turned on in
some upcoming build. Tiered compilation basically does additional
profiling after a first simple JIT to do a better job on the second
optimizing JIT.

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation benchmark.rb
user system total real
msgpack 6.620000 0.140000 6.760000 ( 4.211000)
user system total real
msgpack 3.570000 0.020000 3.590000 ( 3.404000)
user system total real
msgpack 3.390000 0.010000 3.400000 ( 3.341000)
user system total real
msgpack 3.520000 0.020000 3.540000 ( 3.417000)
user system total real
msgpack 3.500000 0.010000 3.510000 ( 3.399000)
user system total real
msgpack 3.410000 0.020000 3.430000 ( 3.384000)
user system total real
msgpack 3.450000 0.010000 3.460000 ( 3.388000)
user system total real
msgpack 3.430000 0.010000 3.440000 ( 3.382000)
user system total real
msgpack 3.380000 0.010000 3.390000 ( 3.329000)
user system total real
msgpack 3.440000 0.010000 3.450000 ( 3.379000)

Given the same flags, this is less than 2x slower than the JRuby ext
version:

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation benchmark.rb
user system total real
msgpack 5.030000 0.120000 5.150000 ( 2.475000)
user system total real
msgpack 1.950000 0.010000 1.960000 ( 1.886000)
user system total real
msgpack 1.920000 0.010000 1.930000 ( 1.845000)
user system total real
msgpack 1.930000 0.010000 1.940000 ( 1.846000)
user system total real
msgpack 1.900000 0.010000 1.910000 ( 1.844000)
user system total real
msgpack 1.930000 0.020000 1.950000 ( 1.864000)
user system total real
msgpack 1.890000 0.010000 1.900000 ( 1.833000)
user system total real
msgpack 1.890000 0.010000 1.900000 ( 1.838000)
user system total real
msgpack 1.910000 0.010000 1.920000 ( 1.846000)
user system total real
msgpack 1.920000 0.010000 1.930000 ( 1.867000)

My Java version:

java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b40)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b55, mixed mode)

And of course I’m running on master.

  • Charlie

On Sat, Aug 31, 2013 at 6:44 PM, Charles Oliver N. wrote:

Yes, I was doing something wrong. To help decrease startup time I had
-J-XX:TieredStopAtLevel=1 in my JRUBY_OPTS.

Really, really stupid of me to run benchmarks without clearing
JRUBY_OPTS.

The good news is that my pure Ruby MessagePack decoder runs less than
30% slower in JRuby than the native C library does in MRI. I’ve
updated the gist with the latest numbers.

Thanks for helping out, and sorry for wasting your time.

T#

Thanks for taking the time to look into this, Charlie. I must be doing
something wrong, because even after updating to u40-b40, running
jruby-head (from RVM, SHA 9422f547) with -X+C
-J-XX:+TieredCompilation, InDy is an order of magnitude slower for me.

When I add -Xinvokedynamic.log.binding=true it logs messages from
InvocationLinker for a second or so, then it’s quiet during the main
run of the benchmark and prints a few lines at the end. The output
says it’s binding some of the methods in my code.

I’ve updated the gist with the numbers (the “Pure Ruby MessagePack
decoding speed” gist on GitHub).

T#

On Sun, Sep 1, 2013 at 1:54 AM, Charles Oliver N. wrote:

Ahh well, good to know that tier 1 does not optimize invokedynamic
very well :)

  • Charlie