Forum: JRuby Benchmark with InvokeDynamic, help me figure out why it's slow

8c21306523b16ba5dd35c3549bf90994?d=identicon&s=25 Theo Hultberg (Guest)
on 2013-08-28 10:07
(Received via mailing list)
Hi,

I've benchmarked a pure Ruby MessagePack decoder with and without
InvokeDynamic, and with it enabled the code runs around 4x slower. Can
you
guys help me figure out why, or at least have a look and see if there's
anything useful for optimizing InvokeDynamic in the future?

https://gist.github.com/iconara/6363110

With InvokeDynamic off the pure Ruby decoder is only 2.5x slower than
native Java, and on par with the same code running in MRI 2.0. It's 4x
slower than the C extension, though.

yours,
Theo
B97f192e676f1a67e0ef18f022671406?d=identicon&s=25 Rodrigo Botafogo (Guest)
on 2013-08-28 17:43
(Received via mailing list)
Theo,

I dont know about InvokeDynamic, but with my work trying to optimize
MDArray it seems that Proc's and probably lambdas are very slow.  I did
substitute all Procs with classes with a single apply method.  This
gained
over 40% performance improvement.  I dont know if this will add anything
to MessagePack.

Cheers,


Rodrigo
8c21306523b16ba5dd35c3549bf90994?d=identicon&s=25 Theo Hultberg (Guest)
on 2013-08-29 09:44
(Received via mailing list)
Thanks Rodrigo, I guess I can try that as an optimization in itself, but
since the code runs as fast as MRI as it is and it's just InDy that's
slow
I think there's something else going on here.

T#


On Wed, Aug 28, 2013 at 5:39 PM, Rodrigo Botafogo <
8db112f19ce0a69c7e490bbf27256346?d=identicon&s=25 Tobias Pfeiffer (Guest)
on 2013-08-29 13:32
(Received via mailing list)
Just a general question, but do you let the code "warm up" - e.g. run a
few iterations before benchmarking? That's important for the JIT.
8c21306523b16ba5dd35c3549bf90994?d=identicon&s=25 Theo Hultberg (Guest)
on 2013-08-29 14:54
(Received via mailing list)
Tobias: yes, the tests ran a warmup sequence, that's not the issue (and
with one million repetitions it wouldn't explain a consistent 4x
slowdown).

T#
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles Nutter (headius)
on 2013-09-01 01:46
(Received via mailing list)
This could be a bug in indy logic. I'm going to look into it, but for
future reference...

Usually a severe perf degradation with indy on means that it's
rebinding some call sites over and over again. You can see how it is
binding (or giving up on binding) indy call sites by passing
-Xinvokedynamic.log.binding=true. There are other properties for
monitoring variable bindings, constants, etc.

Also, a --sample will often show that large numbers of method handles
(or lambda forms in u40+) are being created in this case.

I'll use these with your benchmark and report back what I find out.

- Charlie
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles Nutter (headius)
on 2013-09-01 01:55
(Received via mailing list)
So the good and bad news is that I was unable to reproduce what you
saw. Indy enabled on latest u40 build is slightly faster than indy
disabled. I've also confirmed it's actually using indy:

system ~/projects/jruby $ jruby -Xcompile.invokedynamic=false
benchmark.rb
                 user     system      total        real
msgpack      7.180000   0.090000   7.270000 (  5.947000)
                 user     system      total        real
msgpack      5.170000   0.020000   5.190000 (  5.118000)
                 user     system      total        real
msgpack      5.060000   0.020000   5.080000 (  5.018000)
                 user     system      total        real
msgpack      5.260000   0.020000   5.280000 (  5.211000)
                 user     system      total        real
msgpack      5.290000   0.020000   5.310000 (  5.190000)
                 user     system      total        real
msgpack      5.170000   0.020000   5.190000 (  5.097000)
                 user     system      total        real
msgpack      5.140000   0.020000   5.160000 (  5.073000)
                 user     system      total        real
msgpack      5.070000   0.020000   5.090000 (  5.034000)
                 user     system      total        real
msgpack      5.120000   0.010000   5.130000 (  5.072000)
                 user     system      total        real
msgpack      5.080000   0.020000   5.100000 (  5.049000)

system ~/projects/jruby $ jruby benchmark.rb
                 user     system      total        real
msgpack      8.610000   0.120000   8.730000 (  6.388000)
                 user     system      total        real
msgpack      4.670000   0.020000   4.690000 (  4.606000)
                 user     system      total        real
msgpack      4.740000   0.030000   4.770000 (  4.636000)
                 user     system      total        real
msgpack      4.770000   0.010000   4.780000 (  4.713000)
                 user     system      total        real
msgpack      4.870000   0.030000   4.900000 (  4.783000)
                 user     system      total        real
msgpack      4.900000   0.010000   4.910000 (  4.831000)
                 user     system      total        real
msgpack      4.820000   0.010000   4.830000 (  4.749000)
                 user     system      total        real
msgpack      4.790000   0.020000   4.810000 (  4.743000)
                 user     system      total        real
msgpack      4.810000   0.010000   4.820000 (  4.763000)
                 user     system      total        real
msgpack      4.910000   0.020000   4.930000 (  4.824000)

I get a small improvement by forcing everything to compile up front,
which sometimes picks up code that wouldn't normally JIT (like bare
lambdas not in a method body):

system ~/projects/jruby $ jruby -X+C benchmark.rb
                 user     system      total        real
msgpack      8.360000   0.090000   8.450000 (  6.403000)
                 user     system      total        real
msgpack      4.770000   0.020000   4.790000 (  4.678000)
                 user     system      total        real
msgpack      4.620000   0.020000   4.640000 (  4.557000)
                 user     system      total        real
msgpack      4.690000   0.030000   4.720000 (  4.605000)
                 user     system      total        real
msgpack      4.620000   0.010000   4.630000 (  4.564000)
                 user     system      total        real
msgpack      4.700000   0.010000   4.710000 (  4.595000)
                 user     system      total        real
msgpack      4.740000   0.020000   4.760000 (  4.691000)
                 user     system      total        real
msgpack      4.760000   0.020000   4.780000 (  4.697000)
                 user     system      total        real
msgpack      4.700000   0.010000   4.710000 (  4.650000)
                 user     system      total        real
msgpack      4.710000   0.010000   4.720000 (  4.649000)

Finally, I get a much larger improvement by enabling tiered
compilation at the JVM level, which is *supposed* to get turned on in
some upcoming build. Tiered compilation basically does additional
profiling after a first simple JIT to do a better job on the second
optimizing JIT.

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation
benchmark.rb
                 user     system      total        real
msgpack      6.620000   0.140000   6.760000 (  4.211000)
                 user     system      total        real
msgpack      3.570000   0.020000   3.590000 (  3.404000)
                 user     system      total        real
msgpack      3.390000   0.010000   3.400000 (  3.341000)
                 user     system      total        real
msgpack      3.520000   0.020000   3.540000 (  3.417000)
                 user     system      total        real
msgpack      3.500000   0.010000   3.510000 (  3.399000)
                 user     system      total        real
msgpack      3.410000   0.020000   3.430000 (  3.384000)
                 user     system      total        real
msgpack      3.450000   0.010000   3.460000 (  3.388000)
                 user     system      total        real
msgpack      3.430000   0.010000   3.440000 (  3.382000)
                 user     system      total        real
msgpack      3.380000   0.010000   3.390000 (  3.329000)
                 user     system      total        real
msgpack      3.440000   0.010000   3.450000 (  3.379000

Given the same flags, this is less then 2x slower than the JRuby ext
version:

system ~/projects/jruby $ jruby -X+C -J-XX:+TieredCompilation
benchmark.rb
                 user     system      total        real
msgpack      5.030000   0.120000   5.150000 (  2.475000)
                 user     system      total        real
msgpack      1.950000   0.010000   1.960000 (  1.886000)
                 user     system      total        real
msgpack      1.920000   0.010000   1.930000 (  1.845000)
                 user     system      total        real
msgpack      1.930000   0.010000   1.940000 (  1.846000)
                 user     system      total        real
msgpack      1.900000   0.010000   1.910000 (  1.844000)
                 user     system      total        real
msgpack      1.930000   0.020000   1.950000 (  1.864000)
                 user     system      total        real
msgpack      1.890000   0.010000   1.900000 (  1.833000)
                 user     system      total        real
msgpack      1.890000   0.010000   1.900000 (  1.838000)
                 user     system      total        real
msgpack      1.910000   0.010000   1.920000 (  1.846000)
                 user     system      total        real
msgpack      1.920000   0.010000   1.930000 (  1.867000)

My Java version:

java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b40)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b55, mixed mode)

And of course I'm running on master.

- Charlie

On Sat, Aug 31, 2013 at 6:44 PM, Charles Oliver Nutter
8c21306523b16ba5dd35c3549bf90994?d=identicon&s=25 Theo Hultberg (Guest)
on 2013-09-01 12:41
(Received via mailing list)
Thanks for taking your time looking into this Charlie. I must be doing
something wrong, because even after updating to u40-b40, running with
jruby-head (from RVM, SHA 9422f547) and running with -X+C
-J-XX:+TieredCompilation InDy is an order or magnitude slower for me.

When I add -Xinvokedynamic.log.binding=true it logs messages from
InvocationLinker for a second or so, then it's quiet during the main run
of
the benchmark and prints a few lines at the end. The output says it's
binding some of the methods in my code.

I've updated the gist with the numbers (
https://gist.github.com/iconara/6363110#file-java-...)

T#


On Sun, Sep 1, 2013 at 1:54 AM, Charles Oliver Nutter
8c21306523b16ba5dd35c3549bf90994?d=identicon&s=25 Theo Hultberg (Guest)
on 2013-09-01 13:29
(Received via mailing list)
Yes, I was doing something wrong. To help decreasing startup time I
had  -J-XX:TieredStopAtLevel=1 in my JRUBY_OPTS.

Really, really stupid of me to run benchmarks without clearing
JRUBY_OPTS.

The good news is that my pure Ruby MessagePack decoder runs less than
30%
slower in JRuby as the native C library does in MRI. I've updated the
gist
with the latest numbers.

Thanks for helping out, and sorry for wasting your time.

T#
F1d37642fdaa1662ff46e4c65731e9ab?d=identicon&s=25 Charles Nutter (headius)
on 2013-09-02 18:30
(Received via mailing list)
Ahh well good to know that tier 1 does not optimize invokedynamic very
well :-)

- Charlie
Please log in before posting. Registration is free and takes only a minute.
Existing account

NEW: Do you have a Google/GoogleMail, Yahoo or Facebook account? No registration required!
Log in with Google account | Log in with Yahoo account | Log in with Facebook account
No account? Register here.