Huge performance gap

2006/7/2, M. Edward (Ed) Borasky [email protected]:

disposal. :-) Real machines are pretty smart too, at least the ones from
Intel are.

True, they do quite a lot of smart things nowadays. I feel however
that optimizing on a higher level of abstraction can yield better
improvements (i.e. removing an operation from a loop vs. just making
it as fast as possible).

The point of my comment was the emphasis on statistical
properties of applications. Since this is the area I’ve spent quite a
bit of time in, it’s a more natural approach to me than, say, the
niceties of discrete math required to design an optimizing compiler or
interpreter.

Well, VMs also use statistical data - but that’s derived from a
different set of data points. :-)

In the end, most of the “interesting” discrete math problems in
optimization are either totally unsolvable or NP-complete, and you end
up making statistical / probabilistic compromises anyhow. You end up
solving problems you can solve for people who behave reasonably
rationally, and you try to design your hardware, OS, compilers,
interpreters and languages so rational behavior is rewarded with
satisfactory performance, not necessarily optimal performance. And you
try to design so that irrational behavior is detected and prevented from
injuring the rational people.

Agree.

Don’t get me wrong, the Sun Intel x86 JVM is a marvelous piece of
software engineering. Considering how many person-years of tweaking it’s
had, that’s not surprising. But the original goal of Java and the
reason for using a VM was “write once, run anywhere”. “Anywhere” no
longer includes the Alpha, and may never have included MIPS or
HP PA-RISC. IIRC “anywhere” no longer includes MacOS. And since I’ve
never tested it, I don’t know for a fact that the Solaris/SPARC version
of the JVM is as highly tuned as the Intel one.

I once had a link to an article from Sun development where they
admitted that their Solaris JVM compared poorly with the Windows
version… Unfortunately I cannot dig it up at the moment.

To bring this back to Ruby, my recommendations stand:

We’re probably less far away from each other than it seemed:

  1. Focus on building a smart(er) interpreter rather than an extra
    virtual machine layer.

I don’t care what it’s called or whether it uses bytecode or what not.
My basic point was that a runtime environment (aka VM aka interpreter)
is a good architecture because it provides better options for runtime
optimization.

  2. Focus optimizations on the Intel x86 and x86-64 architectures for the
    “community” projects. Leverage off of GCC for all platforms; i.e.,
    don’t use Microsoft’s compilers on Windows.

I can’t comment on MS compiler vs. GCC - all I’ve heard in the past is
that some compilers yield better performance characteristics than
others so the platform’s native compiler seems to have an edge there.

And don’t be afraid of a
little assembler code. It works for Linux, it works for ATLAS
(Automatically Tuned Linear Algebra Subroutines) and I suspect there’s
some in the Sun JVM.

Yes.

  3. Focus on Windows, Linux and MacOS for complete Ruby environments for
    the “community” projects.

Sounds reasonable. For more server oriented apps Solaris might be an
option, too. But I have the feeling that it’s on the decline…

Kind regards

robert

On Mon, Jul 03, 2006 at 10:31:55AM +0900, Francis C. wrote:

Ah … now searching is something we can optimize!

Searching can be improved but even so, it’s a lot of work to do at runtime.
Languages that treat method dispatch as lookups into indexed tables have a
big edge. Even Python does this.

Lisp (CLOS) has an even more complicated method dispatch than Ruby,
since it may have to search up the parent classes of all parameters
(multi dispatch) and MI is allowed. History shows this type of method
dispatch can be highly optimized and be made very performant.

It sure took some time for Lisp to reach this stage though.

-Jürgen

On Mon, Jul 03, 2006 at 04:57:53AM +0900, Robert M. wrote:

What tools exist for profiling Ruby?

You definitely want ruby-prof
(http://rubyforge.org/projects/ruby-prof/),
but this has little to do with Ruby’s performance or how to make its
implementation faster in general…
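For a quick look without installing anything, you can even hand-roll a crude call counter with the stdlib `set_trace_func` hook. This is only a toy sketch (ruby-prof does this far more efficiently, in C, with real timings), counting recursive calls to an illustrative `fib`:

```ruby
# Crude call-count profiler using the stdlib set_trace_func hook.
counts = Hash.new(0)

def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

set_trace_func proc { |event, _file, _line, id, _binding, _classname|
  counts[id] += 1 if event == 'call'  # count Ruby-level method calls only
}
fib(10)
set_trace_func nil  # stop tracing

counts.sort_by { |_, n| -n }.each { |name, n| puts "#{name}: #{n}" }
```

The hook slows everything down massively, which is exactly why a serious profiler like ruby-prof pushes this work into C.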

On Tue, Jul 04, 2006 at 12:53:09AM +0900, Juergen S. wrote:

(multi dispatch) and MI is allowed. History shows this type of method
dispatch can be highly optimized and be made very performant.

Ruby’s method cache hit rate was over 98% IIRC, so full searches are
relatively rare, but… method dispatching is still fairly slow. I’d bet
YARV will use inline method caches at some point (it already had ICs for
constant lookup last time I read the sources) ;-)
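For anyone following along, the idea behind an inline method cache is easy to sketch in plain Ruby (toy code, nothing like how YARV actually implements it): each call site remembers the last receiver class and the method it resolved to, and a cheap class check decides between the cached method and a full lookup.

```ruby
# Toy monomorphic inline cache: each call site remembers the last
# receiver class and the resolved method; a cheap class check decides
# between the cached method (hit) and a full lookup (miss).
class InlineCache
  def initialize(name)
    @name = name
    @klass = nil
    @method = nil
  end

  def call(receiver, *args)
    unless receiver.class.equal?(@klass)       # guard failed: cache miss
      @klass  = receiver.class
      @method = @klass.instance_method(@name)  # full method search
    end
    @method.bind(receiver).call(*args)         # dispatch via cached method
  end
end

upcase_site = InlineCache.new(:upcase)
upcase_site.call("monomorphic")  # miss: fills the cache
upcase_site.call("hit")          # hit: only a class check
```

As long as a call site keeps seeing the same receiver class (the 98% case above), dispatch is just one pointer comparison plus the call.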

Juergen S. wrote:

Lisp (CLOS) has an even more complicated method dispatch than Ruby,
since it may have to search up the parent classes of all parameters
(multi dispatch) and MI is allowed. History shows this type of method
dispatch can be highly optimized and be made very performant.

It sure took some time for Lisp to reach this stage though.

And most Lisp processors have the ability to compile to machine code.


M. Edward (Ed) Borasky

Hello M.,

MEEB> Juergen S. wrote:

Lisp (CLOS) has an even more complicated method dispatch than Ruby,
since it may have to search up the parent classes of all parameters
(multi dispatch) and MI is allowed. History shows this type of method
dispatch can be highly optimized and be made very performant.

It sure took some time for Lisp to reach this stage though.

MEEB> And most Lisp processors have the ability to compile to machine code.

Well, but Lisp-compiled machine code crashes on the simplest error
because, to get speed, everything is cast to void*.

It also needs, at least in the implementations I know, a sealed
universe to optimize the method dispatching. I also don’t know of any
transparent JIT compiler that cleanly handles eval and recompiles the
necessary parts on its own (but my last look at Lisp was 2003). AFAIK
all Lisps need manual help from the programmer here.

So we are not really talking about the same thing here.

On 7/3/06, Juergen S. [email protected] wrote:

Lisp (CLOS) has an even more complicated method dispatch than Ruby,
since it may have to search up the parent classes of all parameters
(multi dispatch) and MI is allowed. History shows this type of method
dispatch can be highly optimized and be made very performant.

I think about this quite a lot. I’ve never implemented Lisp or anything
Lisp-like, so I’m not really qualified to speak, but I have implemented
some other lambda-based languages (ML, parts of Haskell). One of the
huge issues is tail-recursion elimination. The runtime environment of
these languages is fundamentally about mapping functionality over
lists, and if you can do that without building a stack frame for each
function application, you win big. This doesn’t apply at all to Ruby,
which is an Algol derivative.

I’ve avoided asking this question because I assume that all the rest of
you have already clawed through the Ruby interpreter line by line,
squeezing cycles out of it. I have an unproven hunch that it spends
much of its time doing hash lookups, in place of the work that other
languages do by de-indexing function-pointer tables. If so, that’s the
thing to optimize. Any comments?
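The hunch is easy to illustrate in miniature (a toy sketch, nothing like the interpreter’s real data structures): dispatch by name goes through a hash probe on every call, while a vtable-style design resolves each name to a slot number once, after which every call is a plain array index.

```ruby
# Toy contrast between name-keyed dispatch (hash probe per call) and
# slot-indexed dispatch (array index per call, names resolved up front).
methods_by_name = { succ: ->(x) { x + 1 }, pred: ->(x) { x - 1 } }

slot_of = { succ: 0, pred: 1 }          # resolved once, "at compile time"
table   = [->(x) { x + 1 }, ->(x) { x - 1 }]

methods_by_name[:succ].call(41)  # hash the symbol, probe, then call
table[slot_of[:succ]].call(41)   # in generated code, slot_of[:succ]
                                 # would already be the constant 0
```

The catch for Ruby is that open classes mean the table layout can change at any time, which is exactly what caching schemes have to guard against.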

Charles O Nutter wrote:

On 7/1/06, [email protected] [email protected] wrote:

how is this performance data available significantly different
from that made transparent by gcc/gprof/gdb/dmalloc/etc -
gcc can encode plenty of information for tools like these
to dump reams of info at runtime.

Ahhh, venturing into a domain I love talking about.

Runtime-modification of code is exactly what sets the JVM apart from static
compilation and optimization in something like GCC.

An even more interesting example of what JIT compilers
can do is “dynamic deoptimization” [1,2].

A JIT compiler can optimistically perform some very aggressive
optimizations and undo them later on if something (like a new
class being loaded, or a dynamic class modified) would make
these optimizations no longer valid.

One example is inlining of virtual methods or methods from
dynamic classes.

A VM can optimistically treat virtual function calls as normal
calls and even inline them – and later if it notices that
someone did create a derived class or modify the dynamic
class it could de-optimize those function calls when the
class is modified.

I think this could be extremely interesting in as dynamic
a language as Ruby.
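A toy model of the invalidation machinery might look like this (all names here are hypothetical; a real VM does this inside the JIT, not in user code): compiled code is tagged with the global state version it was built against, any class modification bumps the version, and a stale stub “deoptimizes” by recompiling.

```ruby
# Toy invalidation scheme for speculative optimization: an optimized
# stub stays valid only while a global state version is unchanged;
# modifying a class bumps the version and forces recompilation.
class SpeculativeStub
  @@state_version = 0

  def self.world_changed!   # call when a class is reopened or modified
    @@state_version += 1
  end

  def initialize(&compile)
    @compile = compile      # produces an optimized callable
    @version = -1           # forces compilation on first call
  end

  def call(*args)
    if @version != @@state_version    # guard failed: deoptimize
      @code    = @compile.call        # recompile for the current world
      @version = @@state_version
    end
    @code.call(*args)
  end
end
```

A stub built this way compiles once, runs the fast path until `world_changed!` is called, then transparently recompiles on the next call.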

Ron

[1] http://research.sun.com/self/papers/dynamic-deoptimization.html
[2]
http://portal.acm.org/citation.cfm?id=143114&dl=ACM&coll=portal&CFID=15151515&CFTOKEN=6184618

2006/7/5, Ron M [email protected]:

An even more interesting example of what JIT compilers
can do is “dynamic deoptimization” [1,2].

I think this could be extremely interesting in as dynamic
a language as Ruby.

Although I too find this very amazing, I’m not so sure about the
“extreme” in the case of Ruby. Since Ruby is a whole lot more dynamic
than Java, too much of this optimization/deoptimization might occur and
thus degrade performance. It’s a tradeoff - as always.

Kind regards

robert

Bill K. wrote:

But yes, it’s harder to make a language like Ruby, which is highly
dynamic at runtime, fast like C++ and Java, which are primarily
statically compiled. The Smalltalk folks have reportedly done pretty
well though, so there exists the possibility that Ruby may get
substantially faster in the future. YARV is already making some
headway.

My question is what does 1.9 exactly do with its “Inline (Method)
cache”[1]? Is there room for more improvement with a better JIT
compiler? How much would this help scripts?
Thanks!
-R
[1] http://www.atdot.net/yarv/rc2006_sasada_yarv_on_rails.pdf

Robert K. wrote:

2006/7/5, Ron M [email protected]:

An even more interesting example of what JIT compilers
can do is “dynamic deoptimization” [1,2].

Although I too find this very amazing

IBM has an even better description of the technique here
where they show nice examples of dynamic deoptimization
in Java with some benchmarking.
http://www-128.ibm.com/developerworks/library/j-jtp12214/
They also have a great conclusion:
“So, what can we conclude from this “benchmark?” Virtually
nothing, except that benchmarking dynamically compiled
languages is much more subtle than you might think.”

I’m not so sure about the
“extreme” in the case of Ruby. Since Ruby is a whole lot more dynamic
than Java, too much of this optimization/deoptimization might occur
and thus degrade performance.

Well, it’s great for long-running programs (Rails) and bad for
short-lived ones. For long-running programs you’ll end up in a
steady state where all the methods nobody ever changes can be inlined
and all the methods someone overrides in subclasses aren’t. I have web
servers that have been running for 3 years. Surely most classes that
were going to get subclassed or dynamically modified would have done
so in the first few months.

For example, if someone never touches 90% of the methods
in String or Array in a Rails application, it would help
quite a bit to apply every optimization technique known
including inlining to those.

It’s a tradeoff - as always.

In this case I think it could be made to always be a win.

For example - don’t apply the aggressive optimizations
until after some period of time (a minute?, a week?,
a month?) after the program started running.

Of course a simpler way is what Java apparently does - provide a
runtime switch to indicate that something’s a long-running process. I
believe the way they enable the technique is with the server vs. client
versions of their VM. Here’s[1] how Sun describes it:

On May 12, 7:54 am, Roger P. [email protected] wrote:

compiler? How much would this help scripts?

There’s still a huge amount of room for improvement. I haven’t seen
any of the Ruby implementations even try to properly apply more
sophisticated VM/JIT techniques such as tracing and polymorphic inline
caches.

There’s nothing inherently in Ruby preventing “near C” performance (at
least within the same order of magnitude, but probably a lot closer),
though there are lots of things that make it a lot of work to get there
(the level of dynamism, certainly).

Even though Ruby programs are in theory extremely dynamic, most paths
through a program will be heavily dominated by the same types over and
over, and that can be exploited to massively reduce overhead, with some
pretty cheap checks to shunt execution over to a fallback if certain
assumptions don’t hold, etc.
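The shape of such a check, written out by hand (a sketch of the principle only; a real implementation would generate this, not ask the programmer to write it):

```ruby
# Speculate that the common case is a String: a cheap class check guards
# an "inlined" fast path, with ordinary generic dispatch as the fallback.
def shout(obj)
  if obj.instance_of?(String)   # assumption holds: fast path
    obj.upcase << "!"
  else                          # assumption broken: generic slow path
    obj.to_s.upcase << "!"
  end
end
```

If 99% of callers pass a String, the guard is nearly free and the fast path can skip the full dispatch machinery.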

It’s a question of when, not if, we see far faster Ruby
implementations than the current range of VMs.

Vidar

Vidar H. wrote:

compiler? How much would this help scripts?

There’s still a huge amount of room for improvement. I haven’t seen
any of the Ruby implementations even try to properly apply more
sophisticated VM/JIT techniques such as tracing and polymorphic inline
caches.

I’ve experimented with PICs in JRuby, and for most tests I ran they did
not help very much. Granted, there’s an unfortunate lack of nontrivial
polymorphic benchmarks in the wild, so it’s possible a good PIC would
help more on real apps.

What does make a huge difference for JRuby is eliminating some of the
extra overhead related to rarely-used Ruby features. For example, the
difference between a normal run and a run that eliminates an unnecessary
frame object, uses fast dispatch for math operations, and eliminates
some thread checkpointing:

~/NetBeansProjects/jruby ➔ jruby -J-server test/bench/bench_fib_recursive.rb
0.746000 0.000000 0.746000 ( 0.746000)
0.371000 0.000000 0.371000 ( 0.370000)
0.357000 0.000000 0.357000 ( 0.357000)
0.357000 0.000000 0.357000 ( 0.357000)
0.358000 0.000000 0.358000 ( 0.357000)
~/NetBeansProjects/jruby ➔ jruby -J-server -J-Djruby.compile.fastest=true test/bench/bench_fib_recursive.rb
0.960000 0.000000 0.960000 ( 0.959000)
0.243000 0.000000 0.243000 ( 0.243000)
0.238000 0.000000 0.238000 ( 0.238000)
0.237000 0.000000 0.237000 ( 0.237000)
0.235000 0.000000 0.235000 ( 0.235000)

And Ruby 1.9:

~/NetBeansProjects/jruby ➔ …/ruby1.9/ruby -I …/ruby1.9/lib test/bench/bench_fib_recursive.rb
0.400000 0.010000 0.410000 ( 0.412421)
0.400000 0.000000 0.400000 ( 0.407236)
0.400000 0.000000 0.400000 ( 0.415222)
0.400000 0.010000 0.410000 ( 0.417042)
0.400000 0.000000 0.400000 ( 0.452934)

So yes, there’s definitely room to improve all the implementations…

  • Charlie

Charles Oliver N. wrote:

What does make a huge difference for JRuby is eliminating some of the
extra overhead related to rarely-used Ruby features. For example, the
difference between a normal run and a run that eliminates an unnecessary
frame object, uses fast dispatch for math operations, and eliminates
some thread checkpointing:

Maybe something is possible along the lines of

vm_optimized :no_frame_pointer, :fast_math, :no_thread_checkpointing do
# some code that should run very fast
end

:-)
Thanks for your work :-)
-R
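One cheap way to keep such hints forward-compatible (a sketch; `vm_optimized` is purely hypothetical here, as above): define it as a plain method that ignores the hints and just yields, so the same code runs unchanged on implementations without any pragma support.

```ruby
# Hypothetical no-op fallback for a vm_optimized pragma: a VM that
# understands the hints could intercept them; plain Ruby just runs
# the block unmodified.
unless respond_to?(:vm_optimized, true)
  def vm_optimized(*_hints)
    yield
  end
end

result = vm_optimized(:no_frame_pointer, :fast_math) do
  (1..10).reduce(:+)
end
```

That way the hints stay pure annotation: wrong or unsupported hints cost nothing and change no behavior.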

Yeah, I’m looking into those possibilities, trying to find a
nonintrusive way to introduce compiler pragmas that we could use for
implementing parts of JRuby in Ruby code (or that others could use). For
example, something like this (a bogus name…I don’t want to reveal any
pragmas yet):

def foo
____NO_FRAMING = true
end

One option that YARV could use is compiler definitions. :-)
Programmatically would seem easier on the user.
-R

Roger P. wrote:

Maybe something is possible along the lines of

vm_optimized :no_frame_pointer, :fast_math, :no_thread_checkpointing do
# some code that should run very fast
end

:-)
Thanks for your work :-)
-R

Yeah, I’m looking into those possibilities, trying to find a
nonintrusive way to introduce compiler pragmas that we could use for
implementing parts of JRuby in Ruby code (or that others could use). For
example, something like this (a bogus name…I don’t want to reveal any
pragmas yet):

def foo
____NO_FRAMING = true
end

  • Charlie

On 01.06.2008 03:14, Charles Oliver N. wrote:

Yeah, I’m looking into those possibilities, trying to find a
nonintrusive way to introduce compiler pragmas that we could use for
implementing parts of JRuby in Ruby code (or that others could use). For
example, something like this (a bogus name…I don’t want to reveal any
pragmas yet):

def foo
____NO_FRAMING = true
end

That’s an interesting idea, but I’d rather separate Ruby code from
platform specific options. So I’d prefer command line arguments to the
engine (as you presented before) or a smart mechanism to store switches
externally (e.g. an .rc file). Ideally there would even be a mechanism
that finds optimal switches automatically, but I guess that would be
really hard. :-)

Kind regards

robert

On Tue, May 13, 2008 at 12:26:48PM +0900, Charles Oliver N. wrote:

So yes, there’s definitely room to improve all the implementations…

Similar speedups with ludicrous, which isn’t even very smart about its
optimizations:

cout@bean:~/download/jruby/jruby/test/bench$ ruby1.9 bench_fib_recursive.rb
1.450000 0.030000 1.480000 ( 1.628318)
0.930000 0.020000 0.950000 ( 1.956004)
0.920000 0.020000 0.940000 ( 0.937877)
0.930000 0.020000 0.950000 ( 0.958286)
0.930000 0.020000 0.950000 ( 0.946474)

cout@bean:~/download/jruby/jruby/test/bench$ ludicrous bench_fib_recursive.rb
0.670000 0.010000 0.680000 ( 0.679764)
0.670000 0.020000 0.690000 ( 0.688893)
0.680000 0.010000 0.690000 ( 0.695944)
0.680000 0.010000 0.690000 ( 0.694999)
0.680000 0.020000 0.700000 ( 0.696761)

Paul

Roger P. wrote:

vm_optimized :no_frame_pointer, :fast_math, :no_thread_checkpointing do
  # some code that should run very fast
end

A flag is problematic for a couple reasons:

  • Most of the optimizations I’m trying to specify are not compatible
    with everything in Ruby; when enabled they’ll limit available features
    to a subset that doesn’t incur as much runtime overhead (and this is
    overhead both JRuby and MRI contend with).
  • For the cases where there are optimizations that can be applied
    globally…we just spin a new release and apply them globally. There’s
    not really a need to hold back on such things if they don’t break Ruby.

Ideally, we’d be able to apply these optimizations everywhere they’ll be
safe automatically, but that’s a very hard problem with Ruby’s dynamic
nature. So I see these as more a “programmer promise” that they won’t
use certain higher-overhead features in exchange for better performance.
It’s not something I’d see a lot of people using for general apps, but
it might be useful when building JRuby internal or core framework code.

I’m open to all thoughts on this though. There’s lots and lots of things
we can optimize by incrementally shutting down particular features. And
I think it’s a reasonable choice to offer people…if you want something
faster and are willing to give up a little, it should be your choice.
And yes, I fully appreciate the compatibility aspect of this…so it’s
definitely not intended for the uninitiated and probably not for general
use.

  • Charlie