The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
On Tue, Aug 01, 2006 at 05:48:41PM +0900, Peter H. wrote:
The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
Great! Thanks.
Peter H. wrote:
The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
There are some details missing from the webpages
- which C implementation?
- which Java implementation?
- what hardware?
For example, using the code from the webpage
- gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)
gcc -pipe -Wall -O3 -fomit-frame-pointer -funroll-loops -march=pentium4
latin.c -o latinc
time ./latinc > /dev/null 2>&1
user 0m0.820s
sys 0m0.000s
- /sun-jdk-1.5.0.07/bin/javac Latin.java
time java Latin > /dev/null 2>&1
user 0m3.800s
sys 0m0.644s
- 2GHz Intel P4
Isaac G. wrote:
Peter H. wrote:
The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
There are some details missing from the webpages
- which C implementation?
[peterhickman]$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin8
Configured with: /private/var/tmp/gcc/gcc-5341.obj~1/src/configure
--disable-checking -enable-werror --prefix=/usr --mandir=/share/man
--enable-languages=c,objc,c++,obj-c++
--program-transform-name=/^[cg][^.-]*$/s/$/-4.0/
--with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib
--build=powerpc-apple-darwin8 --host=powerpc-apple-darwin8
--target=powerpc-apple-darwin8
Thread model: posix
gcc version 4.0.1 (Apple Computer, Inc. build 5341)
- which Java implementation?
[peterhickman]$ javac -version
javac 1.5.0_06
Additionally
[peterhickman]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
[peterhickman]$ ruby -v
ruby 1.8.4 (2005-12-24) [powerpc-darwin]
- what hardware?
Macintosh G4 with 1GB RAM
Ok, so there’s a bunch of problems with the Java version.
- In addition to the addRow run and the Java startup time, you’re also benchmarking over 5200 array modifications to set Compared values to true
- Your variable naming is entirely contrary to every Java coding convention published (not a benchmark thing, but it sets off any Java dev’s warning flags)
- Almost all of the time spent running is spent building and printing strings
Benchmarking just the algorithm run itself, with no gross string creation and printing, I’m getting in the neighborhood of 370ms per invocation once HotSpot has optimized the code. I’ll have more detailed numbers shortly.
Peter H. wrote:
Isaac G. wrote:
Peter H. wrote:
The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
I recall someone stating “benchmarking without analysis is bogus”.
As a first step, comment out the print statements
time ./latinc
user 0m0.492s
sys 0m0.004s
time java Latin
user 0m0.992s
sys 0m0.052s
With the print statements ~5.4x
without print statements ~2.1x
iirc the Java program is shuffling around double byte unicode chars and
the C program is handling single byte chars.
Peter H. wrote:
The source code is available from
http://peterhi.dyndns.org/write_it_in_c/index.html
Hmm, it would look much better to me if you had included Kristof Bastiaensen’s Curry version in the party… Heck, if that code is not smart then nothing is, and in a way, it’s a much more interesting question how a compiled functional language compares to compiled imperative ones than the thousand-times-discussed interpreted vs. compiled match-up.
Yes, Curry is anything but mainstream, but you can’t say a decent compiler is not at your disposal. The Munster CC (which was used by Kristof, and AFAIK that’s the most commonly used implementation) even has an IDE for OS X.
Regards,
Csaba
On 8/1/06, Isaac G. [email protected] wrote:
iirc the Java program is shuffling around double byte unicode chars and
the C program is handling single byte chars.
Yes, that’s one of the biggest problems with the code. The Java version uses all 16-bit UTF-16 character strings internally and then normalizes to the platform’s preferred encoding (usually ISO-8859 or some variation of it). If you really want the prints (which you SHOULDN’T, because benchmarking a numeric algorithm and including IO is bogus), then make the C version do the same amount of work…wide char strings and normalize to ASCII on write.
The Java code was broken in many different ways, such that any numbers generated using this code are badly skewed against Java. You can’t run benchmarks against Java, to prove Java’s slow, and then write code that’s obviously crippling it. I’ll only fix the blatant mistakes here…others may do a deeper cleanup of the Java code if they wish. I’ll post my code if requested, but any Java programmer will understand the optimizations as I list them below. They’re pretty obvious.
And I don’t doubt that C will probably be faster, even with those optimizations…but it won’t be faster by much and certainly not by orders of magnitude. Algorithms where Java does especially well are any that involve memory allocation, which doesn’t come into play here but which is applicable to almost all real-world code. The point of this thread is that using the underlying platform code (C for Ruby, Java for JRuby) will often help many algorithms…and this much is true. But don’t venture into comparing C against Java if you’re going to make blanket statements using flawed tests.
First, some notes on benchmarking:
- NEVER include IO when benchmarking a numeric algorithm; IO speeds vary greatly from system to system and can vary from run to run depending on what else is happening
- Do not include initialization code in benchmarks, especially in this case where you’re manually tweaking a gigantic array
- If you’re building up a large chunk of strings, write to an internal buffer and then write out at the end; don’t write for every little tiny operation. At the very least, use a buffer per line, rather than a separate write for every element on that line.
- Make sure you’re actually testing the same thing on all systems; in this case, the Java code was severely crippled on a number of levels
- I have not changed the algorithm in any way, but an iterative algorithm would perform a lot better on all platforms.
So I made some mild optimizations (see the sketch after this list):
- You do not need to compare boolean values to “true” or “false”; just use them directly as the test condition.
- Write strings to an internal buffer or do not write them at all; to support Unicode across platforms, Java normalizes text to the base platform’s preferred encoding, and so incurs more overhead in this benchmark than the other versions do. If you want to make this a better test, have the C version use wide char strings internally and normalize to ASCII on write.
- I moved the initialization of the Compared array to a separate function and excluded it from the test. I clear out and reinit the string buffer and the compared array for each test run. The C code loads a static array into memory in probably microseconds, so including this initialization in the Java test totally skews the results.
- I had the test benchmark just the call to addRow, since the Java platform overhead is a fixed cost outside of this test. If you want to figure that cost in, you’re welcome to…I’ve left the timings as in the original.
- I ran the algorithm six times per test to allow the JVM to optimize it. Note how quickly the speed improves once HotSpot gets to it.
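In rough outline the harness looks something like this; the addRow signature, the Compared array shape, and the sizes below are placeholders rather than the actual Latin.java code:

// Minimal benchmark-harness sketch for the changes listed above.
// Assumptions: addRow() is the algorithm under test and appends its
// output to a StringBuilder; Compared is a boolean array whose
// initialization is excluded from the timed region.
public class LatinBench {
    static final int RUNS = 6;              // repeat so HotSpot can optimize
    static boolean[] compared;
    static StringBuilder out;

    // initialization is done outside the timed region for each run
    static void initCompared(int size) {
        compared = new boolean[size];
        // ... set the precomputed "true" entries here ...
    }

    // stand-in for the real algorithm; note booleans are tested directly,
    // e.g. "if (compared[i])" rather than "if (compared[i] == true)"
    static void addRow() {
        for (int i = 0; i < compared.length; i++) {
            if (compared[i]) {
                out.append(i).append('\n'); // buffered, not printed per element
            }
        }
    }

    public static void main(String[] args) {
        for (int run = 0; run < RUNS; run++) {
            initCompared(5200);                 // size is illustrative only
            out = new StringBuilder(1 << 20);   // fresh, preallocated buffer
            long start = System.currentTimeMillis();
            addRow();                           // time only the algorithm call
            long took = System.currentTimeMillis() - start;
            System.out.println("Took " + took + " ms");
        }
        // the buffered output could be written once here, outside the timing
    }
}

The point is simply that the only thing inside the timed region is the algorithm call itself.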
And one caveat: I don’t have perl set up properly on this machine, so I wasn’t able to generate the header or run the C code. When I do, I’ll post those numbers for comparison.
Ubuntu Linux 6 (64-bit), current supported kernel (something 2.6.15ish)
Opteron 150, 2.6GHz, 2GB RAM
All Java versions are AMD64.
Java 5, client vm, no string creation/buffering:
headius@opteron:~/latin_in_java$ time java Latin
Took 1082 ms
Took 645 ms
Took 388 ms
Took 385 ms
Took 551 ms
Took 385 ms
real 0m3.667s
user 0m3.436s
sys 0m0.032s
Java 5, client vm, write strings to internal string buffer:
headius@opteron:~/latin_in_java$ time java Latin
Took 631 ms
Took 599 ms
Took 492 ms
Took 496 ms
Took 492 ms
Took 499 ms
real 0m3.340s
user 0m3.080s
sys 0m0.116s
Java 6, client vm, no strings:
headius@opteron:~/latin_in_java$ time /usr/lib/jvm/jdk1.6.0/jre/bin/java
Latin
Took 400 ms
Took 400 ms
Took 395 ms
Took 408 ms
Took 369 ms
Took 367 ms
real 0m2.459s
user 0m2.368s
sys 0m0.032s
Java 6, client vm, write strings to internal buffer:
headius@opteron:~/latin_in_java$ time /usr/lib/jvm/jdk1.6.0/jre/bin/java
Latin
Took 531 ms
Took 497 ms
Took 478 ms
Took 478 ms
Took 486 ms
Took 494 ms
real 0m3.172s
user 0m2.940s
sys 0m0.104s
On Wed, 02 Aug 2006, Charles O Nutter defenestrated me:
First, some notes on benchmarking:
- NEVER include IO when benchmarking a numeric algorithm; IO speeds vary
greatly from system to system and can vary from run to run depending on what
else is happening
IO can be noisy. I say avoid it for any benchmarking since it can
greatly influence timings. Usually the IO is not what you want to
measure so why add this variable into things?
- If you’re building up a large chunk of strings, write to an internal
buffer and then write out at the end; don’t write for every little tiny
operation. At the very least, use a buffer per-line, rather than a separate
write for every element on that line.
I just informally thought I would measure a few things involving IO.
I only changed the printing and nothing else:
Unaltered test: ~3.8s
Use of StringBuffer to print out a single row: ~2.1s
Use of StringBuffer for entire run: ~1.5s
Preallocated StringBuffer for entire run: ~1.4s
As you can see, IO can have a large effect on wall clock time. I demonstrated that in Java’s case the IO in the benchmark accounted for over 2/3 of the wall clock time (which is interesting, because a decent chunk of what is left over is JVM startup overhead).
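The “preallocated StringBuffer for entire run” change was roughly along these lines (the data here is made up; only the output handling is the point):

// Sketch: build all output in one preallocated buffer, write it once.
public class BufferedRun {
    public static void main(String[] args) {
        int[][] rows = { {1, 2, 3}, {2, 3, 1}, {3, 1, 2} }; // illustrative data
        StringBuffer buf = new StringBuffer(1 << 20);        // preallocated once
        for (int[] row : rows) {
            for (int v : row) {
                buf.append(v).append(' ');                    // no per-element IO
            }
            buf.append('\n');
        }
        System.out.print(buf);   // a single write at the end of the run
    }
}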
Some stack allocated space will likely improve the C run as well (and in
this case you can output it in a single write system call).
-Tom
And for the record, here are the single-run timings for Java 6 (rather than the same test six times in process):
With internal string buffer:
headius@opteron:~/latin_in_java$ time /usr/lib/jvm/jdk1.6.0/jre/bin/java
Latin
Took 617 ms
real 0m0.754s
user 0m0.664s
sys 0m0.044s
Without strings at all:
headius@opteron:~/latin_in_java$ time /usr/lib/jvm/jdk1.6.0/jre/bin/java
Latin
Took 368 ms
real 0m0.494s
user 0m0.420s
sys 0m0.024s
And the C version, which is quite a bit slower on my system than the Java version. Peter, did you confirm both of these are actually running correctly? I didn’t do any correctness check…I just fixed what was broken. I’ll investigate a bit more as well.
Without string writes:
headius@opteron:~/latin_in_c$ perl gen.pl 5 > latin.h
headius@opteron:~/latin_in_c$ gcc -o latin latin.c
headius@opteron:~/latin_in_c$ time ./latin
real 0m1.536s
user 0m1.524s
sys 0m0.004s
With string writes:
…
headius@opteron:~/latin_in_c$ time ./latin > /dev/null 2>&1
real 0m1.955s
user 0m1.800s
sys 0m0.016s
Thomas E Enebo wrote:
measure so why add this variable into things?
Use of StringBuffer to print out a single row: ~2.1s
-Tom
As you’re having so much fun, let me suggest you try converting the
OutputStrings to byte-arrays, and pre-allocating a byte buffer for
output like the approach taken with this program
http://shootout.alioth.debian.org/gp4/benchmark.php?test=fasta&lang=java&id=2
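Roughly this shape, with a placeholder output loop and made-up sizes:

// Sketch: accumulate output as raw ASCII bytes in a preallocated buffer
// and hand it to System.out in one write, so no per-element String objects
// or char-to-byte encoding conversions are created.
public class ByteOutput {
    public static void main(String[] args) {
        byte[] buf = new byte[1 << 20];       // preallocated output buffer
        int pos = 0;
        for (int i = 0; i < 1000; i++) {      // stand-in for the real output loop
            pos = appendInt(buf, pos, i);
            buf[pos++] = (byte) ' ';
        }
        buf[pos++] = (byte) '\n';
        System.out.write(buf, 0, pos);        // single write, no strings at all
        System.out.flush();
    }

    // append the decimal digits of a non-negative int as ASCII bytes
    static int appendInt(byte[] buf, int pos, int value) {
        if (value == 0) {
            buf[pos++] = (byte) '0';
            return pos;
        }
        int start = pos;
        while (value > 0) {
            buf[pos++] = (byte) ('0' + value % 10);
            value /= 10;
        }
        // digits were appended least-significant first; reverse them in place
        for (int i = start, j = pos - 1; i < j; i++, j--) {
            byte tmp = buf[i];
            buf[i] = buf[j];
            buf[j] = tmp;
        }
        return pos;
    }
}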
csaba wrote:
than the thousand-times-discussed interpreted vs. compiled match-up.
Yes, Curry is anything but mainstream, but you can’t say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that’s the most
commonly used implementation) even has an IDE for OS X.
While I certainly appreciate the efforts that are going into this, I
can’t help feeling it’s all completely irrelevant.
We can engage in cross-implementation pissing contests until the cows
come home. None of them help make Ruby any faster.
My question to the community: is there a comprehensive benchmark suite
for Ruby alone that we can use to tweak compilation settings, try out
different core algorithms, and improve what is currently an improvable
situation?
If not, would a port of pybench.py be a suitable start?
Charles O Nutter wrote:
… Which is extremely funny, since Common Lisp has had wicked fast
virtual machines for the last 15 years (on par with C in performance).
They should catch up with the 20th century first of all. =)
–
Ola B. (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)
“Yields falsehood when quined” yields falsehood when quined.
On 8/1/06, Alex Y. [email protected] wrote:
While I certainly appreciate the efforts that are going into this, I
can’t help feeling it’s all completely irrelevant.
My only purpose in battling these benchmarks is to help dispel the rumors that “Java is slow,” “VMs are slow,” and so on. If Ruby does move to a real optimizing VM, it will be a good thing…all those folks who continue to think that VMs are inherently bad need to join the 21st century.
We can engage in cross-implementation pissing contests until the cows
come home. None of them help make Ruby any faster.
My question to the community: is there a comprehensive benchmark suite
for Ruby alone that we can use to tweak compilation settings, try out
different core algorithms, and improve what is currently an improvable
situation?
If not, would a port of pybench.py be a suitable start?
Oh man, what I wouldn’t give for a really good community-approved benchmark suite. We’ve been battling performance gremlins on JRuby since I started, and we’re just now starting to make some good progress. However we don’t have any really solid set of tests to use to benchmark things. I’d be very pleased if something existed, and I’d be willing to devote some time to make it happen otherwise.
On Wed, Aug 02, 2006 at 05:04:26AM +0900, Charles O Nutter wrote:
On 8/1/06, Alex Y. [email protected] wrote:
While I certainly appreciate the efforts that are going into this, I
can’t help feeling it’s all completely irrelevant.
My only purpose in battling these benchmarks is to help dispel the rumors
that “Java is slow,” “VMs are slow,” and so on. If Ruby does move to a real
optimizing VM, it will be a good thing…all those folks who continue to
think that VMs are inherently bad need to join the 21st century.
. . . but VMs actually are slow (to start, all else being equal).
There’s a trade-off, though, and VMs tend to be faster later on in
execution for extended operations (again, all else being equal). There
are other alternatives than VMs to consider, though, and the specifics
of what one wishes to accomplish should be examined before settling on
the VM (or any other implementation style) as “the answer”.
I’m kinda just babbling at this point.
Charles O Nutter wrote:
Ok, so there’s a bunch of problems with the Java version.
- In addition to the addRow run and the Java startup time you’re also
benchmarking over 5200 array modifications to set Compared values to true
That was simply because I couldn’t define the array when I declared it, the way I did in C.
- Your variable naming is entirely contrary to every Java coding
convention
published (not a benchmark thing, but it sets off any Java devs warning
flags)
And this affects the performance?
Peter H. wrote:
Charles O Nutter wrote:
- Your variable naming is entirely contrary to every Java coding
convention
published (not a benchmark thing, but it sets off any Java devs warning
flags)
And this affects the performance?
The point Charles made by saying “but it sets off any Java dev’s warning flags” is that your Java coding conventions differ so much from the regular conventions that your Java coding capacity is put into doubt. In plain speak: are you a good enough Java programmer to write an honest benchmark version for Java?
–
Ola B. (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)
“Yields falsehood when quined” yields falsehood when quined.
The output of my original C and Java versions are identical. If you
write the output to a file a straight diff -s should suffice.