Francois S. inadvertently found a way to replicate a rare but deadly
bug right as I was working up the official release of Mongrel 0.3.13.
This bug only happened to a few people, but thanks to the wonderful
fuzzing tool Apache Bench he could replicate the slow select
starvation people were seeing.
This bug is now fixed in the current pre-release, and I’d like everyone
to grab it in the usual way:
$ gem install mongrel --source=http://mongrel.rubyforge.org/releases/
And run Francois' little Mongrel killer script (in a bash shell):
while true; do ab -n 1000 -c 30 http://localhost:3000/ 2>/dev/null |
grep 'quest.*mean' ; echo -- ; done
You will need Apache Bench for this to work.
Specifically, hit this against a file and potentially a Rails action or
two. Also try raising the -c option to higher levels.
You will probably see a bunch of Broken Pipe or other errors when you
hit files, but otherwise you should see the same performance for the
same -c settings. If you see the performance degrade over time (very
quickly) then shoot me an e-mail with the errors you see and the
operating system you use.
If nobody has problems with this release then I’ll make it official.
In general, if you are using Ruby threads and a socket produces an
error, then don’t use that socket anymore. Turns out that Ruby will
happily let you continue using the socket, but most OS select()
functions have strange semantics with dead sockets. In this case, Ruby
is most likely waiting for a read or write event on the socket, but
since it’s dead there will be none.
On most operating systems this turns out to put any thread that uses
that socket into a permanently sleeping state.
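A minimal sketch of that rule in Ruby. The helper name use_socket is hypothetical and is not part of Mongrel; it only illustrates "one error and the socket is retired":

```ruby
require "socket"

# Hypothetical helper (an assumption for illustration, not Mongrel's API):
# yield the socket to the block and, if an I/O or system-call error escapes,
# close the socket so no thread can wait on it again.
# Returns the block's value, or nil when the socket had to be retired.
def use_socket(sock)
  yield sock
rescue IOError, SystemCallError
  sock.close unless sock.closed?
  nil
end

# Usage: the peer goes away, so the write is expected to fail and the
# helper closes our end instead of leaving a dead socket around.
client, peer = UNIXSocket.pair
peer.close
use_socket(client) { |s| s.write("hello") }
```

The point of routing every read and write through one helper is that there is exactly one place where a failed socket gets closed, so no code path can forget to do it.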
What should happen is that any socket that throws an exception should
be put into an invalid state so that further operations on it blow up
and then any invalid sockets are not put into the select loop for
I’ve got tests going which exercise this bug and have refactored all of
the socket usage so that exceptions are detected and the socket isn’t
used anymore after that. Hopefully this squashes it permanently.
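As a rough sketch of the "invalid state" idea, assuming a hypothetical GuardedSocket wrapper and select_readable helper (neither is Mongrel's actual code), a socket that has ever raised is marked invalid, later operations on it raise immediately, and it is filtered out before IO.select is called:

```ruby
require "socket"

# Hypothetical wrapper (illustration only): remembers whether the
# underlying socket has ever failed.
class GuardedSocket
  class Invalid < StandardError; end

  attr_reader :io

  def initialize(io)
    @io = io
    @valid = true
  end

  def valid?
    @valid && !@io.closed?
  end

  # Run an operation on the raw socket. Any I/O failure invalidates the
  # socket for good; later calls raise Invalid instead of touching it.
  def guard
    raise Invalid, "socket already failed" unless valid?
    yield @io
  rescue IOError, SystemCallError
    invalidate
    raise
  end

  def invalidate
    @valid = false
    @io.close unless @io.closed?
  end
end

# Only sockets that are still valid are ever handed to select.
def select_readable(sockets, timeout = 0.1)
  live = sockets.select(&:valid?)
  return [] if live.empty?
  ready, = IO.select(live.map(&:io), nil, nil, timeout)
  ready ||= []
  live.select { |s| ready.include?(s.io) }
end
```

With this shape a dead socket can never reach the select call, so no thread ends up waiting on an event that will not arrive.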
Thanks for the help.
Zed A. Shaw
 Apache Bench is the worst piece of crap on the planet. The reason
this bug shows up is because ab actually violently closes sockets on any
connections that “take too long”. It of course doesn’t define what “too
long” is and doesn’t tell you it’s going to do this. Use httperf. This
behavior is available with httperf but you have to turn it on
explicitly. If you need to see what happens when really nasty clients
hit your server in the thousands, then use ab. Otherwise its
performance measurements are total crap since it can’t possibly be
getting an accurate reading if it’s closing most of the sockets.