Ruby thread-dump Gem


#1

One feature I really miss from the JVM is the ability to send SIGQUIT
(ctrl-break) and get a list of running threads. It is a killer feature
for tracking down hung processes, because it shows you where threads
are stuck.

I’ve written a small C extension for Ruby that will send a list of
threads and their current file/line number executing to STDERR upon
receiving a SIGQUIT. Simply install the gem and require ‘thread-dump’,
and a trap will be registered with SIGQUIT within that process. (So,
if you’re fork’ing stuff, you need to do the require within the fork,
for example.)
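
For readers who want to see the idea without the C extension, here is a
minimal pure-Ruby sketch. It is not the gem's implementation: it assumes
a Ruby new enough to have Thread#backtrace (1.9 or later), and the name
dump_threads is made up for illustration.

```ruby
# Sketch only: a pure-Ruby stand-in for the behaviour described above,
# not the gem's C code. Assumes Thread#backtrace exists (Ruby 1.9+).

# Hypothetical helper: write one line per live thread with its status
# and the innermost frame (file:line) it is currently executing.
def dump_threads(io = $stderr)
  Thread.list.each do |thread|
    frames = thread.backtrace || []          # nil if the thread is dead
    location = frames.first || "(no Ruby frames)"
    io.puts "#{thread.inspect} status=#{thread.status.inspect} at #{location}"
  end
  nil
end

# Register the dump on SIGQUIT, but only where the platform has that signal.
if Signal.list.key?("QUIT")
  Signal.trap("QUIT") { dump_threads }
end
```

With this loaded, sending SIGQUIT to the process (for example with
kill -QUIT and the pid) should print one line per thread to STDERR.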

The RubyForge project is still pending. If you really need the gem ASAP
(as we did), I’ve put it up at Google Pages:

http://gfodor.googlepages.com/thread-dump-0.0.1.gem

This helped us track down a nasty bug that was occurring due to the
lack of a timeout in Net::HTTP during SSL connect, which still seems
to be busted in Ruby trunk.

The extension is still very primitive. I was unable to decipher the
Ruby C necessary to unwind the full stack for a thread that is not
currently executing, but fortunately I was able to at least get the
bottom of the stack using the RNode held by the *thread. Help with
this feature would be greatly appreciated!

It should be up at thread-dump.rubyforge.org sooner or later.


#2

From: “Greg F.” removed_email_address@domain.invalid

I’ve written a small C extension for Ruby that will send a list of
threads and their current file/line number executing to STDERR upon
receiving a SIGQUIT.

Nice. I’ve wanted this ability a few times.

This helped us track down a nasty bug that was occurring due to the
lack of a timeout in Net::HTTP during SSL connect, which still seems
to be busted in ruby trunk.

Could you describe this in more detail, or possibly post a diff
of the changes you made to fix the bug? (We’re about to ship a
product with embedded ruby using Net::HTTP and SSL, so it would
be great to be able to eliminate any such lurking bug.)

Thanks,

Bill


#3

On Jul 20, 1:06 am, “Bill K.” removed_email_address@domain.invalid wrote:

This helped us track down a nasty bug that was occurring due to the
Bill

We didn’t fix it directly; we worked around it by timing out all
requests in the outer caller. The bug seems to be inside of def
connect: there is a call to “s.connect” if SSL is enabled, and this
call is not timed out. Some of our processes were hanging on this
call.
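
A rough sketch of that kind of outer-caller workaround, using the
standard timeout library. The helper name with_deadline and the
30-second figure are invented for illustration, and whether the timeout
can actually interrupt a given blocking call depends on the Ruby
version and on how that call blocks.

```ruby
require "timeout"

# Hypothetical helper (not from our code base): run the block and give up
# after `seconds`. Timeout.timeout raises Timeout::Error in the caller
# when the limit is reached; otherwise the block's value is returned.
def with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
end

# Usage, with the request itself left as a placeholder:
#   with_deadline(30) { http.get("/status") }
```

The caller then rescues Timeout::Error and decides whether to retry or
give up on the request.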


#4

From: “Greg F.” removed_email_address@domain.invalid

be great to be able to eliminate any such lurking bug.)

We didn’t fix it directly, we worked around it by timing out all
requests in the outer caller. The bug seems to be inside of def
connect, there is a call to “s.connect” if ssl is enabled, and this
call is not timed out. Some of our processes were hanging on this
call.

Interesting. We’ve been seeing an issue with “s.connect” as well,
but only on Windows (ruby 1.8.4), and oddly only when ruby is
embedded into our C++ app, and only the first time the SSL
connect takes place.

For us, we’d see the CPU pegged for about 20 seconds down in
openssl.so -> ssleay.dll -> libeay.dll. But it would eventually
return. After that, all subsequent SSL connect calls would
execute quickly.

I was wondering if it was doing some one-time generation of a
private key or something. . . . (But why only when ruby was
embedded in our C++ app? Something missing from the environment,
I wondered…?)

Anyway, I wasn’t getting very far debugging it as I didn’t have
symbols for ruby or the ssl libraries. (I was using binaries
from the One-click installer.) So I built ruby 1.8.4 and
openssl locally with debug symbols, updating to a newer version
of OpenSSL in the process. (0.9.8e)

The result: The unexplained “s.connect” delay seems to have
vanished.

I would be happier if I knew what had been causing the problem;
maybe it’s still lurking. But it used to happen like clockwork,
and since rebuilding ruby and a newer OpenSSL, I’ve yet to see
the problem again.

Incidentally, our app also runs on OS X, and I have yet to see
this “s.connect” problem over there.

What platform(s) are you seeing it on? In your case, it sounded
like it may have been hanging indefinitely on you, as opposed to
being a ~20 second delay that would eventually return?

Regards,

Bill


#5

What platform(s) are you seeing it on? In your case, it sounded
like it may have been hanging indefinitely on you, as opposed to
being a ~20 second delay that would eventually return?

Yup, this happens on multiple Fedora Core servers, and they were hung
up for several hours. We didn’t notice this until we started trying
to gracefully shut down these threads and realized a good chunk of
them were stuck.

This little tool quickly revealed which line was the culprit :)
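
The shutdown check mentioned above can be sketched roughly like this.
It assumes you already hold references to the worker threads; the name
stuck_threads and the grace period are hypothetical.

```ruby
# Hypothetical helper: wait up to `grace` seconds in total for the given
# threads to finish, then return the ones that are still running.
# Note that Thread#join re-raises an exception that killed a thread.
def stuck_threads(threads, grace)
  deadline = Time.now + grace
  threads.reject do |thread|
    remaining = [deadline - Time.now, 0].max
    thread.join(remaining)   # returns nil if the thread has not finished
  end
end
```

Anything this returns is a candidate for a closer look with a thread
dump.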