Ruby 1.9.3-p194, Rails 3.1.1, Apache 2.2.16, Passenger 3.0.14, Debian
6.0.5, Linux 2.6.32-5-amd64
Recently we’ve encountered a problem with Ruby/Rack processes that
randomly hang and use 100% of a CPU core while processing requests. This
happens on several Linux systems running the production configuration
listed above. Our system communicates a lot with backend nodes over the
BERT-RPC protocol, so we keep a number of connections open in order to
serve a request.
We’ve attached gdb to these processes and found that they loop in
wait_connectable in ext/socket/init.c. There is a comment there that
doesn’t match what I see in the code.
While debugging we’ve seen cases where one or both of the flags
(RB_WAITFD_IN and RB_WAITFD_OUT) were set but sockerr was not, so the
code looped because of the “winsock workaround”. Applying the attached
patch seems to solve these issues. Note that I kept the original
conditions, so they still don’t match what the comment about the book
says. We’ve also observed the hang on connections other than those to
our backend.
However, we don’t know what exactly triggers this behavior or how this
code is supposed to work, and this is a stable Ruby release after all,
so I feel uneasy about keeping this patch.
So, does anybody know how this code is supposed to work and what could
trigger this condition? Is this possibly a bug and, if so, is this fix
valid, or does the change break anything obvious? Or should I write to
the dev list? Thanks.
The comment I referred to:
“Stevens book says, succuessful finish turn on RB_WAITFD_OUT and
failure finish turn on both RB_WAITFD_IN and RB_WAITFD_OUT.”
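For reference, the rule this comment alludes to can be sketched at the
Ruby level. This is neither the attached patch nor the code in
ext/socket/init.c; it is only a minimal illustration, assuming a plain
IPv4 TCP connection, and connect_with_so_error is a hypothetical helper
name. The point is that readiness flags alone are ambiguous (a socket
that connected successfully can also be readable if data has already
arrived), so the outcome is read from SO_ERROR once the socket becomes
writable.

```ruby
require 'socket'

# Hypothetical helper for illustration only: non-blocking connect that decides
# success or failure from SO_ERROR rather than from the readiness flags.
def connect_with_so_error(host, port, timeout = 5)
  sock = Socket.new(Socket::AF_INET, Socket::SOCK_STREAM, 0)
  addr = Socket.pack_sockaddr_in(port, host)

  begin
    sock.connect_nonblock(addr)
    return sock                      # connected immediately
  rescue IO::WaitWritable
    # EINPROGRESS: the connect continues in the background.
  rescue SystemCallError
    sock.close                       # immediate failure, e.g. ECONNREFUSED
    raise
  end

  # Both a successful and a failed connect make the socket writable.
  _, writable = IO.select(nil, [sock], nil, timeout)
  unless writable
    sock.close
    raise Errno::ETIMEDOUT
  end

  # SO_ERROR holds the pending error of the connect attempt, 0 on success.
  err = sock.getsockopt(Socket::SOL_SOCKET, Socket::SO_ERROR).int
  unless err.zero?
    sock.close
    raise SystemCallError.new("connect(2)", err)
  end

  sock
end
```

With this approach a socket that is reported as both readable and
writable while SO_ERROR is 0 is treated as connected, so there is
nothing left to wait for.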
Our quick patch is attached.
Marcin