Nginx/Mongrel hanging requests

Hello guys.

We have a strange issue with our Rails app, deployed on a Linux box with
nginx in front of a Mongrel cluster.
Most of the time the app runs smoothly, fast enough - everything is fine.
But sometimes a request takes up to 60 seconds to finish. Meanwhile, top
shows no busy mongrel or nginx processes, and lsof shows no blocking open
file handles. It seems that nothing happens while the request is hanging.
We played with other versions of nginx and various configuration options,
but without success.

These error messages from the nginx log may indicate that the problem
has nothing to do with nginx itself:

2007/07/06 20:20:22 [info] 23329#0: *28 client 66.249.72.227 closed
keepalive connection
2007/07/06 20:20:53 [error] 23332#0: *11 upstream timed out (110:
Connection timed out) while reading response header from upstream,
client: 85.180.37.66, server: www.groops.de, URL: "/foobar/", upstream:
"http://88.198.10.179:3000/foobar/", host: "www.groops.de", referrer:
"http://www.groops.de/dashboard"

Maybe somebody here has had similar experiences and could give us a
hint. That would be great!

Greetings, Marc

ruby 1.8.6, rails 1.2.3, mongrel 1.0.1, nginx 0.5.26, SUSE Linux
(kernel 2.6.16.13-4-smp)

Check your database server and see if there are MySQL queries that
are still running.

  • Jesse
    Blue Box Group, LLC

The default proxy_read_timeout is set to 60 seconds. So it would
appear that one of your backends is not responding within 60 seconds
and so nginx waits and then sends the request to the next_upstream.
Hence the request finishing in 60 seconds.
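
A quick way to confirm which timeouts are actually in effect is to grep
the config (a minimal sketch; the /etc/nginx path is an assumption,
adjust it to your install - the documented defaults apply for any
directive that is absent):

    # list the proxy timeout and failover directives currently set
    grep -rE 'proxy_(read|connect|send)_timeout|proxy_next_upstream' /etc/nginx/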

It’s hard to say why your backend is non-responsive without more
info. Do you have any really long-running actions? Are you using monit
and maybe catching a mongrel when it’s being rebooted?
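
One way to rule nginx out is to time requests against each mongrel
directly (a rough sketch; the ports 3000-3002 and the root path are
assumptions, substitute your cluster's values):

    # hit each backend directly, bypassing nginx, and time the response
    for port in 3000 3001 3002; do
      time curl -s -o /dev/null "http://127.0.0.1:$port/"
    done

A backend that intermittently takes ~60 seconds here while the others
answer instantly would point at the mongrels rather than at nginx.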

Cheers-
– Ezra Z.
– Lead Rails Evangelist
[email protected]
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

Hey Ezra!

The default proxy_read_timeout is set to 60 seconds. So it would
appear that one of your backends is not responding within 60 seconds
and so nginx waits and then sends the request to the next_upstream.
Hence the request finishing in 60 seconds.

Yes, it seems that a hanging request takes 60 seconds to finish
(although I also measured one that took 120 seconds).

It’s hard to say why your backend is non-responsive without more
info. Do you have any really long-running actions? Are you using monit
and maybe catching a mongrel when it’s being rebooted?

No, there are no expensive, long-running actions, and we haven’t seen
this behaviour with a single mongrel process so far.
We don’t use monit or anything similar.

Greetings, Marc

Did you check your MySQL process list?

  • Jesse

Hey Jesse.

Thanks for your hint. We are using Postgres 8.1.9, and I have checked
the open connections and running queries but haven’t seen anything
suspicious.
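
For reference, this is roughly how the running queries can be listed on
8.1 (a sketch; the user and database names are placeholders, and
current_query is only filled in when stats_command_string is on):

    # list backends and the query each one is currently running
    psql -U postgres -d yourdb -c \
      "SELECT procpid, usename, query_start, current_query
         FROM pg_stat_activity;"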

Greetings, Marc

Problem fixed:
The problem was caused by a misconfigured resolv.conf, whereby DNS
lookups for internal webservice requests timed out.

As the problem was quite obscure and might hit others in the future, I
will go into a little more detail:

We had been sure that some timeout caused our problem and, besides many
other things, checked the DNS lookup of all internal and external
addresses, but couldn’t reproduce a timeout with dig or by pinging a
server.

Using strace we found out that the hanging process was indeed waiting
for a nameserver reply, and using tcpdump we found out that these
requests timed out. The funny thing: these were not normal DNS requests
but lookups of IPv6 addresses, which our SuSE 10.1 sends out in
parallel. The first request went to something like "some.server.de" and
was answered by the DNS server. A second request tried
"some.server.de.site" (from my unconfigured resolv.conf) and was not
answered at all, causing the timeout.
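
For anyone hitting something similar, these are roughly the commands
involved (a sketch; the PID and the interface name are placeholders):

    # attach to the hanging process and watch its network syscalls
    strace -f -e trace=network -p <mongrel_pid>
    # in a second terminal, watch DNS traffic for queries that never
    # receive an answer
    tcpdump -n -i eth0 port 53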

I switched to something like "server.de", and even though a lookup of
"some.server.de.server.de" looks funny, it is at least answered, and
all problems are gone.
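
To illustrate (a sketch; "site" and "server.de" stand in for the real
domain names):

    # show the search/domain suffixes that get appended to DNS lookups
    grep -E '^(search|domain)' /etc/resolv.conf
    # broken: search site       (nothing answers *.site lookups)
    # fixed:  search server.de  (fallback lookups at least get answered)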

Lesson 1: IPv6 is enabled on all modern Linux distributions and does
its own magic

Lesson 2: Shutting down a local nscd (name service caching daemon)
helps you find DNS lookup issues faster (see the sketch after the
lessons)

Lesson 3: An "lsof -p <pid>" gives you a list of all open file
descriptors and tells you more about the polls found with strace

Lesson 4: Nginx is fast and reliable! I love it! ;)
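
Concretely, for lessons 2 and 3 (a sketch; the init script path and the
mongrel_rails process name are SuSE- and Mongrel-specific assumptions):

    # lesson 2: take the nscd cache out of the picture while debugging
    /etc/init.d/nscd stop
    # lesson 3: map the file descriptors that strace shows polls on
    lsof -p $(pgrep -f mongrel_rails | head -n 1)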

Thanks for your thoughts and suggestions, Jan

Problem fixed:
The problem was caused by a misconfigured resolv.conf, whereby DNS
lookups for internal webservice requests timed out.

Greetings, Marc