Nginx 1.4.1 - slow transfers / connection resets

Hi,

I have an nginx (stock Ubuntu config) as a reverse proxy in front of
haproxy, which in turn sits in front of 5 more nginx machines that use
FastCGI to talk to php-fpm.

My issue is with the frontend proxy and long-running, veeeeery
slowwwww requests.

The clients are very underpowered mobile barcode scanners on 2G GSM
connections. When they try to download 2.1 MB of data dynamically
generated by PHP on the backend, the frontend will close the
connection after ~1 MB has been downloaded (at ~2 KBytes/s).

I can reproduce the same behavior using curl (with --limit-rate 2K):

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1196k    0 1196k    0   343   2047      0 --:--:--  0:09:58 --:--:--  1888
curl: (56) Recv failure: Connection reset by peer
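
(For reference, the shape of the command is roughly the following; the
URL is a placeholder, not the real endpoint:)

    # rate-limited transfer to mimic the 2G clients
    curl --limit-rate 2K -o /dev/null http://frontend.example.com/some/large/response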

The access log on the frontend server lists a 200 status code but too
few transmitted bytes.

The error log (at the info level) shows:

2013/08/19 14:03:36 [info] 32469#0: *1166 client timed out (110: Connection timed out) while sending to client, client: xxx.xxx.xxx.xxx

Which is not true: nginx logs that while curl (--limit-rate 2K) is
still running.

Can you give me any pointers on how to debug/fix this?

Philip


Sensational AG
Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
Tel. +41 43 544 09 60, Mobile +41 79 341 01 99
[email protected], http://www.sensational.ch

Hi,

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1196k    0 1196k    0   343   2047      0 --:--:--  0:09:58 --:--:--  1888
curl: (56) Recv failure: Connection reset by peer

Looks like there is some timeout at 600 seconds (Time Spent: 0:09:58)?
Any match
in the haproxy or nginx configurations?
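
(Something along these lines would show any such value; the paths are
the stock Ubuntu locations and may differ on your boxes:)

    grep -rni "timeout" /etc/nginx /etc/haproxy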

I have an nginx (stock Ubuntu config) as a reverse proxy in front of
haproxy, which in turn sits in front of 5 more nginx machines that use
FastCGI to talk to php-fpm.

Since you can reproduce it with curl, why not track the issue down to a
specific part of your infrastructure (try on the nginx backends first,
then on the haproxy box, and then on the frontend nginx box).
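
(For example, with the same rate limit against each tier in turn; the
hostnames, port and path below are placeholders:)

    curl --limit-rate 2K -o /dev/null http://backend-nginx/path/to/large/response
    curl --limit-rate 2K -o /dev/null http://haproxy-box:8080/path/to/large/response
    curl --limit-rate 2K -o /dev/null http://frontend-nginx/path/to/large/response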

Lukas

Hi,

On Mon, Aug 19, 2013 at 7:05 PM, Lukas T. [email protected]
wrote:

Looks like there is some timeout at 600 seconds (Time Spent: 0:09:58)? Any match
in the haproxy or nginx configurations?

That’s consistent with what nginx is logging to the error log. But it
doesn’t make sense as there is data being transmitted.

I have an nginx (stock Ubuntu config) as a reverse proxy in front of
haproxy, which in turn sits in front of 5 more nginx machines that use
FastCGI to talk to php-fpm.

Since you can reproduce it with curl, why not track the issue down to a
specific part of your infrastructure (try on the nginx backends first,
then on the haproxy box, and then on the frontend nginx box).

That’s what I’ve done before coming here. No issues on either haproxy
or the backend.

Here’s the debug log of just the failing request (thanks, nginx, for
making error_log a directive that can be used in a location block):

http://www.gnegg.ch/debug.log
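
(The location-scoped debug log looked roughly like the following; the
location, log path and upstream below are placeholders, and debug-level
logging needs an nginx binary built with --with-debug:)

    location /index.php/ {
        # debug-level log scoped to just this location
        error_log /var/log/nginx/debug.log debug;
        proxy_pass http://127.0.0.1:8081;
    }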

Philip



Hello!

On Mon, Aug 19, 2013 at 02:07:09PM +0200, Philip Hofstetter wrote:

Can you give me any pointers on how to debug/fix this?

A debug log should be helpful; see "A debugging log" in the nginx
documentation.


Maxim D.
http://nginx.org/en/donation.html

The last debug log I sent is not showing the full picture. In that
case, I was aborting the curl command once nginx had logged an
incomplete response (status=200 but too short a length) to access.log,
but while it was still transferring data (how's that even possible?).

Hence the “connection reset by peer” in the log.

I’m now making a second log, this time waiting it out. I will also
produce a third log transferring a static file from the backend nginx
in order to rule out fastcgi issues.

Philip

On Tue, Aug 20, 2013 at 9:14 AM, Philip Hofstetter
[email protected] wrote:


Ok. I have three debug logs now:

http://www.gnegg.ch/debug-cancel.log
is the first log I created where I quit curl once nginx has logged a
200 status with a truncated length to the access log (how can it log
success while it’s still transmitting data?)

http://www.gnegg.ch/debug-full.log
is the same request, but this time waiting for curl to complain about
the connection reset. Again, nginx logs a 200 with truncated length
(way before curl bails out)

http://www.gnegg.ch/debug-staticfile.log
is me downloading a static file from one of the backend servers. This
shows the same behavior as the dynamically generated response and
helps rule out fastcgi issues.

To add a further note: the machine which shows this issue is under
considerable load. When I run the tests against an identical machine
which is not under load, the download runs correctly (until I put it
under load, at which point it fails the same way).

The fact that nginx logs the request as successful (but truncated)
while it’s still ongoing does kinda point to a kernel issue, but I’m
really just guessing at this point.

Philip


Hello!

On Tue, Aug 20, 2013 at 09:49:32AM +0200, Philip Hofstetter wrote:

Ok. I have three debug logs now:

http://www.gnegg.ch/debug-cancel.log
is the first log I created where I quit curl once nginx has logged a
200 status with a truncated length to the access log (how can it log
success while it’s still transmitting data?)

The HTTP status code nginx logs to the access log corresponds to the
code sent to the client. As the code had already been sent by the time
the problem was detected, it's 200.

To add a further note: the machine which shows this issue is under
considerable load. When I run the tests against an identical machine
which is not under load, the download runs correctly (until I put it
under load, at which point it fails the same way).

The fact that nginx logs the request as successful (but truncated)
while it’s still ongoing does kinda point to a kernel issue, but I’m
really just guessing at this point.

Both full logs show that nothing happens for 60 seconds (while there is
unsent data pending):

2013/08/20 09:33:31 [debug] 1692#0: *1101651 http output filter "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http copy filter: "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 image filter
2013/08/20 09:33:31 [debug] 1692#0: *1101651 xslt filter body
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http postpone filter "/index.php/winclient/gnegg?" 00000000022A7218
2013/08/20 09:33:31 [debug] 1692#0: *1101651 write new buf t:0 f:0 0000000000000000, pos 000000000231CAF0, size: 4096 file: 0, size: 0
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http write filter: l:0 f:1 s:4096
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http write filter limit 0
2013/08/20 09:33:31 [debug] 1692#0: *1101651 writev: 1953

Note: only 1953 of 4096 bytes were sent.

2013/08/20 09:33:31 [debug] 1692#0: *1101651 http write filter 00000000022A7228
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http copy filter: -2 "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 event timer del: 141: 1376984038781
2013/08/20 09:33:31 [debug] 1692#0: *1101651 event timer add: 141: 60000:1376984071388

Note: timer was set to timeout after 60 seconds.

2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream request: "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream process non buffered upstream
2013/08/20 09:33:31 [debug] 1692#0: *1101651 event timer: 141, old: 1376984071388, new: 1376984071390
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream request: "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream dummy handler
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream request: "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream process non buffered upstream
2013/08/20 09:33:31 [debug] 1692#0: *1101651 event timer: 141, old: 1376984071388, new: 1376984071645
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream request: "/index.php/winclient/gnegg?"
2013/08/20 09:33:31 [debug] 1692#0: *1101651 http upstream dummy handler
2013/08/20 09:34:31 [debug] 1692#0: *1101651 event timer del: 141: 1376984071388
2013/08/20 09:34:31 [debug] 1692#0: *1101651 http run request: "/index.php/winclient/gnegg?"
2013/08/20 09:34:31 [debug] 1692#0: *1101651 http upstream process non buffered downstream
2013/08/20 09:34:31 [info] 1692#0: *1101651 client timed out (110: Connection timed out) while sending to client, client: 80.219.149.116, server: , request: "POST /index.php/winclient/gnegg HTTP/1.0", upstream: "http://127.0.0.1:8081/index.php/winclient/gnegg", host: "REDACTED.popscan.ch"
2013/08/20 09:34:31 [debug] 1692#0: *1101651 finalize http upstream request: 408

After 60 seconds the timer fired and the client connection was closed
as timed out.

That is, from nginx's point of view everything looks like a real
timeout.

Unfortunately, with location-level debug logs it's not possible to
see the event handling details (which is why it's generally
recommended to activate the debug log at the global level, BTW). But I
would suppose everything is fine there as well, and the problem is
actually a result of the kernel's behaviour.


Maxim D.
http://nginx.org/en/donation.html

Hello!

On Tue, Aug 20, 2013 at 03:14:20PM +0200, Philip Hofstetter wrote:

any idea how to do this on a system that's under load (60 requests per
second)? As I said before: When I do the same request on a system
that's not under load, the problem doesn't appear.

60 requests per second is low enough; just switching on the debug log
should work.
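
(If even that turns out to be too noisy, debug output can be limited to
a single client address; a sketch, using the test client's IP from your
log, and again requiring a binary built with --with-debug:)

    error_log /var/log/nginx/error.log info;

    events {
        # debug-level output only for connections from this address
        debug_connection 80.219.149.116;
    }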

But I would suppose everything is fine there as well, and the problem is
actually a result of kernel’s behaviour.

I started suspecting as much. Any pointers on how I could work
around/fix the issue at the kernel level?

No exact recommendation, but it's likely related to buffering at
some point. First of all I would recommend looking at what
actually happens on the wire with tcpdump/wireshark. If there is
indeed a transfer stall for 60+ seconds, you should look at the
client's side of the TCP connection, i.e. either the client's kernel
or curl. If there is a continuous flow of packets, it's likely
something to do with the sending part (just in case, aren't you using
send_lowat? If set too high it may cause such symptoms).
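
(A capture along these lines on the frontend box would do; the
interface name is an assumption, the client address is taken from the
debug log above:)

    # full packets, written to a file for later inspection in wireshark
    tcpdump -i eth0 -s 0 -w /tmp/slow-client.pcap host 80.219.149.116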


Maxim D.
http://nginx.org/en/donation.html

Hi,

After 60 seconds the timer fired and the client connection was closed
as timed out.

Yeah. That’s what I feared. But the connection was definitely still
open and data was being transferred.

You are still testing through your 2G GSM connection, right?
How can you be sure that this connection isn’t lagging?

Can you create a network capture on server-side?

Can you bypass the dynamic backend, e.g. are the clients able to fetch
a 2.1 MB static file through your nginx setup, just for testing?
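
(Something like this would do for the test; the docroot and hostname
are placeholders:)

    # create a 2.1 MB dummy file on a backend and fetch it through the whole chain
    dd if=/dev/urandom of=/var/www/html/testfile.bin bs=1024 count=2100
    curl --limit-rate 2K -o /dev/null http://frontend.example.com/testfile.bin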

Regards,
Igor

Hello!

On Tue, Aug 20, 2013 at 1:26 PM, Maxim D. [email protected]
wrote:

2013/08/20 09:34:31 [debug] 1692#0: *1101651 http upstream process non buffered downstream
2013/08/20 09:34:31 [info] 1692#0: *1101651 client timed out (110: Connection timed out) while sending to client, client: 80.219.149.116, server: , request: "POST /index.php/winclient/gnegg HTTP/1.0", upstream: "http://127.0.0.1:8081/index.php/winclient/gnegg", host: "REDACTED.popscan.ch"
2013/08/20 09:34:31 [debug] 1692#0: *1101651 finalize http upstream request: 408

After 60 seconds the timer fired and the client connection was closed
as timed out.

Yeah. That’s what I feared. But the connection was definitely still
open and data was being transferred.

Unfortunately, with location-level debug logs it’s not possible to
see event handling details (and that’s why it’s generally
recommended to activate debug log at global level, BTW).

any idea how to do this on a system that’s under load (60 requests per
second)? As I said before: When I do the same request on a system
that’s not under load, the problem doesn’t appear.

But I would suppose everything is fine there as well, and the problem is
actually a result of kernel’s behaviour.

I started suspecting as much. Any pointers on how I could work
around/fix the issue at the kernel level?

Philip



We hit the same problem. First off, to work around it, add the
following to your nginx configuration:

send_timeout 3600;

This increases the timeout to one hour, which is long enough to empty
the buffer over a slow connection.
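
(In context that is roughly the following; send_timeout is valid in the
http, server and location blocks, and 60 seconds is the default:)

    http {
        # default is 60s; raise it so very slow clients can drain the response
        send_timeout 3600;
    }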

As to what's happening during those 60 seconds, I can answer that:
traffic from the server continues to flow non-stop to the client
during the 60 seconds and beyond, until the client has downloaded as
much data as is reported in the access.log.

I am able to reproduce this problem by connecting with
"curl --limit-rate 5k" to localhost serving a static file from the
file system.

I hope this helps someone in the future.

/Igor