Nginx upstream delays

Hello,

I have a question about sporadic long upstream response times I’m seeing
on (two of) our Nginx servers. It’s kind of hard to show and quantify,
but I’ll do my best.

One is a Django Gunicorn server. We included the upstream response time
in the Nginx access log and wrote a script to analyze them. What we see,
is that on the login page of a website (a page that does almost nothing)
95%-99% of ‘GET /customer/login/’ requests are processed within about 50
ms. The other few percent can take several seconds. Sometimes even 5s.
Our Munin graphs show no correlation in disk latency, cpu time, memory
use, etc.

I also added an access log to Gunicorn, so that I can see how long
Gunicorn takes to process requests that Nginx thinks take long. Gunicorn
has 8 workers. It can be seen that there is actually no delay in
Gunicorn. For instance, Nginx sees this (the long upstream response time
is marked red, 3.042s):

11.22.33.44 - - [06/Mar/2015:10:27:46 +0100] “GET /customer/login/
HTTP/1.1” 200 8310 0.061 0.121 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:48 +0100] “GET /customer/login/
HTTP/1.1” 200 8310 0.035 0.092 “-” “Echoping/6.0.2”
11.22.33.44 - - [ 06/Mar/2015:10:27:52 +0100] “GET /customer/login/
HTTP/1.1” 200 8310 3.042 3.098 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:53 +0100] “GET /customer/login/
HTTP/1.1” 200 8310 0.051 0.108 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:54 +0100] “GET /customer/login/
HTTP/1.1” 200 8310 0.038 0.096 “-” “Echoping/6.0.2”
x.x.x.x - - [06/Mar/2015:10:27:58 +0100] “POST
/customer/login/?next=/customer/home/ HTTP/1.1” 302 5 0.123 0.123

But then the corresponding Gunicorn logs shows normal response times
(the figure after ‘None’, in µs) (Corresponding line marked blue):

11.22.33.44 - - [06/Mar/2015:10:27:41] “GET /customer/login/ HTTP/1.0”
200 None 41686 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:42] “GET /customer/login/ HTTP/1.0”
200 None 27629 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:43] “GET /customer/login/ HTTP/1.0”
200 None 28143 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:44] “GET /customer/login/ HTTP/1.0”
200 None 41846 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:45] “GET /customer/login/ HTTP/1.0”
200 None 30192 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:46] “GET /customer/login/ HTTP/1.0”
200 None 59382 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:48] “GET /customer/login/ HTTP/1.0”
200 None 33308 “-” “Echoping/6.0.2”
11.22.33.44 - - [ 06/Mar/2015:10:27:52 ] “GET /customer/login/ HTTP/1.0”
200 None 39849 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:53] “GET /customer/login/ HTTP/1.0”
200 None 48321 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:27:54] “GET /customer/login/ HTTP/1.0”
200 None 36484 “-” “Echoping/6.0.2”
x.x.x.x - - [06/Mar/2015:10:27:58] “POST
/customer/login/?next=/customer/home/ HTTP/1.0” 302 None 122295
y.y.y.y - - [06/Mar/2015:10:28:02] “GET
/customer/login/?next=/customer/home/ HTTP/1.0” 200 None 97824
y.y.y.y - - [06/Mar/2015:10:28:03] “GET
/customer/login/?next=/customer/home/ HTTP/1.0” 200 None 78162
11.22.33.44 - - [06/Mar/2015:10:28:26] “GET /customer/login/ HTTP/1.0”
200 None 38350 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:28:27] “GET /customer/login/ HTTP/1.0”
200 None 31076 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:28:28] “GET /customer/login/ HTTP/1.0”
200 None 28536 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:28:30] “GET /customer/login/ HTTP/1.0”
200 None 30981 “-” “Echoping/6.0.2”
11.22.33.44 - - [06/Mar/2015:10:28:31] “GET /customer/login/ HTTP/1.0”
200 None 29920 “-” “Echoping/6.0.2”

As I said, there are currently 8 workers. I already increased them from
4. The log above shows that there are enough seconds between each
request that 8 workers should be able to handle it. I also created a
MySQL slow log, which doesn’t show the delays. MySQL is always fast.

Another server we have is Nginx with PHP-FPM (with 150 PHP children in
the pool), no database access. On one particular recent log of a few
hundred thousand entries, 99% of requests is done in 129ms. But one
response even took 3170ms. Its PHP proxy settings are:

location ~ .php$ {
fastcgi_split_path_info ^(.+.php)(/.+)$;

NOTE: You should have “cgi.fix_pathinfo = 0;” in php.ini

With php5-fpm:

fastcgi_pass unix:/var/run/php5-fpm.sock;
fastcgi_index index.php;
include fastcgi_params;
}

It seems something in the communication between Nginx and the service
behind it slows down sometimes, but I can’t figure out what it might be.
Any idea what it might be or how to diagnose it better?

Regards,

Wiebe C.

You are on a nginx mailing list, thus I will reply on the nginx side of
the
problem.

You can diagnose further to tell if the problem comes from nginx or from
the backend by using 2 different variables in your log message:
$request_time
http://nginx.org/en/docs/http/ngx_http_log_module.html#var_request_time
$upstream_response_time
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_response_time

If those values are close enough (most of the time equal), you might
then
conclude that the trouble does not come from nginx, but rather from the
backend (or communication between those).

If you want to investigate the communication level, you can set up some
tcpdump listening on the communication between nginx and the backend.
You
will need to use TCP ports to do that.
Since you are using UNIX sockets, you might want to monitor file
descriptors, but I would (totally out of thin air) suppose it might not
be
the source of your trouble, since you would have seen much more impact
if
it was.

I guess you will have to trace/dump stuff on your backend. PHP has some
slowlog capability firing up tracing in a code which takes too long to
finish. I do not know anything about Python servers, but you are not at
the
right location for questions related to those anyway.

Happy hunting,

B. R.

You also have to consider the rate your client get data from the server.
The request time is the entire time spent from the beginning of the
request
until the end of response.
So you may not have a problem on your server, just a lazy client :slight_smile:

Hello,

The $request_time and $upstream_request_time are already included. In
the log below it says ’ 3.042 3.098 '. The latter is the request time,
the former the upstream request time. It doesn’t seem to be an issue of
slow clients (also not for other log entries, they’re similar).

It’s going to be hard to diagnose if Gunicorn says the request takes 40
ms, but Nginx says it takes 3042 ms.

Hunting on…

  • Wiebe

----- Original Message -----

From: “Wandenberg P.” [email protected]
To: [email protected]
Sent: Tuesday, 10 March, 2015 1:37:09 AM
Subject: Re: Nginx upstream delays

You also have to consider the rate your client get data from the server.
The request time is the entire time spent from the beginning of the request
until the end of response.
So you may not have a problem on your server, just a lazy client :slight_smile:

On Mon, Mar 9, 2015 at 1:05 PM, B.R. < [email protected] > wrote:

You are on a nginx mailing list, thus I will reply on the nginx side of the
problem.

You can diagnose further to tell if the problem comes from nginx or from
the
backend by using 2 different variables in your log message:

$request_time

$upstream_response_time

If those values are close enough (most of the time equal), you might then
conclude that the trouble does not come from nginx, but rather from the
backend (or communication between those).

If you want to investigate the communication level, you can set up some
tcpdump listening on the communication between nginx and the backend. You
will need to use TCP ports to do that.

Since you are using UNIX sockets, you might want to monitor file
descriptors,
but I would (totally out of thin air) suppose it might not be the source of
your trouble, since you would have seen much more impact if it was.

I guess you will have to trace/dump stuff on your backend. PHP has some
slowlog capability firing up tracing in a code which takes too long to
finish. I do not know anything about Python servers, but you are not at the
right location for questions related to those anyway.

Happy hunting,


B. R.

On Mon, Mar 9, 2015 at 4:28 PM, Wiebe C. < [email protected] >
wrote:

Hello,

I have a question about sporadic long upstream response times I’m seeing
on
(two of) our Nginx servers. It’s kind of hard to show and quantify, but
I’ll
do my best.

I also added an access log to Gunicorn, so that I can see how long
Gunicorn
takes to process requests that Nginx thinks take long. Gunicorn has 8
workers. It can be seen that there is actually no delay in Gunicorn. For
instance, Nginx sees this (the long upstream response time is marked red,
3.042s):

But then the corresponding Gunicorn logs shows normal response times (the
figure after ‘None’, in µs) (Corresponding line marked blue):

None 59382 “-” “Echoping/6.0.2”
HTTP/1.0"

None 30981 “-” “Echoping/6.0.2”

11.22.33.44 - - [06/Mar/2015:10:28:31] “GET /customer/login/ HTTP/1.0”
200
None 29920 “-” “Echoping/6.0.2”

As I said, there are currently 8 workers. I already increased them from
4.
The log above shows that there are enough seconds between each request
that
8 workers should be able to handle it. I also created a MySQL slow log,
which doesn’t show the delays. MySQL is always fast.

Another server we have is Nginx with PHP-FPM (with 150 PHP children in
the
pool), no database access. On one particular recent log of a few hundred
thousand entries, 99% of requests is done in 129ms. But one response even
took 3170ms. Its PHP proxy settings are:

include fastcgi_params;

}

It seems something in the communication between Nginx and the service
behind
it slows down sometimes, but I can’t figure out what it might be. Any
idea
what it might be or how to diagnose it better?

Regards,

Hello!

On Mon, Mar 09, 2015 at 04:28:22PM +0100, Wiebe C. wrote:

Our Munin graphs show no correlation in disk latency, cpu time,
memory use, etc.

I also added an access log to Gunicorn, so that I can see how
long Gunicorn takes to process requests that Nginx thinks take
long. Gunicorn has 8 workers. It can be seen that there is
actually no delay in Gunicorn. For instance, Nginx sees this
(the long upstream response time is marked red, 3.042s):

The 3s time suggests there is a packetloss somewhere.


Maxim D.
http://nginx.org/

Then it means nginx waited 3.042 seconds after having finished sending
the
request to the backend (ie time waiting for an answer).
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_response_time

Try to get the definition of the time you mention from Gunicorn. Times
could be completely different from what you think they mean.

Have you tried to use tcpdump on the frontend/backend communication?
Does
it show anything interesting?
If the latency comes from neither ends specifically but from the
communication channel itself, you will need to dig further.
​If it comes fro mthe backend, it might also be some waiting time
(database? buffering?), which might not taken into account in the time
you
mentioned.

The only certainty is: nginx is responsible for an overhead of 56ms on
this
request. Definitely not a problem on its side, I would conclude.

B. R.

Does the time reported by Gunicorn match the upstream time reported by
nginx for the faulty request?

  • If yes, then the slowdown comes from Gunicorn (or most probably from
    the
    application within)
  • If no, then the shallow waters between front and backend needs
    inspection
    by whatever means available, system wide since you are using file
    descriptors and inodes from the filesystem with the UNIX sockets

I suggest you monitor the whole machine, which includes the whole OS and
what other processes do, and the whole impact on resources. You will
need
to add what happens elsewhere on the machine if you are using a shared
virtualized environment.
Sometimes stuff appearing to be loosely coupled interfere with each
other.
:o)

Once again, to me, the trouble definitely does not come from nginx.

B. R.

Hello,

The definition of the gunicorn time I mentioned is ‘request time in
microseconds’. Because Gunicorn only talks to Nginx, this would be the
time Gunicorn needs (after having received the request) to generate the
response and send it back to nginx, I would say. In most cases, this
time matches the nginx upstream response time closely. There is a few
milliseconds overhead, usually.

I haven’t tried a tcpdump. This is going to require some time to setup,
because the long delay time is not reproducible. I need to log a day’s
worth of data and then devise some useful filter. I’ll try to set
something up.

As for Maxim’s response about packet loss; it would have to be packet
loss on a file socket. Is this even possible?

  • Wiebe

----- Original Message -----

From: “B.R.” [email protected]
To: “nginx ML” [email protected]
Sent: Tuesday, 10 March, 2015 1:28:03 PM
Subject: Re: Nginx upstream delays

Then it means nginx waited 3.042 seconds after having finished sending the
request to the backend (ie time waiting for an answer).

http://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_response_time

Try to get the definition of the time you mention from Gunicorn. Times could
be completely different from what you think they mean.

Have you tried to use tcpdump on the frontend/backend communication? Does it
show anything interesting?
If the latency comes from neither ends specifically but from the
communication channel itself, you will need to dig further.
​If it comes fro mthe backend, it might also be some waiting time (database?
buffering?), which might not taken into account in the time you mentioned.

The only certainty is: nginx is responsible for an overhead of 56ms on this
request. Definitely not a problem on its side, I would conclude.

B. R.

On Tue, Mar 10, 2015 at 10:23 AM, Wiebe C. < [email protected] >
wrote:

Hello,

The $request_time and $upstream_request_time are already included. In the
log
below it says ’ 3.042 3.098 '. The latter is the request time, the former
the upstream request time. It doesn’t seem to be an issue of slow clients
(also not for other log entries, they’re similar).

It’s going to be hard to diagnose if Gunicorn says the request takes 40 ms,
but Nginx says it takes 3042 ms.

Hunting on…

  • Wiebe

On Mon, Mar 9, 2015 at 1:05 PM, B.R. < [email protected] > wrote:

You are on a nginx mailing list, thus I will reply on the nginx side of
the
problem.

$upstream_response_time

If those values are close enough (most of the time equal), you might
then
conclude that the trouble does not come from nginx, but rather from the
backend (or communication between those).

source
of
your trouble, since you would have seen much more impact if it was.

I guess you will have to trace/dump stuff on your backend. PHP has some
slowlog capability firing up tracing in a code which takes too long to
finish. I do not know anything about Python servers, but you are not at
the
right location for questions related to those anyway.

On Mon, Mar 9, 2015 at 4:28 PM, Wiebe C. < [email protected] >
wrote:

Hello,

Our
Munin
graphs show no correlation in disk latency, cpu time, memory use,
etc.

200 8310 0.035 0.092 “-” “Echoping/6.0.2”

11.22.33.44 - - [06/Mar/2015:10:27:54 +0100] "GET /customer/login/

But then the corresponding Gunicorn logs shows normal response times
(the
figure after ‘None’, in µs) (Corresponding line marked blue):

HTTP/1.0"
None 28143 “-” “Echoping/6.0.2”

11.22.33.44 - - [06/Mar/2015:10:27:48] "GET /customer/login/
200

/customer/login/?next=/customer/home/

11.22.33.44 - - [06/Mar/2015:10:28:28] "GET /customer/login/
200

NOTE: You should have “cgi.fix_pathinfo = 0;” in php.ini

}

It seems something in the communication between Nginx and the service
behind
it slows down sometimes, but I can’t figure out what it might be. Any
idea
what it might be or how to diagnose it better?

Regards,