Nginx, php-fpm and 502 errors

Hello,

I’m trying to understand why some of our production nginx/php-fpm
servers frequently return 502 errors. At that time “writev() failed
(107: Transport endpoint is not connected) while sending request to
upstream” is logged into error log.

Runnng strace -e connect,writev on nginx worker process frequently
shows:

connect(75, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = 0
writev(75,
[{“\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\6\260\0\0\17”…, 1752}],

  1. = 1752
    connect(759, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = 0
    writev(759,
    [{“\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\5\377\1\0\17”…, 1576}],
  2. = 1576
    connect(940, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = -1
    EAGAIN (Resource temporarily unavailable)
    connect(996, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = -1
    EAGAIN (Resource temporarily unavailable)
    connect(391, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = -1
    EAGAIN (Resource temporarily unavailable)
    connect(1120, {sa_family=AF_FILE, path=“/tmp/php-fpm.sock”}, 110) = -1
    EAGAIN (Resource temporarily unavailable)
    writev(996,
    [{“\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\6O\1\0\0172S”…, 1656}],
  3. = -1 ENOTCONN (Transport endpoint is not connected)
    writev(758, [{“HTTP/1.1 502 Bad Gateway\r\nServer”…, 157}], 1) = 157
    writev(940,
    [{“\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\6\320\0\0\17”…, 1784}],
  4. = -1 ENOTCONN (Transport endpoint is not connected)
    writev(658, [{“HTTP/1.1 502 Bad Gateway\r\nServer”…, 157}], 1) = 157

Could these EAGAIN from connect be related to ENOTCONN from writev?

The frequency of these increase if I decrease the number of php-cgi
processes and vice-versa. If I have enough php-cgi processes, they do
not occur at all. But this “enough” number is way too high for my taste
(I ended with 64).

What exactly happens that writev looses connection in the middle of the
writing?

Jure Pečar
http://jure.pecar.org

On Tue, Nov 20, 2007 at 04:17:00PM +0100, Jure Pe??ar wrote:

connect(391, {sa_family=AF_FILE, path="/tmp/php-fpm.sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)
connect(1120, {sa_family=AF_FILE, path="/tmp/php-fpm.sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)
writev(996, [{"\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\6O\1\0\0172S"…, 1656}], 1) = -1 ENOTCONN (Transport endpoint is not connected)
writev(758, [{“HTTP/1.1 502 Bad Gateway\r\nServer”…, 157}], 1) = 157
writev(940, [{"\1\1\0\1\0\10\0\0\0\1\0\0\0\0\0\0\1\4\0\1\6\320\0\0\17"…, 1784}], 1) = -1 ENOTCONN (Transport endpoint is not connected)
writev(658, [{“HTTP/1.1 502 Bad Gateway\r\nServer”…, 157}], 1) = 157

Could these EAGAIN from connect be related to ENOTCONN from writev?

Partially. Usually connect() to an unix stream is established at once
because
it’s localhost. It’s seem there is shortage of some resources.

By the way, it’s starnge that Linux returns EAGAIN instead of
EINPROGRESS.
It’s also strange that Linux does not return ENOTCONN error via
getsockopt(SO_ERROR).

The frequency of these increase if I decrease the number of php-cgi processes and vice-versa. If I have enough php-cgi processes, they do not occur at all. But this “enough” number is way too high for my taste (I ended with 64).

What exactly happens that writev looses connection in the middle of the writing?

It’s not the middle. It’s first FastCGI packet: “\1\1\0\1\0\10\0\0…”
The scenario is following:

connect() returns EAGAIN,
nginx adds socket to epoll
epoll reports about some condition (may be an error) on the socket
nginx writev()s FastCGI request and the writev() returns ENOTCONN.

On Tue, 20 Nov 2007 19:16:39 +0300
Igor S. [email protected] wrote:

By the way, it’s starnge that Linux returns EAGAIN instead of EINPROGRESS.

I’ve read somewhere that this is one of those “linux knows better than
posix” things …

It’s also strange that Linux does not return ENOTCONN error via
getsockopt(SO_ERROR).

Don’t know about this … This nginx runs inside Virtuozzo VPS, could
this
have some effect?

It’s not the middle. It’s first FastCGI packet: “\1\1\0\1\0\10\0\0…”
The scenario is following:

connect() returns EAGAIN,
nginx adds socket to epoll
epoll reports about some condition (may be an error) on the socket
nginx writev()s FastCGI request and the writev() returns ENOTCONN.

I see. Is is possible/feasible to check what epoll reports before doing
writev()?

Our situation is like this:
openads with 500-1000 hits per second
still uses mysql with myisam (table locks!)
every now and then some db operation locks table for more than a few
miliseconds
php-cgi processes get stuck waiting for database
nginx spams error log, because fixed sized pool of php processes wont
talk
to it anymore

… is there anything useful to do on the nginx side?
Or should I just set /dev/null for error log? :wink:

Jure Pečar
http://jure.pecar.org/

On Tue, Nov 20, 2007 at 07:45:59PM +0100, Jure Pe??ar wrote:

Don’t know about this … This nginx runs inside Virtuozzo VPS, could this
have some effect?

I do not know.

It’s not the middle. It’s first FastCGI packet: “\1\1\0\1\0\10\0\0…”
The scenario is following:

connect() returns EAGAIN,
nginx adds socket to epoll
epoll reports about some condition (may be an error) on the socket
nginx writev()s FastCGI request and the writev() returns ENOTCONN.

I see. Is is possible/feasible to check what epoll reports before doing
writev()?

Unlike kqueue, epoll can not report any errors.
On other side, nginx writes kqueue error reports on the same logging
level as read()/write()'s errors.

Or should I just set /dev/null for error log? :wink:
The error_log exists to report errors. Otherwise you will get Windows -
something goes wrong but you do not know why, you may only press
the "OK "button.

You may try to set “alert” level:

   error_log   /path/to/log   alert;

in this location.

Hello Jure,

Wednesday, November 21, 2007, 2:05:22 PM, you wrote:

On Tue, 20 Nov 2007 22:31:10 +0300
Igor S. [email protected] wrote:

The error_log exists to report errors. Otherwise you will get Windows -
something goes wrong but you do not know why, you may only press
the "OK "button.

I understand - nobody wants that :wink:

You may try to set “alert” level:

   error_log   /path/to/log   alert;

in this location.

Did that - seems to be the best thing to do right now.

Can we do anything else in situations where nginx doesn’t have any free php cgi processes left but wants to use them?

Try to increase backlog in the fcgi spawn utility.

On Tue, 20 Nov 2007 22:31:10 +0300
Igor S. [email protected] wrote:

The error_log exists to report errors. Otherwise you will get Windows -
something goes wrong but you do not know why, you may only press
the "OK "button.

I understand - nobody wants that :wink:

You may try to set “alert” level:

   error_log   /path/to/log   alert;

in this location.

Did that - seems to be the best thing to do right now.

Can we do anything else in situations where nginx doesn’t have any free
php cgi processes left but wants to use them?

Thanks for help,

Jure Pečar
http://jure.pecar.org

On Wed, 21 Nov 2007 14:23:05 +0600
“Denis F. Latypoff” [email protected] wrote:

Try to increase backlog in the fcgi spawn utility.

We use php-fpm patch for php, where backlog is set to -1 by default.
Thanks for hint, will examine what that means :slight_smile:

Jure Pečar
http://jure.pecar.org

Hello Jure,

Wednesday, November 21, 2007, 2:27:30 PM, you wrote:

On Wed, 21 Nov 2007 14:23:05 +0600
“Denis F. Latypoff” [email protected] wrote:

Try to increase backlog in the fcgi spawn utility.

We use php-fpm patch for php, where backlog is set to -1 by default. Thanks
for hint, will examine what that means :slight_smile:

I thinks it’s means that listen(2) will use SOMAXCONN which is can be
set by
sysctl:

Linux: net.core.somaxconn (default is 128)
FreeBSD: kern.ipc.somaxconn (default is 128)

On Wed, Nov 21, 2007 at 09:27:30AM +0100, Jure Pe??ar wrote:

On Wed, 21 Nov 2007 14:23:05 +0600
“Denis F. Latypoff” [email protected] wrote:

Try to increase backlog in the fcgi spawn utility.

We use php-fpm patch for php, where backlog is set to -1 by default. Thanks for hint, will examine what that means :slight_smile:

It seems that -1 backlog is 0 for Linux.
Recently there was a change in nginx listen backlog. Now it’s 511 for
all platforms except FreeBSD, where it’s still -1.