Accept() failed (53: Software caused connection abort)

Dear List,

We’ve switched to Nginx because it rocks!

I have a problem with a really heavily loaded webserver:

It has 100,000,000 requests per day and this number is increasing.
(Probably 200-300M would be the peak.) We are using it with a php-fcgi
backend.

We have two serious, constantly recurring error messages, which can be found below [1]:

[1] nginx error - Pastebin.com

writev() failed (54: Connection reset by peer)
accept() failed (53: Software caused connection abort)

When this error occurs, either the 50x page is shown, or the connection
drops entirely and the browser reports “could not connect to server”.

Our system is:

FreeBSD iridium 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51
UTC 2011 [email protected]:/usr/obj/usr/src/sys/GENERIC
amd64

HP ProLiant G7:

  • 32GB memory
  • CPU: Intel(R) Xeon(R) CPU @ 2.40GHz x 6 ( x 2 ) = 24 CPU
  • SAS disks

What do the error messages above mean? Is it a problem with nginx, with
PHP, or with kernel limits? (TCP limits?)

What should I tune? What should I set up?

The relevant nginx config parts are:

worker_processes 4;

events {
    worker_connections 4096;
}

upstream mybackend {
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
    server 127.0.0.1:9002;
}

sendfile           on;
keepalive_requests 0;

location ~ \.php$ {
    fastcgi_pass   mybackend;
    root           html;
    include        fastcgi_params;
    fastcgi_index  index.php;
}

Thanks in advance,


Adam PAPAI
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]

Hello!

On Wed, Jul 13, 2011 at 11:07:03AM +0200, Adam PAPAI wrote:

We have two serious, constantly recurring error messages, which can be found below [1]

Is it a problem with nginx, with PHP, or with kernel limits? (TCP limits?)
In no particular order:

2011/07/13 10:20:58 [error] 38061#0: accept() failed (53: Software
caused connection abort)

The client closed the connection before nginx was able to accept() it.
This may be normal (e.g. the user just closed a browser page while
images were still loading), or it may not be (e.g. nginx wasn’t able to
accept() for a long time due to some problem, and the client got bored
of waiting and closed the page).

Some number of such messages is expected to appear. A high number of
such errors may indicate problems; try looking at the listen queues
first (netstat -Lan).
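
For example, a quick way to keep an eye on just the PHP backend sockets
is a small shell loop like the one below (only a sketch; the port range
and the one-second interval are assumptions based on the config above):

    # poll the listen queues (qlen/incqlen/maxqlen) of the three
    # FastCGI backends once per second
    while :; do
        netstat -Lan | grep -E '127\.0\.0\.1\.900[0-2]'
        sleep 1
    done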

2011/07/13 10:26:56 [error] 38059#0: *73523 writev() failed (54:
Connection reset by peer) while sending request to upstream …

Your upstream reset the connection. This is a problem with your
backend (most likely it just died). Inspect the backend logs.

2011/07/13 10:31:33 [error] 38060#0: *113640 writev() failed (32: Broken
pipe) while sending request to upstream …

The same as above, but at a slightly different moment.

What should I tune? What should I set up?

From here it looks like you have problems with backend(s).

Maxim D.

Maxim D. wrote:

2011/07/13 10:31:33 [error] 38060#0: *113640 writev() failed (32: Broken pipe)
while sending request to upstream …

The same as above, but at a slightly different moment.

What should I tune? What should I set up?

From here it looks like you have problems with backend(s).

netstat -Lan shows:

Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen Local Address
tcp4 0/0/4096 *.80
tcp4 27/0/128 127.0.0.1.9002
tcp4 32/0/128 127.0.0.1.9001
tcp4 26/0/128 127.0.0.1.9000
tcp4 0/0/128 *.2818
tcp6 0/0/128 *.2818
tcp4 0/0/10 127.0.0.1.25
Some tcp sockets may have been created.
unix 0/0/4 /var/run/devd.pipe

But sometimes the qlen goes over 128 for the 3 PHP backends. That
should be the problem, I guess… hmmm.


Adam PAPAI
Software Development Director
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]
Phone: +36 30 33-55-735 (Hungary)

Maxim D. wrote:

Hello!

The client closed the connection before nginx was able to accept() it.
This may be normal (e.g. the user just closed a browser page while
images were still loading), or it may not be (e.g. nginx wasn’t able to
accept() for a long time due to some problem, and the client got bored
of waiting and closed the page).

Some number of such messages is expected to appear. A high number of
such errors may indicate problems; try looking at the listen queues
first (netstat -Lan).

Dear Maxim,

When netstat -Lan shows the queues going over the limit, e.g.:

tcp4 190/0/128 127.0.0.1.9002
tcp4 195/0/128 127.0.0.1.9001
tcp4 181/0/128 127.0.0.1.9000

nginx starts throwing the “writev() failed (54: Connection reset by
peer) while sending request to upstream” errors.

What should I increase? Any ideas? What is the real meaning of
maxqlen? If the qlen is greater than maxqlen, it means…?


Adam PAPAI
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]

Hello!

On Wed, Jul 13, 2011 at 01:41:59PM +0200, Adam PAPAI wrote:

tcp4 181/0/128 127.0.0.1.9000

nginx starts throwing the “writev() failed (54: Connection reset by
peer) while sending request to upstream” errors.

What should I increase? Any ideas? What is the real meaning of
maxqlen? If the qlen is greater than maxqlen, it means…?

Maxqlen is the maximum listen socket queue length (the queue of
connections that have completed the handshake but haven’t yet been
accept()'ed by the application), and FreeBSD will reset new connections
if it’s exhausted (or will just drop them if you have
net.inet.tcp.syncache.rst_on_sock_fail=0).
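
(For reference, which of the two behaviours is in effect can be checked
with the sysctl mentioned above; just an illustration:)

    # 1 = reset new connections when the listen queue is exhausted,
    # 0 = silently drop them instead
    sysctl net.inet.tcp.syncache.rst_on_sock_fail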

If you see qlen greater than maxqlen, it means that your app can’t
cope with the load. If this happens for a fraction of a second due to
connection bursts, this may be OK and you just need to increase the
queue length to compensate for the bursts.

But if you’ve been able to see it in the netstat output, it certainly
means there is something wrong with the backends. You either have to
add more backends or find and optimize the bottlenecks in the existing
ones.
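
(On the nginx side, adding backends is just extra server lines in the
upstream block you posted; the fourth entry here is purely hypothetical
and would need its own php-fcgi instance started separately:)

    upstream mybackend {
        server 127.0.0.1:9000;
        server 127.0.0.1:9001;
        server 127.0.0.1:9002;
        server 127.0.0.1:9003;   # hypothetical extra backend
    }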

Maxim D.

Hello!

On Wed, Jul 13, 2011 at 04:10:52PM +0200, Adam PAPAI wrote:

If you see qlen greater than maxqlen, it means that your app can’t
cope with the load. If this happens for a fraction of a second due to
connection bursts, this may be OK and you just need to increase the
queue length to compensate for the bursts.

The question is now: how to increase the queue length under FreeBSD.
I cannot find the config value in sysctl.

The system limit is kern.ipc.somaxconn. But as far as I remember you
have a 4096 queue for nginx, so it looks like you’ve already tuned
it. :)
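
(For completeness, this is roughly how that limit is raised on FreeBSD
and how nginx can be given an explicit backlog on its own listen
socket; the numbers are only examples, not recommendations:)

    # raise the system-wide listen queue limit
    # (add kern.ipc.somaxconn=4096 to /etc/sysctl.conf to persist it)
    sysctl kern.ipc.somaxconn=4096

    # nginx side: explicit backlog on the listen directive
    listen 80 backlog=4096;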

The question is how to tune it in your backend, but there is no
simple answer: this depends on the backend. E.g. php-fpm should use
kern.ipc.somaxconn as seen at process startup by default, and it has
an option (“backlog”) to tune it.
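
(With php-fpm, for example, that would look something like this in the
pool configuration; the value is just an example:)

    ; php-fpm pool config
    listen         = 127.0.0.1:9000
    listen.backlog = 1024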

But again: as long as you’ve been able to see overflows, just
increasing the listen queue isn’t likely to help. You have to add
more backends.

Maxim D.

Maxim D. wrote:

Hello!

Maxqlen is the maximum listen socket queue length (the queue of
connections that have completed the handshake but haven’t yet been
accept()'ed by the application), and FreeBSD will reset new connections
if it’s exhausted (or will just drop them if you have
net.inet.tcp.syncache.rst_on_sock_fail=0).

If you see qlen greater than maxqlen, it means that your app can’t
cope with the load. If this happens for a fraction of a second due to
connection bursts, this may be OK and you just need to increase the
queue length to compensate for the bursts.

The question is now: how to increase the queue length under FreeBSD. I
cannot find the config value in sysctl.


Adam PAPAI
Software Development Director
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]
Phone: +36 30 33-55-735 (Hungary)

Maxim D. wrote:

The system limit is kern.ipc.somaxconn. But as far as I remember you
have a 4096 queue for nginx, so it looks like you’ve already tuned it.

Maxim D.

It seems that after 2 days of investigation we’ve found the main
problem: I/O.

After rewriting the PHP code to switch from file-based session
handling to memory-based session handling, everything started to work
without error messages.

It seems that I/O heavily influences the web application, even if it’s
only a small and fast piece of PHP code.
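
(In case it helps anyone else, the switch can be as simple as pointing
PHP’s session handler at memcached instead of files; a sketch assuming
the memcached PECL extension and a local memcached on the default
port:)

    ; php.ini -- example values
    session.save_handler = memcached
    session.save_path    = "localhost:11211"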

Thanks for everything.


Adam PAPAI
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]

Maxim D. wrote:

Hello!

just increasing the listen queue isn’t likely to help. You have to add
more backends.

Maxim D.

It was hardcoded. I’ve changed it in the backend code. Now it’s 512.

Thanks for everything, I hope I can tune it well with this knowledge :)


Adam PAPAI
Grapes Communication Ltd.
http://www.grapes.hu
E-mail: [email protected]