Weird 0.8.11.1 connections spike

Howdy,

Seeing some odd behaviour with 0.8.11.1 (only running it on one production
machine so far)… has a change been made to connection handling or status
output?

Active connections chart from the last hour:

http://www.gnome.org/~jdub/2009/nginx-0.8.11-hour.png

As you can see it’s incredibly out of place based on the daily activity:

http://www.gnome.org/~jdub/2009/nginx-0.8.11-day.png

Any thoughts?

Thanks,

  • Jeff

On Sat, Aug 29, 2009 at 10:50:43PM +1000, Jeff W. wrote:

As you can see it’s incredibly out of place based on the daily activity:

http://www.gnome.org/~jdub/2009/nginx-0.8.11-day.png

Any thoughts?

Are there any alerts in error_log:

grep alert /path/to/log;

?

What functionality do you use - proxy/fastcgi/memcached/etc ?

Are there any alerts in error_log:

grep alert /path/to/log;

Lots of:

2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105

What functionality do you use - proxy/fastcgi/memcached/etc ?

proxy and fastcgi, both of which appeared to be working fine. Weirdly, on my
WordPress hosts I was correctly receiving the page (PHP via fastcgi), but
not some of the static files, so it would just sit there waiting for ages…
perhaps keepalive requests were failing?

Static files on other hosts were working (one static html file delivered by
nginx + a bunch of generated images behind the nginx proxy – collectd).

Build differences between my 0.8.10 and 0.8.11.1 packages -> added the geoip
and gzip_static modules.

(For now, I’ve switched back to 0.8.10, but can do more testing with
0.8.11.1 if required.)

Thanks,

  • Jeff

On Sun, Aug 30, 2009 at 12:33:02AM +1000, Jeff W. wrote:

Are there any alerts in error_log:

grep alert /path/to/log;

Lots of:

2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105

This is probably after reconfiguration.

Build differences between my 0.8.10 and 0.8.11.1 packages -> added the geoip
and gzip_static modules.

(For now, I’ve switched back to 0.8.10, but can do more testing with
0.8.11.1 if required.)

There are significant changes in 0.8.11, but your case is strange: I’m
testing 0.8.11 on a host with 400 r/s with a lot of proxied SSI
subrequests and static files, and there is no connection leak.

Have you built any third party modules ?

2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105

This is probably after reconfiguration.

Sorry, yes, I should’ve checked the time -> that was from when I reverted to
0.8.10… there were no alerts prior to that.

There are significant changes in 0.8.11, but your case is strange: I’m
testing 0.8.11 on a host with 400 r/s with a lot of proxied SSI
subrequests and static files, and there is no connection leak.

Have you built any third party modules ?

None.

btw, the majority of my non-static traffic is fastcgi over unix socket. I’m
going to do the upgrade again and watch the activity.

Thanks,

  • Jeff

On Sun, Aug 30, 2009 at 01:45:30AM +1000, Jeff W. wrote:

testing 0.8.11 on a host with 400 r/s with a lot of proxied SSI
subrequests and static files, and there is no connection leak.

Have you built any third party modules ?

None.

btw, the majority of my non-static traffic is fastcgi over unix socket. I’m
going to do the upgrade again and watch the activity.

Could you do the following:

  1. enable coredumps

  2. set in nginx.conf:
    debug_points abort;

  3. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

  4. look in log for alerts: open socket #… left in connection NN

  5. run “gdb /path/to/nginx /path/to/core”, then

    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

    do not forget to set NN from log message.
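
For reference, a minimal sketch of the nginx.conf bits involved in the steps above. The paths and core size are placeholders, and worker_rlimit_core is my addition; working_directory only comes up later in this thread:

worker_rlimit_core  500m;           # let workers write a core file (size is a placeholder)
working_directory   /var/tmp/nginx; # hypothetical path writable by the worker processes
debug_points        abort;          # abort() so a leaked connection produces a coredump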

On Sun, Aug 30, 2009 at 11:00:10AM -0400, Jim O. wrote:

OS is CentOS 5.3 Linux - 2.6.18-028stab062.3

2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: open socket #1017 left in connection 1010

Other servers seem to be running fine including ones with busy sites.
For the moment I have reverted that VPS to 0.8.10.

Could you do the following:

  1. enable coredumps

  2. set in nginx.conf:
    debug_points abort;

  3. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

  4. look in log for alerts: open socket #… left in connection NN

  5. run “gdb /path/to/nginx /path/to/core”, then

    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

    where NN is NN from log message.

Igor S. wrote:

  1. enable coredumps

  2. set in nginx.conf:
    debug_points abort;

  3. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

Do you want nginx reconfigured “--with-debug” or is there another option you need?

  1. look in log for alerts: open socket #… left in connection NN

  2. run “gdb /path/to/nginx /path/to/core”, then

    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

    where NN is NN from log message.

Jim

On Sun, Aug 30, 2009 at 11:52:51AM -0400, Jim O. wrote:

  1. enable coredumps
  2. set in nginx.conf:
    debug_points abort;
  3. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

Do you want nginx reconfigured “--with-debug” or is there another option you need?

No. The coredump is enough; it just needs to have debug info (the gcc -g
option).

Igor S. wrote:

  4. look in log for alerts: open socket #… left in connection NN

  5. run “gdb /path/to/nginx /path/to/core”, then

    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
    p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

    do not forget to set NN from log message.

Igor,

On one server I seem to be seeing the same issue. Twice since upgrading
to 0.8.11 I have found the site unresponsive. It’s a relatively slow
vBulletin site on a very small VPS. We serve a fair bit of static
content (photos mostly) and PHP via fastcgi.

OS is CentOS 5.3 Linux - 2.6.18-028stab062.3

sample of error log (it grew to 36 GB in a few hours):

2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)

2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)

When I did shut it down I saw hundreds of lines like:

2009/08/30 10:29:00 [alert] 2042#0: open socket #1010 left in connection 1002
2009/08/30 10:29:00 [alert] 2042#0: open socket #981 left in connection 1003
2009/08/30 10:29:00 [alert] 2042#0: open socket #994 left in connection 1004
2009/08/30 10:29:00 [alert] 2042#0: open socket #1017 left in connection 1005
2009/08/30 10:29:00 [alert] 2042#0: open socket #1012 left in connection 1006
2009/08/30 10:29:00 [alert] 2042#0: open socket #1011 left in connection 1007
2009/08/30 10:29:00 [alert] 2042#0: open socket #1016 left in connection 1008
2009/08/30 10:29:00 [alert] 2042#0: open socket #1014 left in connection 1009
2009/08/30 10:29:00 [alert] 2042#0: open socket #1019 left in connection 1010
2009/08/30 10:29:00 [alert] 2042#0: open socket #1020 left in connection 1011
2009/08/30 10:29:00 [alert] 2042#0: open socket #1013 left in connection 1012
2009/08/30 10:29:00 [alert] 2042#0: open socket #995 left in connection 1013
2009/08/30 10:29:00 [alert] 2042#0: open socket #988 left in connection 1014
2009/08/30 10:29:00 [alert] 2042#0: open socket #1023 left in connection 1015
2009/08/30 10:29:00 [alert] 2042#0: aborting

Other servers seem to be running fine including ones with busy sites.
For the moment I have reverted that VPS to 0.8.10.

Jim

  1. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

When you say ‘reconfigure’, do you mean ‘nginx -s reload’? I’m seeing an
‘aborted’ line in the error log, but have yet to see a coredump (which I
have enabled and made sure it works with kill -11).

Thanks,

  • Jeff

Igor S. wrote:

p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

where NN is NN from log message.

Unfortunately I don’t think it gave too much information.

I watched connections gradually rise. I have ulimit -n set to 1024, two
workers, 1024 connections/worker. As connections neared 2048 the site
became unresponsive and load went up dramatically.
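
For reference, those limits correspond to something like the following sketch (the actual nginx.conf was not posted, so this only restates the numbers above):

worker_processes  2;

events {
    worker_connections  1024;
}

# With ulimit -n also at 1024, a worker can hit its per-process fd limit
# (client socket + upstream socket + open file per request) before reaching
# worker_connections, which would explain the "Too many open files" alerts.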

I began to see the same errors in the log. Nginx did not abort on its
own so I killed it after a few minutes. I then saw the same entries in
the error log like:

2009/08/30 22:22:40 [alert] 6118#0: open socket #980 left in connection 993

I ran gdb on the core but this was the output from three connections:

[root@mars proc]# gdb /vz/private/101/fs/root/usr/local/sbin/nginx ./kcore
GNU gdb Red Hat Linux (6.5-37.el5_2.2rh)
Copyright © 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type “show copying” to see the conditions.
There is absolutely no warranty for GDB. Type “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”…Using host libthread_db library “/lib64/libthread_db.so.1”.

warning: core file may not match specified executable file.
Core was generated by `ro root=LABEL=/ console=tty0 console=ttyS1,19200n8 debug’.
#0 0x0000000000000000 in ?? ()
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->uri
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->uri
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->uri
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1014]->data)->main->count
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->main->count
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->uri
Cannot access memory at address 0x130
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->main->count
Cannot access memory at address 0x130
(gdb) quit
[root@mars proc]#

During this time there were hundreds of connections in “CLOSE_WAIT”
state. They gradually increased to just over 1000 when it crashed.

Jim

On Sun, Aug 30, 2009 at 10:55:57PM -0400, Jim O. wrote:

Do you want nginx reconfigured “--with-debug” or is there another option

p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri

I began to see the same errors in the log. Nginx did not abort on its
own so I killed it after a few minutes. I then saw the same entries in
the error log like:

2009/08/30 22:22:40 [alert] 6118#0: open socket #980 left in connection 993

nginx aborts only when you send it -HUP and it finds leaked connections.

This GDB was configured as “x86_64-redhat-linux-gnu”…
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->uri
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->main->count
Cannot access memory at address 0x130
(gdb) quit
[root@mars proc]#

During this time there were hundreds of connections in “CLOSE_WAIT”
state. They gradually increased to just over 1000 when it crashed.

Sorry, I made a mistake:

p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->uri
p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->main->count

When you say ‘reconfigure’, do you mean ‘nginx -s reload’? I’m seeing an

Yes.

‘aborted’ line in the error log, but have yet to see a coredump (which I
have enabled and made sure it works with kill -11).

Probably, you need also set in nginx.conf:

working_directory /path/where/nginx/workers/can/write;

Aha! The workers… that’s probably the issue – thanks.

  • Jeff

On Mon, Aug 31, 2009 at 01:04:07PM +1000, Jeff W. wrote:

  1. reconfigure nginx; if there are open connections, then nginx creates a
    coredump on exit

When you say ‘reconfigure’, do you mean ‘nginx -s reload’? I’m seeing an

Yes.

‘aborted’ line in the error log, but have yet to see a coredump (which I
have enabled and made sure it works with kill -11).

Probably, you need also set in nginx.conf:

working_directory /path/where/nginx/workers/can/write;

p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count

Log output:

2009/08/31 17:10:39 [notice] 24934#0: signal process started
2009/08/31 17:10:39 [alert] 24927#0: open socket #52 left in connection 4
2009/08/31 17:10:39 [alert] 24927#0: open socket #54 left in connection 6
2009/08/31 17:10:39 [alert] 24927#0: open socket #57 left in connection 9
2009/08/31 17:10:39 [alert] 24927#0: open socket #63 left in connection 11
2009/08/31 17:10:39 [alert] 24927#0: open socket #60 left in connection 12
2009/08/31 17:10:39 [alert] 24927#0: open socket #58 left in connection 13
2009/08/31 17:10:39 [alert] 24927#0: aborting
2009/08/31 17:10:39 [alert] 24924#0: worker process 24927 exited on signal 6 (core dumped)

Using the gdb lines from your later followup:

Core was generated by `nginx: worker pr’.
Program terminated with signal 6, Aborted.
[New process 24927]
#0 0xf57fe416 in __kernel_vsyscall ()
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[13].data)->uri
$1 = {len = 32, data = 0x9416063 “/wp-includes/js/jquery/jquery.js”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[13].data)->main->count
$2 = 1

(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->uri
$6 = {len = 10, data = 0x9411493 “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->count
$7 = 2

Anything else you want me to poke at?

Thanks,

  • Jeff

On Mon, Aug 31, 2009 at 05:14:57PM +1000, Jeff W. wrote:

2009/08/31 17:10:39 [alert] 24927#0: open socket #57 left in connection 9
Core was generated by `nginx: worker pr’.

(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->uri
$6 = {len = 10, data = 0x9411493 “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->count
$7 = 2

Anything else you want me to poke at?

As I understand it, “/wp-includes/js/jquery/jquery.js” is a static file,
while “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/” is a
fastcgi-handled request?

Could you post the configuration part related to these requests?

Anything else you want me to poke at?

As I understand it, “/wp-includes/js/jquery/jquery.js” is a static file,
while “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/” is a
fastcgi-handled request?

That’s right.

Could you post the configuration part related to these requests?

set $wordpress_root /srv/wordpress/root;

location / {
    try_files $uri $uri/ @wordroot;
}

location @wordroot {
    internal;
    root $wordpress_root;
    index index.php;
    try_files $uri $uri/ /index.php?q=$request_uri;
    expires 5y;
}

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass fcgi_php;
    fastcgi_param SCRIPT_FILENAME $wordpress_root$fastcgi_script_name;
}

(fcgi_php is an upstream block which points to a single unix socket for the
php5-cgi process.)
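
Something like this sketch, for what it’s worth (the socket path is hypothetical; the real one isn’t shown here):

upstream fcgi_php {
    # a single php5-cgi process listening on a unix socket (path assumed)
    server unix:/var/run/php5-cgi.sock;
}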

Thanks,

  • Jeff

On Mon, Aug 31, 2009 at 05:14:57PM +1000, Jeff W. wrote:

2009/08/31 17:10:39 [alert] 24927#0: open socket #57 left in connection 9
Core was generated by `nginx: worker pr’.

(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->uri
$6 = {len = 10, data = 0x9411493 “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->count
$7 = 2

Anything else you want me to poke at?

p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->blocked

set $wordpress_root /srv/wordpress/root;

Should’ve mentioned:

root /srv/example.com/root;

The point behind the above logic is to try and serve a file first from the
domain’s root (such as favicon.ico), then from the generic $wordpress_root
for files provided by the WordPress distribution (such as everything under
/wp-includes as with the .js example in the debugging), then finally pass
anything else to PHP.
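
Roughly, the two requests seen in the coredump resolve like this (an illustrative walk-through of the config above, using the paths already mentioned):

# GET /wp-includes/js/jquery/jquery.js
#   location /        -> try_files misses under /srv/example.com/root
#   @wordroot         -> found under /srv/wordpress/root, served as a static file
#
# GET /blog/2006/03/15/requiem-for-linux-journal/
#   location /        -> miss
#   @wordroot         -> miss, falls back to /index.php?q=$request_uri
#   location ~ \.php$ -> handled by fastcgi_pass fcgi_php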

Thanks,

  • Jeff