Seeing some odd behaviour with 0.8.11.1 (only running it on one production machine so far)… has a change been made to connection handling or status output?
2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105
What functionality do you use - proxy/fastcgi/memcached/etc ?
proxy and fastcgi, both of which appeared to be working fine. Weirdly, on my WordPress hosts I was correctly receiving the page (PHP via fastcgi), but not some of the static files, so it would just sit there waiting for ages… perhaps keepalive requests were failing?
Static files on other hosts were working (one static html file delivered by nginx + a bunch of generated images behind the nginx proxy – collectd).
Build differences between my 0.8.10 and 0.8.11.1 packages -> added the geoip and static_gzip modules.
(For now, I’ve switched back to 0.8.10, but can do more testing with 0.8.11.1 if required.)
On Sun, Aug 30, 2009 at 12:33:02AM +1000, Jeff W. wrote:
Are there any alerts in error_log:
grep alert /path/to/log;
Lots of:
2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105
This is probably after reconfiguration.
Build differences between my 0.8.10 and 0.8.11.1 packages -> added the geoip
and static_gzip modules.
(For now, I’ve switched back to 0.8.10, but can do more testing with
0.8.11.1 if required.)
There are significant changes in 0.8.11, but your case is strange: I’m testing 0.8.11 on a host with 400r/s with a lot of proxied SSIed subrequests and static files, and there is no connection leak.
2009/08/29 23:15:52 [alert] 8716#0: open socket #152 left in connection 105
This is probably after reconfiguration.
Sorry, yes, I should’ve checked the time -> that was from when I reverted to 0.8.10… there were no alerts prior to that.
There are significant changes in 0.8.11, but your case is strange: I’m testing 0.8.11 on a host with 400r/s with a lot of proxied SSIed subrequests and static files, and there is no connection leak.
Have you built any third party modules ?
None.
btw, the majority of my non-static traffic is fastcgi over unix socket. I’m going to do the upgrade again and watch the activity.
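(For context, "fastcgi over unix socket" in nginx terms is roughly the sketch below; the location pattern and socket path are illustrative placeholders, not the actual configuration in question.)

    location ~ \.php$ {
        include        fastcgi_params;                                   # standard FastCGI variables
        fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
        fastcgi_pass   unix:/var/run/php-fastcgi.sock;                   # hypothetical local socket path
    }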
On Sun, Aug 30, 2009 at 11:00:10AM -0400, Jim O. wrote:
This is probably after reconfiguration.
Have you built any third party modules ?
enable coredumps
do not forget to set NN from log message.
OS is CentOS 5.3 Linux - 2.6.18-028stab062.3
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: open socket #1017 left in connection 1010
Other servers seem to be running fine including ones with busy sites.
For the moment I have reverted that VPS to 0.8.10.
Could you do the following:
1) enable coredumps
2) set in nginx.conf (see the sketch after this list):
   debug_points abort;
3) reconfigure nginx, if there are open connections, then nginx creates coredump on exit
4) look in log for alerts: open socket #… left in connection NN
5) run “gdb /path/to/nginx /path/to/core”, then
   p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
   p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count
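For what it’s worth, a minimal sketch of the nginx.conf side of steps 1) and 2) (main context; only debug_points abort comes from the instructions above, the core size and working directory values are placeholders):

    worker_rlimit_core  500m;             # allow worker processes to write core files (placeholder size)
    working_directory   /var/tmp/nginx;   # where cores are dumped; must be writable by the workers (placeholder path)
    debug_points        abort;            # abort and dump core when nginx hits a debug point,
                                          # e.g. connections left open on reconfiguration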
OS is CentOS 5.3 Linux - 2.6.18-028stab062.3
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: open socket #1017 left in connection 1010
Other servers seem to be running fine including ones with busy sites.
Do you want nginx reconfigured “--with-debug” or is there another option you need?
look in log for alerts: open socket #… left in connection NN
run “gdb /path/to/nginx /path/to/core”, then
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count
3) reconfigure nginx, if there are open connections, then nginx creates coredump on exit
4) look in log for alerts: open socket #… left in connection NN
5) run “gdb /path/to/nginx /path/to/core”, then
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count
do not forget to set NN from log message.
Igor,
On one server I seem to be seeing the same issue. Twice since upgrading
to 0.8.11 I have found the site unresponsive. It’s a relatively slow
vBulletin site on a very small VPS. We serve a fair bit of static
content (photos mostly) and PHP via fastcgi.
OS is CentOS 5.3 Linux - 2.6.18-028stab062.3
sample of error log (it grew to 36 GB in a few hours):
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
2009/08/30 07:50:20 [alert] 2043#0: accept() failed (24: Too many open files)
…
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
2009/08/30 10:29:00 [alert] 2042#0: accept() failed (24: Too many open files)
When I did shut it down I saw hundreds of lines like:
2009/08/30 10:29:00 [alert] 2042#0: open socket #1010 left in connection 1002
2009/08/30 10:29:00 [alert] 2042#0: open socket #981 left in connection 1003
2009/08/30 10:29:00 [alert] 2042#0: open socket #994 left in connection 1004
2009/08/30 10:29:00 [alert] 2042#0: open socket #1017 left in connection 1005
2009/08/30 10:29:00 [alert] 2042#0: open socket #1012 left in connection 1006
2009/08/30 10:29:00 [alert] 2042#0: open socket #1011 left in connection 1007
2009/08/30 10:29:00 [alert] 2042#0: open socket #1016 left in connection 1008
2009/08/30 10:29:00 [alert] 2042#0: open socket #1014 left in connection 1009
2009/08/30 10:29:00 [alert] 2042#0: open socket #1019 left in connection 1010
2009/08/30 10:29:00 [alert] 2042#0: open socket #1020 left in connection 1011
2009/08/30 10:29:00 [alert] 2042#0: open socket #1013 left in connection 1012
2009/08/30 10:29:00 [alert] 2042#0: open socket #995 left in connection 1013
2009/08/30 10:29:00 [alert] 2042#0: open socket #988 left in connection 1014
2009/08/30 10:29:00 [alert] 2042#0: open socket #1023 left in connection 1015
2009/08/30 10:29:00 [alert] 2042#0: aborting
Other servers seem to be running fine including ones with busy sites.
For the moment I have reverted that VPS to 0.8.10.
reconfigure nginx, if there are open connections, then nginx creates coredump on exit
When you say ‘reconfigure’, do you mean ‘nginx -s reload’? I’m seeing an
‘aborted’ line in the error log, but have yet to see a coredump (which I
have enabled and made sure it works with kill -11).
Unfortunately I don’t think it gave too much information.
I watched connections gradually rise. I have ulimit -n set to 1024, two
workers, 1024 connections/worker. As connections neared 2048 the site
became unresponsive and load went up dramatically.
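For reference, the nginx.conf directives behind those numbers (values as described above; the commented worker_rlimit_nofile line is a suggestion for raising the per-worker descriptor limit from nginx itself rather than via ulimit, not part of the setup described):

    worker_processes  2;
    # worker_rlimit_nofile  2048;   # optional alternative to raising ulimit -n in the shell
    events {
        worker_connections  1024;
    }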
I began to see the same errors in the log. Nginx did not abort on its
own so I killed it after a few minutes. I then saw the same entries in
the error log like:
2009/08/30 22:22:40 [alert] 6118#0: open socket #980 left in connection 993
I ran gdb on the core but this was the output from three connections:
On Sun, Aug 30, 2009 at 10:55:57PM -0400, Jim O. wrote:
Do you want nginx reconfigured “–with-debug” or is there another option
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
I began to see the same errors in the log. Nginx did not abort on its
own so I killed it after a few minutes. I then saw the same entries in
the error log like:
2009/08/30 22:22:40 [alert] 6118#0: open socket #980 left in connection 993
nginx aborts only when you send -HUP and it finds leaked connections.
This GDB was configured as “x86_64-redhat-linux-gnu”…Using host
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[1010]->data)->uri
(gdb) p ((ngx_connection_t *) ngx_cycle->connections[993]->data)->main->count
Cannot access memory at address 0x130
(gdb) quit
[root@mars proc]#
During this time there were hundreds of connections in “CLOSE_WAIT”
state. They gradually increased to just over 1000 when it crashed.
Sorry, I made a mistake:
p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->uri
p ((ngx_http_request_t *) ngx_cycle->connections[1014].data)->main->count
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->uri
p ((ngx_connection_t *) ngx_cycle->connections[NN]->data)->main->count
Log output:
2009/08/31 17:10:39 [notice] 24934#0: signal process started
2009/08/31 17:10:39 [alert] 24927#0: open socket #52 left in connection 4
2009/08/31 17:10:39 [alert] 24927#0: open socket #54 left in connection 6
2009/08/31 17:10:39 [alert] 24927#0: open socket #57 left in connection 9
2009/08/31 17:10:39 [alert] 24927#0: open socket #63 left in connection 11
2009/08/31 17:10:39 [alert] 24927#0: open socket #60 left in connection 12
2009/08/31 17:10:39 [alert] 24927#0: open socket #58 left in connection 13
2009/08/31 17:10:39 [alert] 24927#0: aborting
2009/08/31 17:10:39 [alert] 24924#0: worker process 24927 exited on signal 6 (core dumped)
Using the gdb lines from your later followup:
Core was generated by `nginx: worker pr’.
Program terminated with signal 6, Aborted.
[New process 24927]
#0  0xf57fe416 in __kernel_vsyscall ()
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[13].data)->uri
$1 = {len = 32, data = 0x9416063 “/wp-includes/js/jquery/jquery.js”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[13].data)->main->count
$2 = 1
…
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->uri
$6 = {len = 10, data = 0x9411493 “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->count
$7 = 2
On Mon, Aug 31, 2009 at 05:14:57PM +1000, Jeff W. wrote:
2009/08/31 17:10:39 [alert] 24927#0: open socket #57 left in connection 9
Core was generated by `nginx: worker pr’.
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->uri
$6 = {len = 10, data = 0x9411493 “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/”}
(gdb) p ((ngx_http_request_t *) ngx_cycle->connections[9].data)->main->count
$7 = 2
Anything else you want me to poke at?
As I understand it, “/wp-includes/js/jquery/jquery.js” is static, while “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/” is a fastcgi-handled request?
Could you post the configuration part related to these requests?
As I understand it, “/wp-includes/js/jquery/jquery.js” is static, while “/index.php?q=/blog/2006/03/15/requiem-for-linux-journal/” is a fastcgi-handled request?
That’s right.
Could you post the configuration part related to these requests?
}
The point behind the above logic is to try and serve a file first from the domain’s root (such as favicon.ico), then from the generic $wordpress_root for files provided by the WordPress distribution (such as everything under /wp-includes as with the .js example in the debugging), then finally pass anything else to PHP.
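A minimal sketch of that kind of fallback logic, for illustration only (this is not the poster’s actual configuration; the roots, named locations and fastcgi socket are assumed placeholders, and the rewrite of the original URI into the q= argument seen in the gdb output is omitted):

    location / {
        root       /var/www/example.com/htdocs;         # per-domain root (e.g. favicon.ico lives here)
        try_files  $uri @wordpress;                      # 1) file in the domain's own root
    }

    location @wordpress {
        root       /usr/share/wordpress;                 # generic WordPress distribution root
        try_files  $uri @php;                            # 2) file shipped with WordPress, e.g. /wp-includes/...
    }

    location @php {                                      # 3) everything else goes to PHP via fastcgi
        include        fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME  /usr/share/wordpress/index.php;
        fastcgi_pass   unix:/var/run/php-fastcgi.sock;   # hypothetical socket
    }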