Forum: NGINX Lots of CLOSE_WAIT sockets, nginx+php (WordPress site)

Posted by Vicente Aguilar (Guest)
on 2010-02-21 11:20
(Received via mailing list)
Hi

I have a WordPress-mu site (a couple personal and friends' blogs, very 
light traffic) which I migrated some months ago from lighttpd+php-fcgi 
to nginx+php-fcgi. Ever since the migration the site sometimes goes
down. I never had the time to look into it and just wrote a script
that monitors the site and restarts everything when it goes down.
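
A check of this kind can be sketched roughly as follows. This is a minimal sketch, not the script actually used: the fetch from localhost and the 20-second timeout come from a later post in this thread (where the check is described as a wget against localhost), while the function name `site_is_up` and the default URL are assumptions made for illustration.

```python
import http.client
import urllib.request


def site_is_up(url="http://127.0.0.1/", timeout=20):
    """Return True if `url` answers with a non-error HTTP status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and socket timeouts derive from OSError;
        # malformed HTTP replies raise HTTPException.
        return False
```

A wrapper script could call `site_is_up()` periodically and, on `False`, capture diagnostics (netstat, ps) before restarting the server.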

We're going to start using WP-mu at work so I've been looking into it 
lately and the problem seems to be browser-server connections stuck on 
the CLOSE_WAIT state. With netstat -nap I get loads of these:

$ netstat -nap | grep CLOSE_WAIT
tcp        1      0 10.10.10.10:80        1.2.3.4:52132     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52133     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:50857     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:51348     CLOSE_WAIT 27673/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:50846     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52126     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:52354     CLOSE_WAIT 27672/nginx: worker
[...]

Where 10.10.10.10 is the web server and 1.2.3.4 the browser. Right now I
have 67 of these after having restarted nginx and doing some admin stuff
on wp for a couple of minutes (CPU-intensive stuff: uploading, scaling
and watermarking images with the NextGEN Gallery plugin).
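
To see how the CLOSE_WAIT sockets are spread over the nginx workers, the netstat output can be tallied per owning process. The following is a minimal sketch: the function name `close_wait_by_process` is hypothetical, and it assumes unwrapped `netstat -nap` output with one socket per line, as netstat prints it on a terminal. The sample rows are taken from the output above.

```python
from collections import Counter


def close_wait_by_process(netstat_text):
    """Count CLOSE_WAIT TCP sockets per 'PID/Program name' column of `netstat -nap` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        # Columns: Proto Recv-Q Send-Q Local Foreign State PID/Program (may contain spaces)
        fields = line.split(None, 6)
        if len(fields) == 7 and fields[0].startswith("tcp") and fields[5] == "CLOSE_WAIT":
            counts[fields[6].strip()] += 1
    return counts


sample = """\
tcp        1      0 10.10.10.10:80        1.2.3.4:52132     CLOSE_WAIT 27672/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:51348     CLOSE_WAIT 27673/nginx: worker
tcp        1      0 10.10.10.10:80        1.2.3.4:50846     CLOSE_WAIT 27672/nginx: worker
tcp        0      0 127.0.0.1:9000        0.0.0.0:*         LISTEN     27662/php5-fpm
"""

for owner, count in close_wait_by_process(sample).most_common():
    print(count, owner)
```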

The connections between nginx and php don't seem to get stuck; they go
from active to TIME_WAIT and disappear from netstat normally. They don't
get stuck in the CLOSE_WAIT state:

$ netstat -nap | grep :9000
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN      27662/php5-fpm
tcp        0      0 127.0.0.1:9000          127.0.0.1:52917         TIME_WAIT   -
[...]

On Friday I moved from spawn-fcgi+php-cgi to php-fpm, to no avail. I've
noticed some entries like these in php5-fpm.log at the moments when I'm
working with wp and CLOSE_WAIT connections start to pile up:

Feb 21 10:48:45.080836 [NOTICE] fpm_got_signal(), line 48: received SIGCHLD
Feb 21 10:48:45.080918 [NOTICE] fpm_children_bury(), line 217: child 27665 (pool default) exited with code 0 after 35512.611171 seconds from start
Feb 21 10:48:45.089499 [NOTICE] fpm_children_make(), line 354: child 30370 (pool default) started

So I *guess* there might be a connection between the two. Anyway, this is
not a 1:1 ratio: right now I have 5 of those php SIGCHLD entries and 67
sockets in CLOSE_WAIT with nginx. And the php SIGCHLD entries relate to
moments when I got an error on wp (failed creating a thumbnail), while the
CLOSE_WAIT connections are not related to application or connectivity
errors.

I'm almost sure that, although the CLOSE_WAIT sockets belong to the
browser-nginx connections, the problem lies in the nginx-php
connection. At work we have a farm of nginx+Tomcat servers (via
proxy_pass, not fastcgi_pass) and I haven't seen this behavior. And I
think it has to do with PHP CPU use, as the site usually went down when
hit simultaneously by a couple of visits and some search engines' spiders,
and now I'm able to reproduce it by scaling and watermarking pics.
But I don't know where else to look.

Has anybody else seen this behaviour?

Thanks in advance

Regards

--
  Vicente Aguilar <bisente@bisente.com> | http://www.bisente.com
Posted by Maxim Dounin (Guest)
on 2010-02-21 20:55
(Received via mailing list)
Hello!

On Sun, Feb 21, 2010 at 11:19:48AM +0100, Vicente Aguilar wrote:

> into it lately and the problem seems to be browser-server 
> connections stuck on the CLOSE_WAIT state. With netstat -nap I 
> get loads of these:
> 
> $ netstat -nap | grep CLOSE_WAIT
> tcp        1      0 10.10.10.10:80        1.2.3.4:52132     
> CLOSE_WAIT  27672/nginx: worker

[...]

What does nginx -V show?  What's in config?

Maxim Dounin
Posted by Vicente Aguilar (Guest)
on 2010-02-22 07:47
(Received via mailing list)
> 
> What does nginx -V show?  What's in config?


You're right, should have started there. Sorry. :-)

$ nginx -V
nginx version: nginx/0.7.65
TLS SNI support enabled
configure arguments: --conf-path=/etc/nginx/nginx.conf 
--error-log-path=/var/log/nginx/error.log --pid-path=/var/run/nginx.pid 
--lock-path=/var/lock/nginx.lock 
--http-log-path=/var/log/nginx/access.log 
--http-client-body-temp-path=/var/lib/nginx/body 
--http-proxy-temp-path=/var/lib/nginx/proxy 
--http-fastcgi-temp-path=/var/lib/nginx/fastcgi --with-debug 
--with-http_stub_status_module --with-http_flv_module 
--with-http_ssl_module --with-http_dav_module 
--with-http_gzip_static_module --with-http_sub_module --with-mail 
--with-mail_ssl_module --with-ipv6 --with-http_perl_module 
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/nginx-upstream-fair 
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_upstream_memcached_hash_module-0.04 
--add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_secure_download 
--with-http_proxy_s3_auth

some bits of nginx config:

worker_processes  4;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
    multi_accept on;
}

http {
    include       /etc/nginx/mime.types;

    access_log  /var/log/nginx/access.log;

    sendfile        on;
    tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  20;
    keepalive_requests 50;
    tcp_nodelay        on;

    gzip  on;

client_max_body_size 32m;

  gzip_static on;

  gzip_http_version   1.1;
  gzip_proxied        expired no-cache no-store private auth;
  gzip_disable        "MSIE [1-6]\.";
  gzip_vary           on;

    location ~ .php$ {
        fastcgi_split_path_info ^(.+\.php)(.*)$;
        fastcgi_pass   127.0.0.1:9000;
        fastcgi_index  index.php;
        include fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
        fastcgi_param SERVER_NAME $http_host;
       fastcgi_param  QUERY_STRING     $query_string;
        fastcgi_param  REQUEST_METHOD   $request_method;
        fastcgi_param  CONTENT_TYPE     $content_type;
        fastcgi_param  CONTENT_LENGTH   $content_length;
        fastcgi_intercept_errors        on;
        fastcgi_ignore_client_abort on;
        fastcgi_connect_timeout 60;
        fastcgi_send_timeout 180;
        fastcgi_read_timeout 180;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 4 256k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
    }


All the rest are the usual locations, rewrites, etc.

I have tried several different combinations of worker_processes,
sendfile, nopush, keepalive_* ... to no avail. Sometimes it takes longer
to hang, but it always ends up not responding with > 100 connections in
CLOSE_WAIT. Killing nginx, waiting a couple of seconds for the
connections in CLOSE_WAIT to disappear from netstat and starting nginx
again fixes the issue. No need to restart the PHP processes.

Bye

--
  Vicente Aguilar <bisente@bisente.com> | http://www.bisente.com
Posted by Maxim Dounin (Guest)
on 2010-02-22 11:57
(Received via mailing list)
Hello!

On Mon, Feb 22, 2010 at 07:46:24AM +0100, Vicente Aguilar wrote:

> > 
> --pid-path=/var/run/nginx.pid --lock-path=/var/lock/nginx.lock 
> --add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_upstream_memcached_hash_module-0.04 
> --add-module=/usr/src/debian/nginx/nginx-0.7.65/modules/ngx_http_secure_download 
> --with-http_proxy_s3_auth

Try compiling without third party modules and patches and check if
you are able to reproduce the problem.  But see below for a
simpler test to do before this one.

[...]

>         fastcgi_ignore_client_abort on;
>         fastcgi_connect_timeout 60;
>         fastcgi_send_timeout 180;
>         fastcgi_read_timeout 180;

You are ignoring client aborts, and have relatively large timeouts
set for fastcgi.  Are you sure the connections in question don't
disappear as soon as your fastcgi backend finishes preparing the
response?  I.e. check if any particular connection stays for at
least 5 minutes or so.

Additionally, check if you are able to reproduce the problem with
fastcgi_ignore_client_abort off.

[...]

> Have tried with several different combinations of 
> worker_processes, sendfile, nopush, keepalive_* ... to no avail. 
> Sometimes it takes longer to hang, but it always end up not 
> responding with > 100 connections in CLOSE_WAIT. Killing nginx, 
> waiting a couple of seconds for the connections in CLOSE_WAIT to 
> disappear from netstat and starting nginx again fixes the issue. 
> No need to restart the PHP processes.

Not responding just because of 100 connections seems strange for
nginx even with worker_connections 1024, so I suspect you just run
out of php processes and CLOSE_WAIT's are because of
fastcgi_ignore_client_abort.

Maxim Dounin
Posted by Vicente Aguilar (Guest)
on 2010-02-22 12:21
(Received via mailing list)
Hi

> Additionally, check if you are able to reproduce the problem with 
> fastcgi_ignore_client_abort off.

That was my current config which I copied from a site discussing 
php-fpm. My initial fastcgi config was:

  location ~ .php$ {
    # By all means use a different server for the fcgi processes if you
    # need to
#    fastcgi_pass   unix:/tmp/php-fastcgi.sock;
    fastcgi_pass   127.0.0.1:9000;
    fastcgi_index  index.php;
    fastcgi_param  SCRIPT_FILENAME /var/www/$host/$fastcgi_script_name;
    include /etc/nginx/fastcgi_params;
    fastcgi_intercept_errors on;
  }

And also had the problem.

> Not responding just because of 100 connections seems strange for 
> nginx even with worker_connections 1024, so I suspect you just run 
> out of php processes and CLOSE_WAIT's are because of 
> fastcgi_ignore_client_abort.

That's what I think too, but there are no stuck PHP connections in
netstat. Whenever a PHP page is loaded I get some nginx-PHP sockets but
they all close OK, none gets stuck. Only on the client-nginx end can I
see this behavior with netstat.

Strange.
Posted by Maxim Dounin (Guest)
on 2010-02-22 13:21
(Received via mailing list)
Hello!

On Mon, Feb 22, 2010 at 12:20:26PM +0100, Vicente Aguilar wrote:

> > least 5 minutes or so.
>     fastcgi_pass   127.0.0.1:9000;
>     fastcgi_index  index.php;
>     fastcgi_param  SCRIPT_FILENAME /var/www/$host/$fastcgi_script_name;
>     include /etc/nginx/fastcgi_params;
>     fastcgi_intercept_errors on;
>   }
> 
> And also had the problem.

Did you also have CLOSE_WAIT sockets with
fastcgi_ignore_client_abort off?

> > Not responding just because of 100 connections seems strange 
> > for nginx even with worker_connections 1024, so I suspect you 
> > just run out of php processes and CLOSE_WAIT's are because of 
> > fastcgi_ignore_client_abort.
> 
> That's what I think too, but there are no stuck PHP connections 
> in netstat. Whenever a PHP page is loaded I got some nginx-PHP 
> sockets but they all close OK, none gets stuck. Only on the 
> client-nginx end is where I can see this behavior with netstat. 

By "stuck" you mean sockets in CLOSE_WAIT state?  It's expected
that there are no CLOSE_WAIT sockets between nginx and php.

Maxim Dounin
Posted by Vicente Aguilar (Guest)
on 2010-02-22 20:26
(Received via mailing list)
Hi

> Do you also had CLOSE_WAIT sockets which 
> fastcgi_ignore_client_abort off?

I changed it to off after your previous mail and still haven't seen a
single CLOSE_WAIT socket. Anyway the site doesn't really have that much
traffic and today I haven't tried to stress it. Will keep an eye on it
and tomorrow will try to reproduce it again.

> By "stuck" you mean sockets in CLOSE_WAIT state?  It's expected 
> that there is no CLOSE_WAIT sockets between nginx and php.


Well, I thought that if the CLOSE_WAIT sockets on the browser-nginx end 
are really caused by PHP, there should be another somehow blocked 
connection on the nginx-PHP end, but that's not the case.

Regards

--
  Vicente Aguilar <bisente@bisente.com> | http://www.bisente.com
Posted by Maxim Dounin (Guest)
on 2010-02-22 23:00
(Received via mailing list)
Hello!

On Mon, Feb 22, 2010 at 08:25:20PM +0100, Vicente Aguilar wrote:

> 
> > By "stuck" you mean sockets in CLOSE_WAIT state?  It's 
> > expected that there is no CLOSE_WAIT sockets between nginx and 
> > php.
> 
> Well, I thought that if the CLOSE_WAIT sockets on the 
> browser-nginx end are really caused by PHP, there should be 
> another somehow blocked connection on the nginx-PHP end, but 
> that's not the case.

There will be another connection between nginx and php, but it
will be live (in ESTABLISHED state).  It will wait for php to
finish processing the request and send the response back to nginx.
Once php is done - nginx will close both nginx-php and
client-nginx connections.

Maxim Dounin
Posted by Vicente Aguilar (Guest)
on 2010-02-25 10:12
(Received via mailing list)
Hi

> There will be another connection between nginx and php, but it 
> will be live (in ESTABLISHED state).  It will wait for php to 
> finish request processing and sending response back to nginx.  
> Once php is done - nginx will close both nginx-php and 
> client-nginx connections.


That's what I thought, but I've got no nginx-php connections in
ESTABLISHED state when a client-nginx connection goes to CLOSE_WAIT.

This is what I get now on a netstat -nap:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN      13040/php5-fpm
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      6731/mysqld
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      16936/nginx
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      881/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      1416/exim4
tcp      324      0 10.10.10.10:80          66.249.68.247:36979     ESTABLISHED -
tcp      113      0 127.0.0.1:80            127.0.0.1:39229         CLOSE_WAIT  -
tcp      335      0 10.10.10.10:80          65.55.207.102:47057     ESTABLISHED -
tcp      116      0 10.10.10.10:80          83.170.113.102:60149    ESTABLISHED -
tcp      413      0 10.10.10.10:80          81.52.143.26:51658      CLOSE_WAIT  -
tcp      117      0 10.10.10.10:80          83.170.113.102:56117    CLOSE_WAIT  -
tcp        1      0 10.10.10.10:80          66.249.68.247:62666     CLOSE_WAIT  16940/nginx: worker
tcp        1      0 10.10.10.10:80          66.249.68.247:37085     CLOSE_WAIT  16939/nginx: worker
tcp        1      0 10.10.10.10:80          66.249.68.247:51700     CLOSE_WAIT  16938/nginx: worker
tcp        1      0 10.10.10.10:80          67.195.115.83:53503     CLOSE_WAIT  16937/nginx: worker
tcp      117      0 10.10.10.10:80          74.52.50.50:52833       CLOSE_WAIT  -
tcp      483      0 10.10.10.10:80          213.171.250.126:35648   ESTABLISHED -
tcp        0    288 10.10.10.10:22          193.145.230.6:50184     ESTABLISHED 17110/0
tcp        1      0 10.10.10.10:80          87.68.237.93:51575      CLOSE_WAIT  16939/nginx: worker
tcp      244      0 10.10.10.10:80          123.125.66.48:22675     CLOSE_WAIT  -
tcp        1      0 10.10.10.10:80          193.145.230.6:50134     CLOSE_WAIT  16938/nginx: worker
tcp      116      0 10.10.10.10:80          74.52.50.50:55178       ESTABLISHED -
tcp6       0      0 :::22                   :::*                    LISTEN      881/sshd
udp        0      0 10.10.10.10:53          0.0.0.0:*                           1515/tinydns
udp        0      0 0.0.0.0:68              0.0.0.0:*                           769/dhclient3
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ACC ]     STREAM     LISTENING     481377   6731/mysqld         /var/run/mysqld/mysqld.sock
unix  3      [ ]         DGRAM                    601444   867/rsyslogd        /dev/log
unix  2      [ ]         DGRAM                    673      298/udevd           @/org/kernel/udev/udevd
unix  2      [ ]         DGRAM                    611850   17110/0
unix  3      [ ]         STREAM     CONNECTED     609800   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609799   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609797   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609796   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609794   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609793   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609791   16936/nginx
unix  3      [ ]         STREAM     CONNECTED     609790   16936/nginx
unix  2      [ ]         DGRAM                    533124   769/dhclient3
unix  2      [ ]         DGRAM                    481379   6735/logger
unix  3      [ ]         STREAM     CONNECTED     324617   27662/php5-fpm
unix  3      [ ]         STREAM     CONNECTED     324616   27662/php5-fpm
unix  3      [ ]         STREAM     CONNECTED     324613   27662/php5-fpm
unix  3      [ ]         STREAM     CONNECTED     324612   27662/php5-fpm


11 client-nginx connections in CLOSE_WAIT, no nginx-php connections. 
Unless they're the last four unix domain sockets connections, but I've 
configured nginx to use 127.0.0.1:9000 as the fcgi server.

How can I debug what these CLOSE_WAIT connections were doing, which 
request were they serving? Anything I can activate on the logs or on 
nginx-status, a la Apache's extended server-status?

Thanks
Posted by Maxim Dounin (Guest)
on 2010-02-25 13:59
(Received via mailing list)
Hello!

On Thu, Feb 25, 2010 at 10:11:23AM +0100, Vicente Aguilar wrote:

>
> tcp      113      0 127.0.0.1:80            127.0.0.1:39229         CLOSE_WAIT  -
> tcp        0    288 10.10.10.10:22        193.145.230.6:50184     ESTABLISHED 17110/0
> unix  3      [ ]         DGRAM                    601444   867/rsyslogd        /dev/log
> unix  2      [ ]         DGRAM                    533124   769/dhclient3
> the fcgi server.

From the output you provided it looks like all nginx workers are
locked out, either doing something or waiting for some system
resources.  As you can see - all connections accepted by nginx (6
connections which have nginx process listed in pid column) are in
CLOSE_WAIT state, and there are other connections to port 80 which
are sitting in listen queue.  Am I right in the assumption that
nginx does not answer any requests?
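
The distinction drawn here (connections a worker has accepted versus connections still waiting in the listen queue) can be read off netstat output mechanically. A minimal sketch follows, assuming netstat ran with enough privileges to see socket owners, so that a `-` in the PID column of an ESTABLISHED or CLOSE_WAIT row means no process has accepted the connection yet. The function name `split_by_owner` is hypothetical and the sample rows are a subset of the output quoted in this thread.

```python
def split_by_owner(netstat_text, local_port=80):
    """Split TCP connections on `local_port` into (accepted, unowned) foreign addresses.

    'accepted' rows list a PID/Program name; 'unowned' rows show '-' instead.
    Only ESTABLISHED and CLOSE_WAIT rows are considered, since states such as
    TIME_WAIT never have an owning process.
    """
    accepted, unowned = [], []
    for line in netstat_text.splitlines():
        fields = line.split(None, 6)
        if len(fields) != 7 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state, owner = fields[3], fields[4], fields[5], fields[6].strip()
        if state not in ("ESTABLISHED", "CLOSE_WAIT"):
            continue
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        (unowned if owner == "-" else accepted).append(foreign)
    return accepted, unowned


sample = """\
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      16936/nginx
tcp      324      0 10.10.10.10:80          66.249.68.247:36979     ESTABLISHED -
tcp      413      0 10.10.10.10:80          81.52.143.26:51658      CLOSE_WAIT  -
tcp        1      0 10.10.10.10:80          66.249.68.247:62666     CLOSE_WAIT  16940/nginx: worker
tcp        0    288 10.10.10.10:22          193.145.230.6:50184     ESTABLISHED 17110/0
"""

accepted, unowned = split_by_owner(sample)
print("accepted by a worker:", accepted)
print("not yet accepted:", unowned)
```

If the unowned list keeps growing while the accepted connections never change, the workers are not getting back to their event loop.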

You have to examine nginx workers to find out what they are doing.
Try starting from top, truss, gdb and examining your system logs.

Note well: you haven't posted full config you use, so please check
yourself for possible loops in it.  I've recently posted some
patches which take care of several loops which aren't automatically
resolved now, see here for patch and example loops:

http://nginx.org/pipermail/nginx-devel/2010-January/000099.html

It should be trivial to find if it's the cause though, as nginx
worker will eat 100% cpu once caught in such loop.

Note well 2: I've already asked you to try compiling without third
party modules and patches and check if you are able to reproduce
the problem.  It doesn't really make sense to proceed any further
without doing this.

> How can I debug what these CLOSE_WAIT connections were doing, 
> which request were they serving? Anything I can activate on the 
> logs or on nginx-status, a la Apache's extended server-status?

You have to enable debug log (see
http://nginx.org/en/docs/debugging_log.html).  Then it will be
possible to map fd number to the particular request (and its full
logs).  Under Linux it should be possible to find out fd number of
the particular connection via lsof -p <pid-of-nginx-worker>.
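
Where lsof is not installed, roughly the same fd listing can be read from /proc on Linux. This is a minimal sketch under that assumption: the function name `socket_fds` is hypothetical, and reading another user's process (such as an nginx worker) normally requires root.

```python
import os


def socket_fds(pid):
    """Map fd number -> socket inode for process `pid`, read from /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were listing
        # Socket descriptors appear as symlinks of the form "socket:[<inode>]"
        if target.startswith("socket:[") and target.endswith("]"):
            result[int(name)] = int(target[len("socket:["):-1])
    return result


if __name__ == "__main__":
    # Demonstration on the current process; pass a worker's PID in real use.
    print(socket_fds(os.getpid()))
```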

Maxim Dounin
Posted by Vicente Aguilar (Guest)
on 2010-02-25 14:57
(Received via mailing list)
Hi

>> From the output you provided it looks like all nginx workers are 
> locked out, either doing something or waiting for some system 
> resources.  As you can see - all connections accepted by nginx (6 
> connections which have nginx process listed in pid column) are in 
> CLOSE_WAIT state, and there are other connections to port 80 which 
> are sitting in listen queue.  Am I right in the assumption that 
> nginx does not answer any requests?

Yes, that's the issue. nginx becomes unresponsive at this point until I 
restart it.

> Note well: you haven't posted full config you use, so please check 
> yourself for possible loops in it.  I've recently posted some 
> patches which take care of several loops which aren't automatically 
> resolved now, see here for patch and example loops:
> 
> http://nginx.org/pipermail/nginx-devel/2010-January/000099.html
> 
> It should be trivial to find if it's the cause though, as nginx 
> worker will eat 100% cpu once caught in such loop.

I have a monitoring script that detects these situations (wget can't 
download from localhost with a 20s timeout) and restarts nginx, but 
before that it captures a netstat -nap, ps and other system metrics. 
This is an example of what ps shows:

www-data 24610  0.0  0.1   7476  2452 ?        S    07:44   0:00 nginx: worker process
www-data 24611  0.0  0.1   7668  2412 ?        S    07:44   0:00 nginx: worker process
www-data 24612  0.0  0.1   7668  2416 ?        S    07:44   0:00 nginx: worker process
www-data 24613  0.0  0.1   7736  2624 ?        S    07:44   0:00 nginx: worker process

And vmstat:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0    440 157012 181076 1180340    0    0     2    32   27   46  2  0 95  0
 0  0    440 156904 181076 1180340    0    0     0     0   26   28  2  0 94  0
 0  0    440 156888 181076 1180348    0    0     0     0   13   24  0  0 100  0
 0  0    440 156888 181076 1180348    0    0     0     0   12   21  0  0 100  0
 0  0    440 156888 181080 1180348    0    0     0   128   22   34  0  0 99  1

So the nginx processes don't seem to be in a loop, CPU use is 
negligible.

> Note well 2: I've already asked you to try compiling without third 
> party modules and patches and check if you are able to reproduce 
> the problem.  It doesn't really make sense to proceed any further 
> without doing this.

I have to admit I still haven't tried this, sorry. :) Will try.

> You have to enable debug log (see 
> http://nginx.org/en/docs/debugging_log.html).  Then it will be 
> possible to map fd number to the particular request (and it's full 
> logs).  Under linux it should be possible to find out fd number of 
> the particular connection via lsof -p <pid-of-nginx-worker>.

Will look into this too and get that info on the monitoring script. Can 
you think of any other system parameter that can be useful to monitor in 
these cases?

Thanks a lot Maxim. You're being really helpful. :-)

Regards
Posted by Benjamin Pineau (Guest)
on 2010-02-25 15:50
(Received via mailing list)
Vicente Aguilar a écrit :

> tcp      117      0 10.10.10.10:80        83.170.113.102:56117  
>  CLOSE_WAIT  -

Out of curiosity, did you try switching the event module to
anything but the default (epoll) ?
ie. something like:
events {
  use select;
}

The accumulating non-empty Recv-Qs and the pending CLOSE_WAITs (i.e.
close() never triggered on the server side) are typical symptoms of
race conditions when using an edge-triggered I/O interface...
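
For readers unsure what CLOSE_WAIT means: it is the state of a TCP socket whose peer has already closed its side while the local application has not yet called close(). A minimal loopback sketch (the function name `peer_closed_first` is hypothetical, and it shows the general TCP behaviour only, nothing nginx-specific):

```python
import socket


def peer_closed_first():
    """Accept one loopback connection, let the client close first, return what the server reads."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _addr = listener.accept()

    client.close()                         # the client sends its FIN
    data = server_side.recv(1024)          # returns b"" once the FIN has arrived

    # From the arrival of the FIN until the close() below, the kernel reports
    # server_side as CLOSE_WAIT. A server that never gets to close() leaves it there.
    server_side.close()
    listener.close()
    return data


if __name__ == "__main__":
    print(repr(peer_closed_first()))
```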
Posted by Vicente Aguilar (Guest)
on 2010-02-26 08:29
(Received via mailing list)
Hi

> Note well 2: I've already asked you to try compiling without third 
> party modules and patches and check if you are able to reproduce 
> the problem.  It doesn't really make sense to proceed any further 
> without doing this.


I've gone back to the original Debian Lenny package (0.6.32-3+lenny3, I 
was using a patched 0.7.65) and in ~14h have had no issues at all, 0 
sockets in the CLOSE_WAIT state.

I'm going to leave it like this the whole weekend and see, but it seems 
there was some issue with the 0.7 release or some of the patches I was 
using. Funny thing is, I'm using that same binary at work but with 
Tomcat instead of PHP and had no issues at all. Anyway, next week I'll 
upgrade patch by patch and try to guess which one was causing the 
problem.

Not sure if this could be related to the event module as Benjamin 
suggested. I'm using the default one (nothing on nginx.conf), I've tried 
to change it but nginx always failed to start. I'm not sure which ones 
are compiled in ATM, will try to find that out today.

I'll tell you when I find out what the cause of the problem was, just 
for the sake of having it documented and showing up on Google in case 
somebody else hits the same issue. :-)

Thanks again
Posted by Vicente Aguilar (Guest)
on 2010-02-26 09:25
(Received via mailing list)
Hi

> Not sure if this could be related to the event module as Benjamin suggested. I'm using the default one (nothing on nginx.conf), I've tried to change it but nginx always failed to start. I'm not sure which ones are compiled in ATM, will try to find that out today.


Using the epoll event module with both nginx 0.6.32 and 0.7.65. No other 
event modules compiled in.

BTW the server is an Amazon EC2 instance, not sure if that might affect 
things or if in this case some event module is better than other. :-?

Regards
Posted by Vicente Aguilar (Guest)
on 2010-02-26 20:37
(Received via mailing list)
Hi

> I'll tell you when I find out what the cause of the problem was, just for the sake of having it documented and showing up on Google in case somebody else hits the same issue. :-)


I *think* I might have found what's going on:

On my blog I have some sample scripts for running several servers with 
daemontools, and some of them are browsable (autoindex on) with the full 
daemontools directories structure. If you've worked with daemontools 
you'll know it uses some named pipes (fifos) ... :)

I've tracked several different processes with sockets in CLOSE_WAIT in
the debug log, and the last line for all of them was an access to one of
these fifos. I've tried requesting those files on a freshly restarted
nginx and have reproduced the issue: each GET to one of the fifos always
produced one or sometimes two CLOSE_WAIT sockets. So in the end it seems
it had nothing to do with PHP.
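
The thread does not establish exactly where the workers got stuck, but one plausible mechanism is standard POSIX behaviour: a blocking read-only open() of a FIFO does not return until some process opens the write end. The sketch below demonstrates only that general behaviour, not nginx internals; the helper name `open_blocks` is hypothetical.

```python
import os
import threading


def open_blocks(path, wait=0.5):
    """Return True if a plain read-only open() of `path` is still blocked after `wait` seconds."""
    done = threading.Event()

    def reader():
        fd = os.open(path, os.O_RDONLY)    # on a FIFO with no writer this call blocks
        done.set()
        os.close(fd)

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    blocked = not done.wait(wait)
    if blocked:
        # Release the stuck reader by briefly opening the write end.
        write_fd = os.open(path, os.O_WRONLY)
        os.close(write_fd)
    thread.join()
    return blocked
```

A single-threaded, event-driven worker that blocks like this cannot service any of its other connections, which would be consistent with the idle but unresponsive workers seen earlier in the thread.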

I've removed all the fifos from my site and will keep running nginx
0.6.32 during the weekend. If I have no more issues (I'm pretty
confident now I won't), on Monday I'll go back to my patched 0.7.65.

Configuring the debug log has been crucial here. Thanks for that tip, 
Maxim. :)

On a side note: while I agree this was my fault (I shouldn't have had
those "empty" pipes there in the first place), neither Apache nor
lighttpd had any problems with this. Or maybe they had, but some internal
process-cleaning was in place and these stuck processes were being
silently killed, I don't know. In any case, as I've stated before, I
think the problem is not really nginx but the fifos that shouldn't be
there.

Regards

--
  Vicente Aguilar <bisente@bisente.com> | http://www.bisente.com
Posted by Vicente Aguilar (Guest)
on 2010-03-03 14:29
(Received via mailing list)
Hi

> I've tracked several different processes with sockets on CLOSE_WAIT on the debug log and the last line of all of them was accessing one of these fifos. I've tried requesting those files on a freshly restarted nginx and have reproduced the issue: each GET to one of the fifos always produced one or sometimes two CLOSE_WAIT sockets. So in the end it seems it had nothing to do with PHP.


I can confirm the problem was because of the FIFOs. I've had no
CLOSE_WAIT sockets at all since last Friday when I removed them: over
the weekend with nginx 0.6.32 and since Monday morning with my patched
0.7.65.
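
For anyone who wants to check a document root for leftover named pipes, a minimal sketch follows. The function name `find_fifos` is hypothetical and the path in the usage note is only an example; the scan uses stat() and never opens the files, so it cannot itself block on a FIFO.

```python
import os
import stat


def find_fifos(root):
    """Return sorted paths of all named pipes (FIFOs) below `root`, without opening any file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # stat() follows symlinks, as a web server opening the path would
                mode = os.stat(path).st_mode
            except OSError:
                continue                   # vanished, unreadable, or dangling symlink
            if stat.S_ISFIFO(mode):
                found.append(path)
    return sorted(found)
```

For example, `find_fifos("/var/www")` would list every FIFO reachable under that directory.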

Thanks to everyone who helped me debug this, especially to Maxim.

Regards