Cache manager process exited with fatal code 2 and cannot be respawned

Hi,

after restarting nginx I find

2012/11/07 10:24:02 [alert] 23635#0: 512 worker_connections are not
enough
2012/11/07 10:24:02 [alert] 23636#0: 512 worker_connections are not
enough
2012/11/07 10:24:04 [alert] 23618#0: cache manager process 23635 exited
with fatal code 2 and cannot be respawned

in my logs. It seems like this error came up after adding more than 2500
virtual hosts, each consisting of two server blocks, one for http and
one for https.

Now I don’t quite understand these messages. In my nginx.conf I have
user www-data;
worker_processes 16;
pid /var/run/nginx.pid;
worker_rlimit_nofile 65000;

events {
worker_connections 2000;
use epoll;
# multi_accept on;
}

so that should be enough worker_connections. Why am I still getting this
message?

For the other message regarding the cache manager, I found this
http://www.ruby-forum.com/topic/519162
thread, where Maxim D. suggests that it results from the kernel not
supporting eventfd(). But as far as I understand, this is only an issue
with kernels before 2.6.18. I use 2.6.32 and my kernel config clearly
states
CONFIG_EVENTFD=y

Here is the nginx version and configure options:
root@debian:~# nginx -V
nginx version: nginx/1.2.4
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx/ --sbin-path=/usr/sbin/nginx
--conf-path=/etc/nginx/nginx.conf
--error-log-path=/var/log/nginx/error.log
--http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid
--lock-path=/var/run/nginx.lock
--http-client-body-temp-path=/var/cache/nginx/client_temp
--http-proxy-temp-path=/var/cache/nginx/proxy_temp
--http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp
--http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp
--http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx
--group=nginx --with-http_ssl_module --with-http_realip_module
--with-http_addition_module --with-http_sub_module
--with-http_dav_module --with-http_flv_module --with-http_mp4_module
--with-http_gzip_static_module --with-http_random_index_module
--with-http_secure_link_module --with-http_stub_status_module
--with-mail --with-mail_ssl_module --with-file-aio --with-ipv6

Any ideas?

Isaac

So I also tried version 1.2.1 from the Debian backports, which produced
the same error.

I tried on openSUSE 12.2, which worked fine:

nginx version: nginx/1.0.15
built by gcc 4.7.1 20120713 [gcc-4_7-branch revision 189457] (SUSE
Linux)
TLS SNI support enabled
configure arguments: --prefix=/usr/ --sbin-path=/usr/sbin/nginx
--conf-path=/etc/nginx/nginx.conf
--error-log-path=/var/log/nginx/error.log
--http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid
--lock-path=/var/run/nginx.lock
--http-client-body-temp-path=/var/lib/nginx/tmp/
--http-proxy-temp-path=/var/lib/nginx/proxy/
--http-fastcgi-temp-path=/var/lib/nginx/fastcgi/
--http-uwsgi-temp-path=/var/lib/nginx/uwsgi/
--http-scgi-temp-path=/var/lib/nginx/scgi/ --user=nginx --group=nginx
--with-rtsig_module --with-select_module --with-poll_module --with-ipv6
--with-file-aio --with-http_ssl_module --with-http_realip_module
--with-http_addition_module --with-http_xslt_module
--with-http_image_filter_module --with-http_geoip_module
--with-http_sub_module --with-http_dav_module --with-http_flv_module
--with-http_gzip_static_module --with-http_random_index_module
--with-http_secure_link_module --with-http_degradation_module
--with-http_stub_status_module --with-http_perl_module
--with-perl=/usr/bin/perl --with-mail --with-mail_ssl_module --with-pcre
--with-libatomic --add-module=passenger/ext/nginx --with-md5=/usr
--with-sha1=/usr --with-cc-opt='-fmessage-length=0 -O2 -Wall
-D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables
-fasynchronous-unwind-tables -g -fstack-protector'

So could it be that this is an issue with the 1.2 series?

Isaac

On Nov 7, 2012, at 13:49, Isaac H. wrote:

Now I don’t quite understand these messages. In my nginx.conf I have

so that should be enough worker_connections. Why am I still getting this
message?

For the other message regarding the cache manager, I found this
http://www.ruby-forum.com/topic/519162
thread, where Maxim D. suggests that it results from the kernel not
supporting eventfd(). But as far as I understand, this is only an issue with
kernels before 2.6.18. I use 2.6.32 and my kernel config clearly states
CONFIG_EVENTFD=y

These messages have no relation to eventfd().

The process with pid 23636 is probably the cache loader. Neither the
cache manager nor the cache loader uses the configured
worker_connections number, since they do not process connections at
all. However, they each need one connection slot to communicate with
the master process.

512 connections may be taken by listen directives if they use different
addresses, or by resolvers if you defined a resolver in every virtual
host. A quick workaround is to define just a single resolver at the
http level.
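For illustration, a minimal sketch of that workaround (the hostnames,
port and resolver address are placeholders, not taken from the original
config):

http {
     # One shared resolver for all virtual hosts: it takes a single
     # connection slot instead of one slot per server block.
     resolver 127.0.0.1;

     server {
          listen 8080;
          server_name host1.example.com;
     }

     server {
          listen 8080;
          server_name host2.example.com;
          # no per-server "resolver" directive here
     }
}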


Igor S.

So could it be that this is an issue with the 1.2 series?
OK, this is not the case: I tried 1.0.15 built by hand on Debian, and
have the same issue.

Isaac

These messages have no relation to eventfd().

The process with pid 23636 is probably the cache loader. Neither the
cache manager nor the cache loader uses the configured
worker_connections number, since they do not process connections at
all. However, they each need one connection slot to communicate with
the master process.

512 connections may be taken by listen directives if they use different
addresses, or by resolvers if you defined a resolver in every virtual
host. A quick workaround is to define just a single resolver at the
http level.
Hm, there were no resolvers defined in the virtual hosts. But I tried
to add
resolver 127.0.0.1;
to my http section, but that did not help.

Also, if resolvers were the problem, it should also happen with
other nginx builds, like the one I tested on openSUSE; see my reply
earlier today.

Here is my config, including one vhost:

user www-data;
worker_processes 16;
pid /var/run/nginx.pid;
worker_rlimit_nofile 65000;

events {
use epoll;
worker_connections 2000;
# multi_accept on;
}

http {

     ##
     # Basic Settings
     ##

     sendfile on;
     tcp_nopush on;
     tcp_nodelay on;
     keepalive_timeout 65;
     types_hash_max_size 2048;
     # server_tokens off;

     # server_names_hash_bucket_size 64;
     # server_name_in_redirect off;

     include /etc/nginx/mime.types;
     default_type application/octet-stream;

     ##
     # Logging Settings
     ##

     access_log /var/log/nginx/access.log;
     error_log /var/log/nginx/error.log debug;
     #error_log /var/log/nginx/error.log;

     ##
     # Gzip Settings
     ##

     gzip on;
     gzip_disable "msie6";

     # gzip_vary on;
     # gzip_proxied any;
     # gzip_comp_level 6;
     # gzip_buffers 16 8k;
     # gzip_http_version 1.1;
     # gzip_types text/plain text/css application/json application/x-javascript
     #            text/xml application/xml application/xml+rss text/javascript;

     # Because we have a lot of server_names, we need to increase
     # server_names_hash_bucket_size
     server_names_hash_max_size 32000;
     server_names_hash_bucket_size 1024;

     # raise default values for php
     client_max_body_size 20M;
     client_body_buffer_size 128k;

     ##
     # Virtual Host Configs
     ##
     include /etc/nginx/conf.d/*.conf;
     include /var/www3/acme_cache/load_balancer/upstream.conf;
     include /etc/nginx/sites-enabled/*;

     index index.html index.htm ;

     ##
     # Proxy Settings
     ##

     # include hostname in request to backend
     proxy_set_header Host $host;

     # only honor internal Caching policies
     proxy_ignore_headers X-Accel-Expires Expires Cache-Control;

     # hopefully fixes an issue with cache manager dying
     resolver 127.0.0.1;

}

Then in /etc/nginx/sites-enabled/ there is, e.g.,
server
{
server_name www.acme.eu acmeblabla.eu;
listen 45100;
ssl on;
ssl_certificate /etc/nginx/ssl/acme_eu.crt;
ssl_certificate_key /etc/nginx/ssl/acme_eu.key;
access_log /var/log/www/m77/acmesystems_de/log/access.log;
error_log /var/log/nginx/vhost_error.log;
proxy_cache acme-cache;
proxy_cache_key "$scheme$host$proxy_host$uri$is_args$args";
proxy_cache_valid 200 302 60m;
proxy_cache_valid 404 10m;

     location ~* \.(jpg|gif|png|css|js)
     {
             try_files $uri @proxy;
     }

     location @proxy
     {
             proxy_pass https://backend-www.acme.eu_p45100;
     }

     location /
     {
             proxy_pass https://backend-www.acme.eu_p45100;
     }

}
upstream backend-www.acme.eu_p45100
{
server 10.1.1.25:45100;
server 10.1.1.26:45100;
server 10.1.1.27:45100;
server 10.1.1.28:45100;
server 10.1.1.15:45100;
server 10.1.1.18:45100;
server 10.1.1.20:45100;
server 10.1.1.36:45100;
server 10.1.1.39:45100;
server 10.1.1.40:45100;
server 10.1.1.42:45100;
server 10.1.1.21:45100;
server 10.1.1.22:45100;
server 10.1.1.23:45100;
server 10.1.1.29:45100;
server 10.1.1.50:45100;
server 10.1.1.43:45100;
server 10.1.1.45:45100;
server 10.1.1.46:45100;
server 10.1.1.19:45100;
server 10.1.1.10:45100;
}

Isaac

On 11/9/12 5:15 PM, Isaac H. wrote:
[…]

I also wonder where the 512 worker_connections from the error
message come from. There is no such number in my config. Is it
hardcoded somewhere?

http://nginx.org/en/docs/ngx_core_module.html#worker_connections

It’s the default number of worker_connections.


Maxim K.
+7 (910) 4293178

Refining my observations:

It’s not an issue of version or OS … those were wrong observations on
my side.

But: of the approx. 5000 vhosts, there are about 1000 that do SSL, each
on a different (high) port.

So without the SSL vhosts, I have about 1000 open files for nginx
(lsof | grep nginx | wc), and nginx runs fine.

With the SSL vhosts, I have about 17000 open files. And I get the
errors.

Does that ring a bell somewhere?
Also, 17000 is about 16 (number of worker processes) * 1000 (number of
SSL hosts) + 1000 (open files without SSL).

I also wonder where the 512 worker_connections from the error message
come from. There is no such number in my config. Is it hardcoded
somewhere?

Isaac

What does ‘cat /proc/sys/fs/file-nr’ say?


Maxim K.
+7 (910) 4293178

On 11/09/2012 07:52 PM, Maxim K. wrote:

What does ‘cat /proc/sys/fs/file-nr’ say?

cat /proc/sys/fs/file-nr
1696 0 205028

On 11/09/2012 05:27 PM, Maxim K. wrote:

On 11/9/12 5:15 PM, Isaac H. wrote:
[…]

I also wonder where the 512 worker_connections from the error
message come from. There is no such number in my config. Is it
hardcoded somewhere?

http://nginx.org/en/docs/ngx_core_module.html#worker_connections

It’s the default number of worker_connections.
Yes, but if I specify a different number, like the
worker_connections 2000;
in my config above, this should be different. Now this could lead to
the conclusion that nginx is not reading that file, but nginx -t
clearly shows that it does. Also, if I introduce syntax errors in that
file, nginx complains.

As Igor S. suggested earlier
(http://www.ruby-forum.com/topic/4407591#1083572),
the worker_connections parameter might not be related, since the cache
manager and loader also use connections, just not the configured number.
If these are hard-coded to a maximum of 512, this might be the cause:
there are exactly 1002 vhosts, each listening on a different port. Now
it's not 1024, which would be 512*2, but maybe there is some overhead
which makes me hit this limit?
If my thinking is correct (?), is there a way to overcome this limit?
(Other than using just one port for SSL … it would mean using
different IP addresses, which would have the same effect, I guess?)
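For reference, a minimal sketch of the one-port alternative (hostnames
and certificate paths are placeholders; it relies on SNI, which the
nginx -V output above reports as enabled):

server {
     listen 443 ssl;
     server_name host1.example.com;
     ssl_certificate     /etc/nginx/ssl/host1.crt;
     ssl_certificate_key /etc/nginx/ssl/host1.key;
}

server {
     listen 443 ssl;
     server_name host2.example.com;
     ssl_certificate     /etc/nginx/ssl/host2.crt;
     ssl_certificate_key /etc/nginx/ssl/host2.key;
}

Clients without SNI support would all receive the certificate of the
default server for that port, which may or may not be acceptable here.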

Any thoughts on this are welcome.

Isaac

On 09.11.2012 19:33, Isaac H. wrote:

I did several hours of testing today with Isaac and there are two
problems.

PROBLEM/BUG ONE:

First of all: the customer has 1000 SSL hosts on the nginx server, so
he wants to have 1000 listeners on TCP ports. But the cache_manager
isn't able to open that many listeners. It crashes after 512 open
listeners. It looks very much like the cache_manager doesn't read the
worker_connections setting from nginx.conf.

We configured:

worker_connections 10000;

there, but the cache_manager crashes with

2012/11/09 17:53:11 [alert] 9345#0: 512 worker_connections are not
enough
2012/11/09 17:53:12 [alert] 9330#0: cache manager process 9344 exited
with fatal code 2 and cannot be respawned

I did some testing: with 505 SSL hosts on the server (= 505 listener
sockets) everything works fine, but 515 listener sockets aren't
possible.

It's easy to reproduce: just define 515 SSL domains, each with a
different TCP port. :-)
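For example, a repro sketch along those lines (ports, hostnames,
certificate and cache paths are made up; a proxy_cache_path is assumed
so that a cache manager process gets started at all):

http {
     proxy_cache_path /var/cache/nginx/test keys_zone=test-cache:10m;

     server {
          listen 44301;
          ssl on;
          ssl_certificate     /etc/nginx/ssl/test.crt;   # any self-signed cert
          ssl_certificate_key /etc/nginx/ssl/test.key;
          server_name test001.example.com;
     }

     server {
          listen 44302;
          ssl on;
          ssl_certificate     /etc/nginx/ssl/test.crt;
          ssl_certificate_key /etc/nginx/ssl/test.key;
          server_name test002.example.com;
     }

     # ... and so on, up to 515 server blocks, each on its own port.
}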

Looks like nobody had the idea before that “somebody” ™ could run
more than two /24 networks' worth of IPs on one single host. In fact,
this does not happen in normal life…

But for historical reasons ™ our customer uses ONE IP address and
several TCP ports instead, so he has no problem running that many
different SSL hosts on one system – and this is the special situation
where we can see the bug (?): the cache_manager ignores the
worker_connections setting (?) when it tries to open all the listeners
and the related cache files/sockets.

So: Looks like a bug? Who can help? We need help…

PROBLEM/BUG TWO:

Having 16 workers for 1000 SSL domains with 1000 listeners, we can see
16 * 1000 open TCP listeners on that system, because every worker opens
its own listeners (?). When we reach the magical barrier of 16386 open
listeners (lsof -i | grep -c nginx), nginx runs into some kind of file
limitation:

2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
“worker process” (24: Too many open files)
2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
“cache manager process” (24: Too many open files)
2012/11/09 20:32:05 [alert] 9933#0: socketpair() failed while spawning
“cache loader process” (24: Too many open files)

It's very easy to see that the limitation kicks in at 16,386 open files
and sockets from nginx.

But I can't find the place where this limitation comes from. “ulimit
-n” is set to 100,000, so everything looks fine and should work with
many more open files than just 16K.

Could it be that “nobody” ™ expected that “somebody” ™ runs more
than 1000 SSL hosts with different TCP ports on 16 worker instances,
and that there's some kind of SMALL-INT problem in the nginx code?
Could it be that this isn't a limitation of the Linux system, but of
some kind of too-small address space for this in nginx?

So: Looks like a bug? Who can help? We need help…

Peer


Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

Tel: 030 / 405051-42
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer H. – Sitz: Berlin

On 09.11.2012 21:06, Andrew A. wrote:

Are you looking for a commercial support option to back up your customer’s
contract with an underpinning contract and vendor support?

First of all, I'm reporting some severe bugs in nginx. nginx should be
interested in that, and we really spent a lot of time on debugging and
analyzing this (and much of this time has NOT been paid).

And:

I've already been on the commercial support page, but there was no
“by call” support. I'm not interested in a 12-month contract to solve
one single problem.

I do ** NOT ** have a problem paying somebody to fix this. I would have
been happy over the last few days to have somebody else familiar with
nginx debug and fix it.

Unfortunately there was no “by call” support (or I haven't found it).

Feel free to send me an offer off-list for fixing this bug ASAP.

I’d appreciate this!

Peer


Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

Tel: 030 / 405051-42
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer H. – Sitz: Berlin

Hi,

On Nov 9, 2012, at 23:36, Peer H. [email protected]
wrote:

[…]

Are you looking for a commercial support option to back up your
customer’s contract with an underpinning contract and vendor support?

If that's the case, we've got our support options described here:

Hope this helps

On 11/9/12 10:33 PM, Isaac H. wrote:

[…]

Just for the record – the issue should be fixed by r4918:

http://trac.nginx.org/nginx/changeset/4918/nginx


Maxim K.
+7 (910) 4293178

On Nov 10, 2012, at 0:15, Peer H. [email protected]
wrote:

On 09.11.2012 21:06, Andrew A. wrote:

Are you looking for a commercial support option to back up your customer’s
contract with an underpinning contract and vendor support?

First of all, I'm reporting some severe bugs in nginx. nginx should be
interested in that, and we really spent a lot of time on debugging and
analyzing this (and much of this time has NOT been paid).

Thanks much. What about also filing a bug report in trac, please?
We’d definitely look more into that one and fix it during our normal dev
cycle for 1.3.x.

And:

I've already been on the commercial support page, but there was no
“by call” support. I'm not interested in a 12-month contract to solve
one single problem.

Got it.

I do ** NOT ** have a problem paying somebody to fix this. I would have
been happy over the last few days to have somebody else familiar with
nginx debug and fix it.

Unfortunately there was no “by call” support (or I haven't found it).

I’m glad you like what you’re doing for a living. Appreciate your
efforts debugging nginx too. We fix a lot of things, and often; check
the changelogs. We don’t have enough resources to fix everything ASAP
though. If you’ve got certain commercial commitments, so do we.

There are different options on

including an option to make a custom inquiry.