High number of connections_writing stuck

Hi,

Recently we added a ‘thread_pool’ directive to our main configuration. A few
hours later we saw a huge increase in the connections_writing stat as reported
by the stub_status module. This number reached about 3800 and has been stuck
there since. The server in question is operating normally, but this is very
strange.

Any hints on what this could be?

Some info:

  • Here is a graph of the stats reported, for a server with thread_pool and
    another without: http://imgur.com/a/lF2EL

  • I don’t have older data anymore, but the jump from <100 to about 3800
    connections_writing happened in two sharp jumps, the first one following
    a reload;

  • The machines’ hardware and software are identical except for the
    thread_pool directive in their nginx.conf. They live in two different
    data centers;

  • Both machines are performing normally. Nothing unusual in CPU or RAM
    usage. Nginx performance is about the same.

  • Reloading Nginx with ‘nginx -s reload’ does not change the number.
    Restarting the process brings connections_writing back down.

Debug stuff:

mallmann# uname -a
Linux xxx 3.8.13-98.5.2.el6uek.x86_64 #2 SMP Tue Nov 3 18:32:04 PST 2015
x86_64 x86_64 x86_64 GNU/Linux

mallmann# nginx -V
nginx version: nginx/1.8.0
built by gcc 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
built with OpenSSL 1.0.1e-fips 11 Feb 2013
TLS SNI support enabled
configure arguments: --prefix=/usr/share/nginx
--sbin-path=/usr/sbin/nginx
--conf-path=/etc/nginx/nginx.conf
--error-log-path=/var/log/nginx/error.log
--http-log-path=/var/log/nginx/access.log
--http-client-body-temp-path=/var/lib/nginx/tmp/client_body
--http-proxy-temp-path=/var/lib/nginx/tmp/proxy
--http-fastcgi-temp-path=/var/lib/nginx/tmp/fastcgi
--http-uwsgi-temp-path=/var/lib/nginx/tmp/uwsgi
--http-scgi-temp-path=/var/lib/nginx/tmp/scgi
--pid-path=/var/run/nginx.pid
--lock-path=/var/lock/subsys/nginx --user=nginx --group=nginx
--with-ipv6
--with-http_ssl_module --with-http_realip_module
--with-http_addition_module --with-http_xslt_module
--with-http_image_filter_module --with-http_geoip_module
--with-http_sub_module --with-http_flv_module --with-http_mp4_module
--with-http_gunzip_module --with-http_gzip_static_module
--with-http_random_index_module --with-http_secure_link_module
--with-http_degradation_module --with-http_stub_status_module
--with-http_perl_module --with-mail --with-mail_ssl_module --with-pcre
--with-google_perftools_module
--add-module=/builddir/build/BUILD/nginx-1.8.0/headers-more-nginx-module-0.25
--add-module=/builddir/build/BUILD/nginx-1.8.0/ngx_http_bytes_filter_module
--add-module=/builddir/build/BUILD/nginx-1.8.0/echo-nginx-module-0.55
--with-threads --with-debug --with-cc-opt='-O2 -g -pipe -Wall
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic' --with-ld-opt=' -Wl,-E'

Affected server:

mallmann# lsof -n -u nginx | awk '{print $5}' | sort | uniq -c | sort -nr
4172 REG
2140 IPv4
100 unix
30 CHR
20 DIR
20 0000
3 sock
1 TYPE
mallmann# curl http://127.0.0.1/status
Active connections: 5924
server accepts handled requests
5864099 5864099 15527178
Reading: 0 Writing: 3883 Waiting: 2040

Normal server:

mallmann# lsof -n -u nginx | awk '{print $5}' | sort | uniq -c | sort -nr
4454 REG
1967 IPv4
100 unix
30 CHR
20 DIR
20 0000
1 unknown
1 TYPE
1 sock
mallmann# curl http://127.0.0.1/status
Active connections: 2096
server accepts handled requests
1136132 1136132 3464904
Reading: 0 Writing: 107 Waiting: 1989

From what you provided, I do not see anything unexpected…

Enabling thread usage with the thread_pool directive and aio threads enables
asynchronous file reading/writing.
The connections_writing stat indicates the number of connections to which the
server is currently writing a response.
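
For reference, here is a minimal sketch of what such a setup typically looks
like; the pool name, thread counts, paths and the /status location below are
illustrative assumptions, not taken from your actual configuration:

thread_pool default threads=32 max_queue=65536;  # main (top-level) context

http {
    server {
        location / {
            root /usr/share/nginx/html;
            aio  threads=default;   # offload blocking file reads/writes to the pool
        }

        location /status {
            stub_status;            # source of the Active/Reading/Writing/Waiting counters
            allow 127.0.0.1;
            deny  all;
        }
    }
}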

Since you are working asynchronously, if a request takes much less time to
process than its response takes to be sent, it is normal for this number to
increase until it reaches an equilibrium.

You saw this number increase when you reloaded the configuration -> this shows
the new configuration was applied immediately.
You saw this number drop on restart -> restarting kills the current processes
and the connections attached to them.
Nothing to see on CPU & RAM -> they are probably not the bottleneck, and one of
nginx’s strengths is its minimal footprint there compared to other Web server
alternatives. Have a look at I/O wait to check whether the worker processes
are having a hard time getting the content they seek from the disk. Less
often, it might also be congestion on backend communication, if content is
being read from one.
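
For example (just a sketch; iostat comes from the sysstat package and may not
be installed on your boxes):

# iostat -x 1 5   # per-device utilization and await times
# vmstat 1 5      # the 'wa' column is CPU time spent waiting on I/O
# top             # '%wa' in the CPU summary line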

To me, it is just a matter of the ratio between time to process a request and
time to serve the answer. Multithreading means a worker might be answering
several connections at the same time.
If I am right about nginx’s design, events might be processed in any order,
but each worker still only does one thing at a time and only switches tasks
when the current one is finished or waiting on something else. Multithreading
changes that whole behavior.

Hope this helps,

B. R.

Hi,

Just getting back to you. Sorry about the delay.

This behaviour was caused by a custom patch in a module: the patch was ending
requests without finalizing them, which leaked connections.

Eventually Nginx just exploded =) Nothing like production workload to
stress things out.

Thanks a lot.

On Sun, Dec 20, 2015 at 11:04 AM, Valentin V. Bartenev [email protected] wrote:

On Friday 18 December 2015 15:55:47 Marcelo MD wrote:

[…] thread_pool directive in their nginx.conf. They live in two different
data centers;

  • Both machines are performing normally. Nothing unusual in CPU or RAM
    usage. Nginx performance is about the same.

  • Reloading Nginx with ‘nginx -s reload’ does nothing. Restarting the
    process brings connections_writing down.
    […]

As I understand from your message, everything works well. You should also
check the error_log; if it doesn’t have anything suspicious, then there is
nothing to worry about.

The increased number of writing connections can be explained by increased
concurrency. Now the nginx processing cycle doesn’t block on disk and can
accept more connections at the same time. All the connections that were
waiting in the listen socket before are now waiting in the thread pool.
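
If you want to confirm that the pool is actually in use, you can list the
threads of a worker process; with ‘aio threads’ each worker should show extra
threads besides the main one. A rough sketch (the pid path is the one from
your configure arguments; <worker_pid> is a placeholder):

# cat /var/run/nginx.pid                       # master process pid
# ps -o pid= --ppid $(cat /var/run/nginx.pid)  # worker pids
# ps -T -p <worker_pid>                        # one line per thread of that worker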

wbr, Valentin V. Bartenev

Testing in production?
Well, you are not the only one… far from it.

There are other ways to create ‘realistic’ traffic on test benches, though,
provided you design those tests well.

B. R.

=)

That module passed all our tests, both in dev and staging environments. We
were deploying in batches and caught it on the first one. We also rolled back
before any incidents.

Not perfect but pretty safe, I think.

Maybe we need to be more ‘creative’ with our testing.