All workers in 'D' state using sendfile

Hello,

I have tried to summarize this as much as possible, but it's still a lot of
text. I apologize, but I wanted to make sure that I provide enough
information to explain the issue properly.

I'm hoping that somebody who uses nginx as a high traffic/concurrency
download server will be able to shed some light on this issue. I've tried
as many things as I can think of and everything keeps pointing to it being
an issue with nginx, not the server - but I am of course more than willing
to try any suggestions provided.

Background:
Approx. 1,500 - 5,000 concurrent connections (peak / off-peak),
Files vary in size from 5MB to 2GB,
All downloads; only very small dynamic content scripts run on these servers
and none take more than 1-3 seconds,
Files are hosted on direct-attached AoE storage with a dedicated 10GE link,
Server is running nginx-1.0.11, php-fpm 5.3 and CentOS 5.8 x64
(2.6.18-308.4.1.el5.centos.plus).
Specs are: Dual Xeon E5649 (6 Core), 32GB RAM, 300GB 10k SAS HDD, AoE DAS
over 10GE.
Download speeds are restricted by the PHP handoff using X-Accel-Redirect,
but obviously not when I'm testing :wink:

Issue:
After running for a short but random period of time (5min ~ 90min), all
nginx workers will eventually end up in a 'D' state according to ps/top.
This causes all downloads to run extremely slowly (~25kb/s), but it doesn't
seem to be caused by I/O, because an scp of the same file will complete at
the expected speed of ~750MB+/s.

I usually run with worker_processes set to 13, but I've had to raise this
to 50 to prevent the issue. This works short term, but I'm guessing
eventually I will need to restart nginx to fix it.

Config:
I'm using sendfile with epoll, and using the following events / http
settings (I've removed the location block with the fastcgi handler, etc):

events {
    worker_connections  16384;
    use                 epoll;
}

http {

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   0;

    ....

    location /internalAccess/ {
        internal;
        alias           /data/;
    }

}
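
For context, the dynamic side of the handoff simply returns an
X-Accel-Redirect header pointing into that internal location. A rough
sketch of the (removed) fastcgi handler that would sit inside the http
block above - the socket path, file name and rate value are placeholders,
not my actual config:

    location ~ \.php$ {
        # hypothetical handler location; the real fastcgi params are omitted
        include         fastcgi_params;
        fastcgi_pass    unix:/tmp/php-fpm.sock;
    }

    # The PHP script responds with headers along the lines of:
    #   X-Accel-Redirect: /internalAccess/some/file.bin
    #   X-Accel-Limit-Rate: 512000
    # and nginx then serves /data/some/file.bin itself, applying the
    # per-connection rate limit from the header.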

Kind Regards,

Drew

Hello!

On Sat, May 12, 2012 at 08:28:14PM +1000, Drew Wareham wrote:

> I usually run with worker_processes set to 13, but I've had to raise
> this to 50 to prevent the issue. This works short term, but I'm
> guessing eventually I will need to restart nginx to fix it.
>
> Config:
> I'm using sendfile with epoll, and using the following events / http
> settings (I've removed the location block with the fastcgi handler, etc):

With rotational disks you have to optimize iops to minimize seeks.
This includes:

  1. Switch off sendfile; it works badly on such workloads under Linux
     because there is no way to control readahead (and hence which blocks
     are read from disk).

  2. Use large output buffers; something like

    output_buffers 1 512k;

     would be a good starting point.

  3. Try using aio to ensure better disk concurrency (and note that under
     Linux it needs directio as well), i.e. something like this (combined
     with the other settings in the sketch below):

    aio on;
    directio 512;

     (This will require a newer kernel though; using 2.6.18 nowadays
     looks like a bad idea, at least if you need speed.)

  4. Try tuning the io scheduler; there have been reports that deadline
     might be better for such workloads.

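Putting suggestions 1-3 together, a starting point for the http block
might look like this (the sizes are just the example values above, not
something tuned for your hardware):

    sendfile        off;       # 1: avoid uncontrolled readahead via sendfile()
    output_buffers  1 512k;    # 2: fewer, larger reads from disk
    aio             on;        # 3: asynchronous disk reads
    directio        512;       # needed for aio to take effect under Linux

Suggestion 4 (the io scheduler) is a kernel-side setting for the block
device, not an nginx directive, so it is not shown here.
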
More details can be found here:

http://nginx.org/r/output_buffers
http://nginx.org/r/aio
http://nginx.org/r/directio

Maxim D.

Hi Maxim,

Thanks for your reply and sorry for the delay in responding!

I've applied your suggested changes to three servers in the cluster -
hopefully that will give me an accurate idea of their effectiveness. I'll
report back when I have more useful info.

Thanks again,

Drew

Hi,

Unfortunately the changes were not the silver bullet I was hoping for.
However, although the processes are still sleeping and the servers are
still under very high load, the changes do seem to have helped - downloads
are no longer unable to start, which is great!

I'm not sure if it's helpful, but an strace on a sleeping process looks
like this:

io_submit(47044328767488, 1, {{0x13dab860, 0, 0, 0, 66}}) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554189657, u64=47044330982745}}, {EPOLLOUT, {u32=1554179536, u64=47044330972624}}}, 512, 36) = 2
writev(250, [{"\321\4\vS\0313\237F\222\337\246\322\33(\30\245=g\352\271\2\361\244p\240\377Q\314\2371^\256"..., 161636}], 1) = 20440
writev(69, [{"\235\337}\33\201\214)\321\343\233\22\346$z\374\2126T\j\210\250L\250\331{\220\333\323\343\341J"..., 386840}], 1) = 130320
epoll_wait(9, {{EPOLLOUT, {u32=1554194624, u64=47044330987712}}}, 512, 9) = 1
writev(222, [{"<\247\260P\237\2455\236\244\352!\237s\223h\25\206N3[{\351f\31\275\344b\5\204\f\v"..., 396976}], 1) = 95568
epoll_wait(9, {{EPOLLOUT, {u32=1554205480, u64=47044330998568}}}, 512, 26) = 1
writev(286, [{"E\260q\214\346X[\376\305\5\275\352\344\256q\327\344m\r\236\t\321\354\200\325\333\351E\340\374\232"..., 240116}], 1) = 64240
epoll_wait(9, {{EPOLLOUT, {u32=1554195361, u64=47044330988449}}}, 512, 25) = 1
writev(133, [{"\243Y\373y\10\0252\34\32\22\2\36\227\325e\345\333\372=\365./\340\34V\251U\0373\234\35\250"..., 13732}], 1) = 13732
io_submit(47044328767488, 1, {{0x12599628, 0, 0, 0, 209}}) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554199961, u64=47044330993049}}}, 512, 5) = 1
writev(49, [{"+\347^\17\322\354\201\20=\35\246b\200\0\214K'z>\344k\331\272Svh\234`\334)|\205"..., 176592}], 1) = 84120
epoll_wait(9, {}, 512, 1) = 0
epoll_wait(9, {}, 512, 4) = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554179905, u64=47044330972993}}}, 512, 14) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554193521, u64=47044330986609}}}, 512, 10) = 1
writev(137, [{"\212\375\216\330'\315^\20|\350N\362\25j\272\304=v\227\210?\3539S\343\6D\265C-\360J"..., 336856}], 1) = 96360
epoll_wait(9, {{EPOLLOUT, {u32=1554181193, u64=47044330974281}}}, 512, 9) = 1
writev(79, [{"\321\277\340\360E\323A\352\352\377\357w\357m_\377\377R\0\200\177\365l\200 \314D\24z\21U\0"..., 228056}], 1) = 128480
epoll_wait(9, {}, 512, 3) = 0
epoll_wait(9, {}, 512, 8) = 0
epoll_wait(9, {}, 512, 2) = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554204009, u64=47044330997097}}}, 512, 26) = 1
writev(67, [{"\204-& V\325?\375\33\202B\236\216\r\240\360\17\0103\25\274\3\300>\352\267\211BJ\265\23\327"..., 166588}], 1) = 26280
epoll_wait(9, {}, 512, 12) = 0
epoll_wait(9, {{EPOLLIN, {u32=6779072, u64=6779072}}}, 512, 14) = 1
read(10, "\1\0\0\0\0\0\0\0", 8) = 8
io_getevents(47044328767488, 1, 64, {{0x13dab860, 0x13dab820, 524288, 0}}, {0, 0}) = 1
writev(80, [{"X\361N8\2\214\203n\263t\240\\335\241k\212N\366\24\222\32\201u\267\272\32\v\326=\373\34\v"..., 524288}], 1) = 56608
epoll_wait(9, {{EPOLLOUT, {u32=1554187265, u64=47044330980353}}}, 512, 13) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554183217, u64=47044330976305}}}, 512, 3) = 1
epoll_wait(9, {}, 512, 1) = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554181744, u64=47044330974832}}}, 512, 13) = 1
writev(121, [{"\371s\222d\231\313\17\t\227\31\33a\315\304\365NZ7\323\200\347\337\260\355\253\203\30\215N\313\260d"..., 331027}], 1) = 49640
epoll_wait(9, {}, 512, 2) = 0
epoll_wait(9, {}, 512, 8) = 0
epoll_wait(9, {{EPOLLOUT, {u32=1554192968, u64=47044330986056}}}, 512, 12) = 1
epoll_wait(9, {{EPOLLIN, {u32=6779072, u64=6779072}}}, 512, 12) = 1
read(10, "\1\0\0\0\0\0\0\0", 8) = 8
io_getevents(47044328767488, 1, 64, {{0x12599628, 0x125995e8, 524288, 0}}, {0, 0}) = 1
writev(133, [{"&yj\373dw\335\364\232k\310\6\204\363\365=c{V\257\6:\225\354\233\253b\27*\221\4\264"..., 524288}], 1) = 114748
epoll_wait(9, {{EPOLLOUT, {u32=1554203825, u64=47044330996913}}}, 512, 6) = 1
epoll_wait(9, {{EPOLLOUT, {u32=1554179536, u64=47044330972624}}}, 512, 5) = 1
writev(69, [{"X\4\250\274\3415\212A\20D\30\2122.\23\351n%\226\245\250\242b$\271\t\22/fX\303\263"..., 256520}], 1) = 130320
epoll_wait(9, {{EPOLLOUT, {u32=1554192784, u64=47044330985872}}}, 512, 27) = 1

An strace -c taken only when a process is in the ‘D’ state shows:

[root@HOST16 ~]# time strace -p 22334 -c
Process 22334 attached - interrupt to quit
Process 22334 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.65    0.008273         109        76           io_submit
 29.35    0.003437          10       360           writev
  0.00    0.000000           0        26           read
  0.00    0.000000           0         3           open
  0.00    0.000000           0         8           close
  0.00    0.000000           0         3           fstat
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1         1 connect
  0.00    0.000000           0        11         3 recvfrom
  0.00    0.000000           0         1           getsockname
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        44           fcntl
  0.00    0.000000           0        26           io_getevents
  0.00    0.000000           0       180           epoll_wait
  0.00    0.000000           0         4           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.011710                   748         4 total

real 0m8.570s
user 0m0.016s
sys 0m0.028s

When it's not in the sleeping state, about 25% of the time is spent in
epoll_wait. I've also noticed that vmstat shows far less frequent
swapping, although now instead of 5-10MB regularly, it'll swap ~100MB
every 30+ seconds.

Cheers,

Drew

Hi Maxim,

Are there any plans to develop a threaded version?

2012/5/12 Maxim D. [email protected]:

If you comment/remove “use epoll;” does that help?

Hi Ryan,

If I comment out "use epoll;", it still seems to use epoll according to
strace. These are the compile flags I use:

./configure --prefix=/usr/local/nginx-1.2.0 --with-pcre \
    --add-module=/usr/local/src/nginx_upload_module-2.2.0 \
    --with-http_stub_status_module --with-file-aio \
    --without-http_proxy_module --without-http_memcached_module \
    --without-http_gzip_module --without-http_ssi_module \
    --without-http_userid_module --without-http_autoindex_module \
    --without-http_geo_module --without-http_map_module \
    --without-http_empty_gif_module --without-http_browser_module \
    --without-http_upstream_ip_hash_module --without-http_charset_module

This is what I’m running with now:

events {
    worker_connections  51200;
    use                 epoll;
    multi_accept        on;
}

http {
    include             mime.types;
    default_type        application/octet-stream;

    server_tokens       off;
    sendfile            off;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   10;

    aio                 on;
    directio            4k;
    output_buffers      1 512k;
    max_ranges          5;
    ...

Cheers,

Drew