Nginx stops sending file after ~1.5mb?

YoSav_Moshe · January 11, 2009, 2:55pm

Hello,
I’m using Nginx to serve a few TurboGears projects and some Wordpress
sites.
Mostly is does it job more than well, but lately I’m facing some
problems
serving not-so-big static files.

The files come statically, and not from an upstream server.

    location ^~ /static/  {
        root /home/user/some/static/directory;
        gzip on;
        gzip_comp_level 9;
        gzip_types text/plain text/css application/x-javascript

text/xml
application/xml application/xml+rss text/javascript;
expires max;
}

Try this link for example, which is a 1.9mb PDF file uploaded by one of
my
users:
http://noal.org.il/static/uploads/u-20/[1231615256]hozer.pdf
If I’m correct, it will stop after ~1.5mb. The speed drops to 0K/s, and
my
logs show a timeout:
2009/01/11 15:36:48 [info] 546#0: *612 client timed out (110: Connection
timed out) while sending response to client, client: xxx.xxx.xxx.xxx,
server: noal.org.il, request: “GET
/static/uploads/u-20/%5B1231615256%5Dhozer.pdf HTTP/1.0”, host:
"noal.org.il
"

I’ve tried it both from my browser and using wget - exactly same thing.

What could be causing that?

I have
tcp_nopush on;
client_max_body_size 500M;
gzip on;
on my http {} part
and
client_header_buffer_size 1k;
large_client_header_buffers 8 8k;
in the server {} one.

But commenting them didn’t make a difference.

Any idea would be kindly appreciated.

Thanks.

YoSav_Moshe · January 11, 2009, 7:02pm

On Sun, 11 Jan 2009 15:43:36 +0200
“Yo’av Moshe” [email protected] wrote:

Any idea would be kindly appreciated.

What kind of OS environment are you using?
I’ve seen similiar happenings in VPSes where tcp send buffers run out.

–

Jure PeÄar
http://jure.pecar.org/

YoSav_Moshe · January 11, 2009, 8:03pm

Hey,
Well, I am running CentOS 5.2 in a VPS.

Is there anything I can do about it?

Thank you.

YoSav_Moshe · January 11, 2009, 8:37pm

/proc/user_beancounters doesn’t exists on my system…

YoSav_Moshe · January 11, 2009, 8:17pm

On Sun, 11 Jan 2009 20:46:58 +0200
“Yo’av Moshe” [email protected] wrote:

Hey,
Well, I am running CentOS 5.2 in a VPS.

Is there anything I can do about it?

Check /proc/user_beancounters and ask your hosting provider to increase
those variables where you see failcnt > 0. I’m sure tcpsendbuf is one of
them.

–

Jure PeÄar
http://jure.pecar.org/

YoSav_Moshe · January 11, 2009, 8:44pm

On Sun, 11 Jan 2009 21:23:04 +0200
“Yo’av Moshe” [email protected] wrote:

/proc/user_beancounters doesn’t exists on my system…

Then it is a kind of VPS I’m not familiar with. Check its documentation
on
how you figure out which resource limits are you hitting and work them
out
with your hosting provider.

–

Jure PeÄar
http://jure.pecar.org/

YoSav_Moshe · January 12, 2009, 11:40am

On Sun, Jan 11, 2009 at 08:46:58PM +0200, Yo’av Moshe wrote:

Hey,
Well, I am running CentOS 5.2 in a VPS.

Is there anything I can do about it?

It looks like VPS issue. In my test the file was being downloaded
on good speed and then has stopped:

fetch http://noal.org.il/static/uploads/u-20/[1231615256]hozer.pdf
%5B1231615256%5Dhozer.pdf 69% of 1904 kB 384 kBps

YoSav_Moshe · January 19, 2009, 1:34am

dd if=/path/to/broken/file of=/dev/null bs=1024

^ if this fails, maybe there is some underlying filesystem corruption.

Cheers

Dave

On Mon, 19 Jan 2009 01:35:24 +0200, Yo’av Moshe [email protected]
wrote:

Hey,
I talked with my VPS and they say the problem arise from my side.
It seems that only some files cannot be downloaded.

http://www.noal.org.il/static/temp/barvazi.pdf is a 6.3MB file, and it
downloads just fine.
http://www.noal.org.il/static/temp/barvazi2.pdf is a 1.9MB file, but it
stops after 1.5MB.
Both files are PDF files, and changing their names didn’t change
anything.

YoSav_Moshe · January 19, 2009, 12:02pm

It doesn’t fail.

1904+1 records in

1904+1 records out
1950230 bytes (2.0 MB) copied, 0.0322329 seconds, 60.5 MB/s

Also, I can fetch the file using SSH and it works fine, but not from
HTTP
using nginx.

Yo’av.

YoSav_Moshe · January 19, 2009, 1:21pm

strace your nginx instance and see what happens exactly when it stops

YoSav_Moshe · January 19, 2009, 12:45am

Hey,
I talked with my VPS and they say the problem arise from my side.
It seems that only some files cannot be downloaded.

http://www.noal.org.il/static/temp/barvazi.pdf is a 6.3MB file, and it
downloads just fine.
http://www.noal.org.il/static/temp/barvazi2.pdf is a 1.9MB file, but it
stops after 1.5MB.
Both files are PDF files, and changing their names didn’t change
anything.

Any idea? Could it still be a VPS issue? It’s seems so odd I’m not sure
what
to check…

Thanks a lot.
Yo’av.

YoSav_Moshe · January 20, 2009, 8:38am

On Tue, Jan 20, 2009 at 02:59:34AM +0200, Yo’av Moshe wrote:

is still more data to send, since it recognized the right size of the file
in the beginning (line 13: sendfile64(22, 23, [0], 1950230)), and it knows
it has more to send by the last sendfile event (line 1173: sendfile64(22,
23, [1618265], 331965)).

What stops the sending?

Looking at these lines:

2140 02:25:24 epoll_wait(11, <unfinished …>
2140 02:25:24 <… epoll_wait resumed> {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 22224) = 1
2140 02:25:24 gettimeofday({1232411124, 913871}, NULL) = 0
2140 02:25:24 sendfile64(22, 23, [1559225], 391005) = 28800

2140 02:25:24 epoll_wait(11, {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 22077) = 1
2140 02:25:25 gettimeofday({1232411125, 66164}, NULL) = 0
2140 02:25:25 sendfile64(22, 23, [1588025], 362205) = 30240

2140 02:25:25 epoll_wait(11, {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 21924) = 1
2140 02:25:25 gettimeofday({1232411125, 211986}, NULL) = 0
2140 02:25:25 sendfile64(22, 23, [1618265], 331965) = 28800

I see that client socket (22) is associated with
u64=13817743186826159041
in epoll. Also I see that the socket was added and modified in epoll
in the start of the transfer:

2140 02:25:12 epoll_ctl(11, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLET,
{u32=140043201
, u64=579069412499841985}}) = 0

2140 02:25:12 epoll_ctl(11, EPOLL_CTL_MOD, 22,
{EPOLLIN|EPOLLOUT|EPOLLET, {u32=140043201, u64=13817743186826159041}}) =
0

However, I do not see socket deletion from epoll or closing.
I see only that after 02:25:25 epoll_wait(), i.e. kernel, never
returned any events about 22 socket.

You may try the attached patch.

YoSav_Moshe · January 20, 2009, 10:52am

Sorry, but it is still the same… server is now running the modified
nginx,
but this file still can’t be fetched:
http://www.noal.org.il/static/temp/barvazi2.pdf .
I just can’t think what the problem might be.

If it helps, my VPS is running CentOS 5.2 with kernel 2.6.18.

Any more tests I can run?

Thanks again.
Yo’av.

2009/1/20 Igor S. [email protected]

YoSav_Moshe · January 20, 2009, 11:23am

On Tue, Jan 20, 2009 at 11:45:13AM +0200, Yo’av Moshe wrote:

Sorry, but it is still the same… server is now running the modified nginx,
but this file still can’t be fetched:
http://www.noal.org.il/static/temp/barvazi2.pdf .
I just can’t think what the problem might be.

If it helps, my VPS is running CentOS 5.2 with kernel 2.6.18.

Any more tests I can run?

The patch tries to sendfile() until EAGAIN, even it already said
that it sent smaller chunk than it was requested.

Could you create a new strace and grep single pid to ensure that
nginx calls sendfile() until EAGAIN ?

If the patch does not help, I suspect a bug in kernel/VPS/etc, as kernel
does not return event via epoll_wait().

YoSav_Moshe · January 20, 2009, 2:10am

Thanks.
I’m not an strace guru but I don’t think it shows anything unusual…

http://paste-it.net/public/mffe731/

You can see download starts right at the beginning, and at 02:25:25 you
can
see the last sendfile64 event. After that, no data was sent and I
couldn’t
find anything related to that in this strace log.

The only thing this strace log is telling me is that nginx knows that
there
is still more data to send, since it recognized the right size of the
file
in the beginning (line 13: sendfile64(22, 23, [0], 1950230)), and it
knows
it has more to send by the last sendfile event (line 1173:
sendfile64(22,
23, [1618265], 331965)).

What stops the sending?

Thanks again.
Yo’av.

YoSav_Moshe · January 20, 2009, 8:38pm

Could anyone confirm that this works:
wget --limit-rate=3500 http://www.noal.org.il/static/temp/barvazi2.pdf
while this doesn’t:
wget http://www.noal.org.il/static/temp/barvazi2.pdf

Confirmed

YoSav_Moshe · January 20, 2009, 7:35pm

Hey,
It’s gets even weirder. I found out that if I limit the download speed,
I
can get the whole file.

Could anyone confirm that this works:
wget --limit-rate=3500 http://www.noal.org.il/static/temp/barvazi2.pdf
while this doesn’t:
wget http://www.noal.org.il/static/temp/barvazi2.pdf

Thank you.
Yo’av

YoSav_Moshe · January 20, 2009, 8:48pm

wget http://www.noal.org.il/static/temp/barvazi2.pdf
–2009-01-20 21:36:00-- http://www.noal.org.il/static/temp/barvazi2.pdf
Resolving www.noal.org.il… 195.28.180.147
Connecting to www.noal.org.il|195.28.180.147|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 1950230 (1.9M) [application/pdf]
Saving to: `barvazi2.pdf’

81% [=====================================> ] 1,596,665
–.-K/s eta 5s

wget -c http://www.noal.org.il/static/temp/barvazi2.pdf
–2009-01-20 21:38:28–
http://www.noal.org.il/static/temp/barvazi2.pdf
Resolving www.noal.org.il… 195.28.180.147
Connecting to www.noal.org.il|195.28.180.147|:80… connected.
HTTP request sent, awaiting response… 206 Partial Content
Length: 1950230 (1.9M), 305960 (299K) remaining [application/pdf]
Saving to: `barvazi2.pdf’

100%[+++++++++++++++++++++++++++++++++++++++=======>] 1,950,230
295K/s in 1.0s

2009-01-20 21:38:30 (295 KB/s) - `barvazi2.pdf’ saved [1950230/1950230]

YoSav_Moshe · January 21, 2009, 1:06am

Igor, here’s the new strace.
http://pastebin.com/m5129c54b

Does the fact it works in low download rates still makes you think it’s
a
kernel issue?

Thank you, really!
And thanks, Kiril and Jeff, for confirming that!

Yo’av.

YoSav_Moshe · January 21, 2009, 6:39am

On Wed, Jan 21, 2009 at 01:52:53AM +0200, Yo’av Moshe wrote:

Igor, here’s the new strace.
http://pastebin.com/m5129c54b

Does the fact it works in low download rates still makes you think it’s a
kernel issue?

Yes, this is seems like a kernel issue. Kernel excatly said (with
EAGAIN)
that there is no place to output data, but never returned a new
notification
via epoll_wait() about free space:

21420 01:39:43 <… epoll_wait resumed> {{EPOLLOUT, {u32=153137960,
u64=13811628390360331048}}}, 512, 60000) = 1
21420 01:39:43 gettimeofday({1232494783, 413764}, NULL) = 0
21420 01:39:43 sendfile64(20, 21, [1645625], 304605) = 40320
21420 01:39:43 sendfile64(20, 21, [1685945], 264285) = -1 EAGAIN
(Resource temporarily unavailable)
21420 01:39:43 epoll_wait(11, <unfinished …>

This is probably some race condition inside kernel, that does not appear
on
low speed download.