Forum: NGINX Nginx stops sending file after ~1.5mb ?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-11 14:55
(Received via mailing list)
Hello,
I'm using Nginx to serve a few TurboGears projects and some Wordpress
sites.
Mostly is does it job more than well, but lately I'm facing some
problems
serving not-so-big static files.

The files come statically, and not from an upstream server.

        location ^~ /static/  {
            root /home/user/some/static/directory;
            gzip on;
            gzip_comp_level 9;
            gzip_types text/plain text/css application/x-javascript
text/xml
application/xml application/xml+rss text/javascript;
            expires max;
        }

Try this link for example, which is a 1.9mb PDF file uploaded by one of
my
users:
http://noal.org.il/static/uploads/u-20/%5B12316152...
If I'm correct, it will stop after ~1.5mb. The speed drops to 0K/s, and
my
logs show a timeout:
2009/01/11 15:36:48 [info] 546#0: *612 client timed out (110: Connection
timed out) while sending response to client, client: xxx.xxx.xxx.xxx,
server: noal.org.il, request: "GET
/static/uploads/u-20/%5B1231615256%5Dhozer.pdf HTTP/1.0", host:
"noal.org.il
"

I've tried it both from my browser and using wget - exactly same thing.

What could be causing that?

I have
    tcp_nopush        on;
    client_max_body_size 500M;
    gzip  on;
on my http {} part
and
        client_header_buffer_size    1k;
        large_client_header_buffers  8 8k;
in the server {} one.

But commenting them didn't make a difference.

Any idea would be kindly appreciated.

Thanks.
15af4b7e3a7593e5c4adfd041de4101a?d=identicon&s=25 Jure Pečar (Guest)
on 2009-01-11 19:02
(Received via mailing list)
On Sun, 11 Jan 2009 15:43:36 +0200
"Yo'av Moshe" <bjesus@gmail.com> wrote:

> Any idea would be kindly appreciated.

What kind of OS environment are you using?
I've seen similiar happenings in VPSes where tcp send buffers run out.


--

Jure Pečar
http://jure.pecar.org/
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-11 20:03
(Received via mailing list)
Hey,
Well, I am running CentOS 5.2 in a VPS.

Is there anything I can do about it?

Thank you.
15af4b7e3a7593e5c4adfd041de4101a?d=identicon&s=25 Jure Pečar (Guest)
on 2009-01-11 20:17
(Received via mailing list)
On Sun, 11 Jan 2009 20:46:58 +0200
"Yo'av Moshe" <bjesus@gmail.com> wrote:

> Hey,
> Well, I am running CentOS 5.2 in a VPS.
>
> Is there anything I can do about it?

Check /proc/user_beancounters and ask your hosting provider to increase
those variables where you see failcnt > 0. I'm sure tcpsendbuf is one of
them.


--

Jure Pečar
http://jure.pecar.org/
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-11 20:37
(Received via mailing list)
/proc/user_beancounters doesn't exists on my system...
15af4b7e3a7593e5c4adfd041de4101a?d=identicon&s=25 Jure Pečar (Guest)
on 2009-01-11 20:44
(Received via mailing list)
On Sun, 11 Jan 2009 21:23:04 +0200
"Yo'av Moshe" <bjesus@gmail.com> wrote:

> /proc/user_beancounters doesn't exists on my system...

Then it is a kind of VPS I'm not familiar with. Check its documentation
on
how you figure out which resource limits are you hitting and work them
out
with your hosting provider.


--

Jure Pečar
http://jure.pecar.org/
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-12 11:40
(Received via mailing list)
On Sun, Jan 11, 2009 at 08:46:58PM +0200, Yo'av Moshe wrote:

> Hey,
> Well, I am running CentOS 5.2 in a VPS.
>
> Is there anything I can do about it?

It looks like VPS issue. In my test the file was being downloaded
on good speed and then has stopped:

fetch http://noal.org.il/static/uploads/u-20/%5B12316152...
%5B1231615256%5Dhozer.pdf                      69% of 1904 kB  384 kBps
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-19 00:45
(Received via mailing list)
Hey,
I talked with my VPS and they say the problem arise from my side.
It seems that only *some* files cannot be downloaded.

http://www.noal.org.il/static/temp/barvazi.pdf is a 6.3MB file, and it
downloads just fine.
http://www.noal.org.il/static/temp/barvazi2.pdf is a 1.9MB file, but it
stops after 1.5MB.
Both files are PDF files, and changing their names didn't change
anything.

Any idea? Could it still be a VPS issue? It's seems so odd I'm not sure
what
to check...

Thanks a lot.
Yo'av.
Fda08117336cfde6562315df04b976e8?d=identicon&s=25 Dave Cheney (Guest)
on 2009-01-19 01:34
(Received via mailing list)
dd if=/path/to/broken/file of=/dev/null bs=1024

^ if this fails, maybe there is some underlying filesystem corruption.

Cheers

Dave

On Mon, 19 Jan 2009 01:35:24 +0200, Yo'av Moshe <bjesus@gmail.com>
wrote:
> Hey,
> I talked with my VPS and they say the problem arise from my side.
> It seems that only *some* files cannot be downloaded.
>
> http://www.noal.org.il/static/temp/barvazi.pdf is a 6.3MB file, and it
> downloads just fine.
> http://www.noal.org.il/static/temp/barvazi2.pdf is a 1.9MB file, but it
> stops after 1.5MB.
> Both files are PDF files, and changing their names didn't change
anything.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-19 12:02
(Received via mailing list)
It doesn't fail.

1904+1 records in
> 1904+1 records out
> 1950230 bytes (2.0 MB) copied, 0.0322329 seconds, 60.5 MB/s
>
Also, I can fetch the file using SSH and it works fine, but not from
HTTP
using nginx.

Yo'av.
C87d094f9c82de9bbe00039d6d0045f3?d=identicon&s=25 Just Marc (Guest)
on 2009-01-19 13:21
(Received via mailing list)
strace your nginx instance and see what happens exactly when it stops
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-20 02:10
(Received via mailing list)
Thanks.
I'm not an strace guru but I don't think it shows anything unusual...

http://paste-it.net/public/mffe731/

You can see download starts right at the beginning, and at 02:25:25 you
can
see the last sendfile64 event. After that, no data was sent and I
couldn't
find anything related to that in this strace log.

The only thing this strace log is telling me is that nginx knows that
there
is still more data to send, since it recognized the right size of the
file
in the beginning (line 13: sendfile64(22, 23, [0], *1950230)*), and it
knows
it has more to send by the last sendfile event (line 1173:
sendfile64(22,
23, [1618265], *331965*)).

What stops the sending?

Thanks again.
Yo'av.
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-20 08:38
(Received via mailing list)
Attachment: patch.linux_sendfile (1 KB)
On Tue, Jan 20, 2009 at 02:59:34AM +0200, Yo'av Moshe wrote:

> is still more data to send, since it recognized the right size of the file
> in the beginning (line 13: sendfile64(22, 23, [0], *1950230)*), and it knows
> it has more to send by the last sendfile event (line 1173: sendfile64(22,
> 23, [1618265], *331965*)).
>
> What stops the sending?

Looking at these lines:

2140  02:25:24 epoll_wait(11,  <unfinished ...>
2140  02:25:24 <... epoll_wait resumed> {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 22224) = 1
2140  02:25:24 gettimeofday({1232411124, 913871}, NULL) = 0
2140  02:25:24 sendfile64(22, 23, [1559225], 391005) = 28800

2140  02:25:24 epoll_wait(11, {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 22077) = 1
2140  02:25:25 gettimeofday({1232411125, 66164}, NULL) = 0
2140  02:25:25 sendfile64(22, 23, [1588025], 362205) = 30240

2140  02:25:25 epoll_wait(11, {{EPOLLOUT, {u32=140043201,
u64=13817743186826159041}}}, 512, 21924) = 1
2140  02:25:25 gettimeofday({1232411125, 211986}, NULL) = 0
2140  02:25:25 sendfile64(22, 23, [1618265], 331965) = 28800

I see that client socket (22) is associated with
u64=13817743186826159041
in epoll. Also I see that the socket was added and modified in epoll
in the start of the transfer:

2140  02:25:12 epoll_ctl(11, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLET,
{u32=140043201
, u64=579069412499841985}}) = 0

2140  02:25:12 epoll_ctl(11, EPOLL_CTL_MOD, 22,
{EPOLLIN|EPOLLOUT|EPOLLET, {u32=140043201, u64=13817743186826159041}}) =
0

However, I do not see socket deletion from epoll or closing.
I see only that after 02:25:25 epoll_wait(), i.e. kernel, never
returned any events about 22 socket.

You may try the attached patch.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-20 10:52
(Received via mailing list)
Sorry, but it is still the same... server is now running the modified
nginx,
but this file still can't be fetched:
http://www.noal.org.il/static/temp/barvazi2.pdf .
I just can't think what the problem might be.

If it helps, my VPS is running CentOS 5.2 with kernel 2.6.18.

Any more tests I can run?

Thanks again.
Yo'av.


2009/1/20 Igor Sysoev <is@rambler-co.ru>
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-20 11:23
(Received via mailing list)
On Tue, Jan 20, 2009 at 11:45:13AM +0200, Yo'av Moshe wrote:

> Sorry, but it is still the same... server is now running the modified nginx,
> but this file still can't be fetched:
> http://www.noal.org.il/static/temp/barvazi2.pdf .
> I just can't think what the problem might be.
>
> If it helps, my VPS is running CentOS 5.2 with kernel 2.6.18.
>
> Any more tests I can run?

The patch tries to sendfile() until EAGAIN, even it already said
that it sent smaller chunk than it was requested.

Could you create a new strace and grep single pid to ensure that
nginx calls sendfile() until EAGAIN ?

If the patch does not help, I suspect a bug in kernel/VPS/etc, as kernel
does not return event via epoll_wait().
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-20 19:35
(Received via mailing list)
Hey,
It's gets even weirder. I found out that if I limit the download speed,
I
can get the whole file.

Could anyone confirm that this works:
wget --limit-rate=3500 http://www.noal.org.il/static/temp/barvazi2.pdf
while this doesn't:
wget http://www.noal.org.il/static/temp/barvazi2.pdf

Thank you.
Yo'av
D69d23d8e811e8ab2a8593380d6ede63?d=identicon&s=25 Jeff Emminger (jemminger)
on 2009-01-20 20:38
(Received via mailing list)
> Could anyone confirm that this works:
> wget --limit-rate=3500 http://www.noal.org.il/static/temp/barvazi2.pdf
> while this doesn't:
> wget http://www.noal.org.il/static/temp/barvazi2.pdf
>

Confirmed
7f006226eff37eaa07b1e8b96f521cb2?d=identicon&s=25 Kiril Angov (Guest)
on 2009-01-20 20:48
(Received via mailing list)
wget http://www.noal.org.il/static/temp/barvazi2.pdf
--2009-01-20 21:36:00--  http://www.noal.org.il/static/temp/barvazi2.pdf
Resolving www.noal.org.il... 195.28.180.147
Connecting to www.noal.org.il|195.28.180.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1950230 (1.9M) [application/pdf]
Saving to: `barvazi2.pdf'

81% [=====================================>         ] 1,596,665
--.-K/s  eta 5s


wget -c http://www.noal.org.il/static/temp/barvazi2.pdf
--2009-01-20 21:38:28--
http://www.noal.org.il/static/temp/barvazi2.pdf
Resolving www.noal.org.il... 195.28.180.147
Connecting to www.noal.org.il|195.28.180.147|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1950230 (1.9M), 305960 (299K) remaining [application/pdf]
Saving to: `barvazi2.pdf'

100%[+++++++++++++++++++++++++++++++++++++++=======>] 1,950,230
295K/s   in 1.0s

2009-01-20 21:38:30 (295 KB/s) - `barvazi2.pdf' saved [1950230/1950230]
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-21 01:06
(Received via mailing list)
Igor, here's the new strace.
http://pastebin.com/m5129c54b

Does the fact it works in low download rates still makes you think it's
a
kernel issue?

Thank you, really!
And thanks, Kiril and Jeff, for confirming that!

Yo'av.
Ef5effe1ff64e4ab2c0bb98137e99d3c?d=identicon&s=25 Resicow (Guest)
on 2009-01-21 01:27
(Received via mailing list)
I have confirmed this below as well, and yes limiting rate to
3.5Kbit/sec (so SLOW) does work.

I just want to mention, that for the past 9 months I too have received
very *random* reports of clients having their files stop downloading
after 1.5MB from my server.

My server is running on Fedora 9 with 64GB of ram (yes GB) and dual
Intel 5430's, so I would also say that the issue is not a VPS issue.

Confirmed:

wget -S http://www.noal.org.il/static/temp/barvazi2.pdf
--2009-01-20 19:02:06--  http://www.noal.org.il/static/temp/barvazi2.pdf
Resolving www.noal.org.il... 195.28.180.147
Connecting to www.noal.org.il|195.28.180.147|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx/0.7.30
  Date: Wed, 21 Jan 2009 00:01:31 GMT
  Content-Type: application/pdf
  Content-Length: 1950230
  Last-Modified: Thu, 15 Jan 2009 15:55:17 GMT
  Connection: keep-alive
  Expires: Wed, 21 Jan 2009 02:01:31 GMT
  Cache-Control: max-age=7200
  Accept-Ranges: bytes
Length: 1950230 (1.9M) [application/pdf]
Saving to: `barvazi2.pdf.1'

81% [================================================>            ]
1,596,849   --.-K/s  eta 5s
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-21 06:39
(Received via mailing list)
On Wed, Jan 21, 2009 at 01:52:53AM +0200, Yo'av Moshe wrote:

> Igor, here's the new strace.
> http://pastebin.com/m5129c54b
>
> Does the fact it works in low download rates still makes you think it's a
> kernel issue?

Yes, this is seems like a kernel issue. Kernel excatly said (with
EAGAIN)
that there is no place to output data, but never returned a new
notification
via epoll_wait() about free space:

21420 01:39:43 <... epoll_wait resumed> {{EPOLLOUT, {u32=153137960,
u64=13811628390360331048}}}, 512, 60000) = 1
21420 01:39:43 gettimeofday({1232494783, 413764}, NULL) = 0
21420 01:39:43 sendfile64(20, 21, [1645625], 304605) = 40320
21420 01:39:43 sendfile64(20, 21, [1685945], 264285) = -1 EAGAIN
(Resource temporarily unavailable)
21420 01:39:43 epoll_wait(11,  <unfinished ...>

This is probably some race condition inside kernel, that does not appear
on
low speed download.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-23 13:48
(Received via mailing list)
Thanks Igor.
I don't want to be rude, but I find it hard to believe that it's a
kernel
bug.

First, I'm using the default CentOS 5.2 kernel, it's seems to me pretty
odd
that there's such a bug in it. CentOS is being used on a lot of servers,
some of them must be running nginx.

Second, I tried serving the same file with thttpd, and it runs just
fine.
http://195.28.180.147:40/temp/hozer.pdf
Note it's the exact same file which I cannot get from nginx:
http://www.noal.org.il/static/temp/hozer.pdf

How come thttpd can serve the file just fine? If it's a kernel bug, I
thought it would happen with all of my servers.
If it's using some different way of sending the files which does not
depend
on that epoll call - is there any way to configure nginx to use that as
well?

Can I do anything to confirm that it's a kernel bug? If it is, I should
probably contact CentOS's maintainers.

Thank you, again.
Yo'av.
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-23 14:17
(Received via mailing list)
On Fri, Jan 23, 2009 at 02:35:00PM +0200, Yo'av Moshe wrote:

> Note it's the exact same file which I cannot get from nginx:
> http://www.noal.org.il/static/temp/hozer.pdf
>
> How come thttpd can serve the file just fine? If it's a kernel bug, I
> thought it would happen with all of my servers.
> If it's using some different way of sending the files which does not depend
> on that epoll call - is there any way to configure nginx to use that as
> well?

I do not look modern thttpd, but according its change log, it has no
epoll support. You may try different ways in nginx: you need to build it

        --with-rtsig_module
        --with-select_module
        --with-poll_module

and then choose method:

events {
     use  select;
     #use  poll;
     #use  rtsig;
}

Also, epoll has two modes: level triggered (default) and edge
triggered (EPOLLET). nginx uses more effective edge triggered mode,
lighttpd uses level triggered one. The bug may be in ET mode only.

As to probabilty of kernel bugs: I saw them in FreeBSD (kqueue,
sendfile),
Linux (epoll), Solaris (event ports), and MacOSX (kqueue, sendfile).

> Can I do anything to confirm that it's a kernel bug? If it is, I should
> probably contact CentOS's maintainers.

Your straces confirms this: nginx added socket to epoll and did not
deleted it, nevertheless epoll does not send event.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-25 17:39
(Received via mailing list)
Igor, I tried using select instead of poll, but I get the exact same
problem.

Check my strace:
http://pastebin.com/m65056ec0

I get the same thing with rtsig too.

Hints?
Again, file is at http://www.noal.org.il/static/temp/hozer.pdf. Server
is
using select now.

Thanks...
Yo'av.
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-25 18:51
(Received via mailing list)
On Sun, Jan 25, 2009 at 06:26:58PM +0200, Yo'av Moshe wrote:

> using select now.
According strace select() does not return event for socket 19 after
sendfile64() sent 1651005 bytes:

25951 18:14:26 select(20, [7 8 16 19], [19], NULL, {60, 0} <unfinished
...>
25951 18:14:27 <... select resumed> )   = 1 (out [19], left {58,
790000})

25951 18:14:27 gettimeofday({1232900067, 890376}, NULL) = 0
25951 18:14:27 sendfile64(19, 20, [1509465], 440765 <unfinished ...>
25951 18:14:27 <... sendfile64 resumed> ) = 141540
25951 18:14:27 sendfile64(19, 20, [1651005], 299225 <unfinished ...>
25951 18:14:27 <... sendfile64 resumed> ) = -1 EAGAIN (Resource
temporarily unavailable)

25951 18:14:27 select(20, [16 19], [19], NULL, {0, 500000} <unfinished
...>
25951 18:14:28 <... select resumed> )   = 0 (Timeout)

As you have got the same issue with all methods (even with level
triggered
select()), I believe that the bug is in sendfile. You may try to turn
it off even for the single file:

    location = /static/temp/hozer.pdf {
        sendfile off;
    }

In Linux 2.6.23 sendfile() has been rewritten to use splice framework.
The bug may be introduced while rewriting.

BTW, it seems that thttpd does not use sendfile.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-25 20:18
(Received via mailing list)
Sendfile is now off, using select, and still the same...
http://pastebin.com/m2c3ea25b

Yo'av
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-25 20:41
(Received via mailing list)
On Sun, Jan 25, 2009 at 09:04:54PM +0200, Yo'av Moshe wrote:

> Sendfile is now off, using select, and still the same...
> http://pastebin.com/m2c3ea25b

I have no idea, why it may be so.

I see that select() returned event ("out [19]"):

28215 20:50:18 select(20, [10 19], [19], NULL, {0, 500000} <unfinished
...>
28215 20:50:18 <... select resumed> )   = 1 (out [19], left {0, 380000})

However, at some stage select() did not return event, although the event
is still active ("select(20, [10 19], [19], NULL ..."):

28215 20:50:18 gettimeofday({1232909418, 293895}, NULL) = 0
28215 20:50:18 writev(19,
[{"\227\1\2\1\3\1\4\1\5\0\373\0\374\1\230\1\231\1\232\1\233\0\375\0\376\1\6\1\7\1\10\0"...,
9676}], 1) = 9676
28215 20:50:18 pread64(20,
"\24\7\6#\"\'&\0214\22$63\1\6\25\24\26\27\02632654&#\"\6\3\226\204"...,
32768, 1605632) = 32768
28215 20:50:18 writev(19,
[{"\24\7\6#\"\'&\0214\22$63\1\6\25\24\26\27\02632654&#\"\6\3\226\204"...,
32768}], 1) = 32768
28215 20:50:18 pread64(20, "\0 \0M\0o\0r\0i\0s\0o\0n\0\'\0s\0
\0d\0i\0r\0e\0c"..., 32768, 1638400) = 32768
28215 20:50:18 writev(19, [{"\0 \0M\0o\0r\0i\0s\0o\0n\0\'\0s\0
\0d\0i\0r\0e\0c"..., 32768}], 1) = 1716
28215 20:50:18 select(20, [10 19], [19], NULL, {0, 500000} <unfinished
...>
28215 20:50:18 <... select resumed> )   = 0 (Timeout)

Could you run tcpdump of the session ?
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-26 13:17
(Received via mailing list)
Attachment: nginx.tcpdump.log.gz (30 KB)
Hey, attached is my tcpdump log.
I've never used tcpdump before, so I hope the log is fine.
I used grep to show only the connections from my computer while running
'wget http://www.noal.org.il/static/temp/hozer.pdf'.

Hope this will lead us somewhere...

Yo'av.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-27 09:53
(Received via mailing list)
Note I just compiled Cherokee 0.98. It is serving the file as it should
using sendfile and epoll.
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-27 10:01
(Received via mailing list)
On Tue, Jan 27, 2009 at 10:42:57AM +0200, Yo'av Moshe wrote:

> Note I just compiled Cherokee 0.98. It is serving the file as it should
> using sendfile and epoll.

Have you tried this on the same host ?
Could you post strace ?
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-27 10:13
(Received via mailing list)
On Mon, Jan 26, 2009 at 02:05:28PM +0200, Yo'av Moshe wrote:

> Hey, attached is my tcpdump log.
> I've never used tcpdump before, so I hope the log is fine.
> I used grep to show only the connections from my computer while running
> 'wget http://www.noal.org.il/static/temp/hozer.pdf'.
>
> Hope this will lead us somewhere...

In this tcpdump I see the only strange thing: client acknowledged
data that was ~64K before (1534561), and this may be usual thing,
then server started to retransmit data and client never acknowledged
the data (1535941).

13:53:53.739626 49926 > http: . 125:125(0) ack 1534561 win 65535
13:53:53.739679 http > 49926: . 1596661:1598041(1380) ack 125 win 5840
13:53:53.739865 http > 49926: . 1598041:1599421(1380) ack 125 win 5840
13:53:54.577857 http > 49926: . 1534561:1535941(1380) ack 125 win 5840
13:53:56.256478 http > 49926: . 1534561:1535941(1380) ack 125 win 5840

However, when I have tried to look tcpdump on my side, I saw the my
client sent all acknowledges, but did not get a new data on some point.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-27 10:34
(Received via mailing list)
Sure, here my Cherokee's strace:
http://pastebin.com/m13cd213d

What you are saying about tcpdump's logs is strange, but I have no idea
what's causing it.

Yo'av.
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-27 10:55
(Received via mailing list)
Attachment: patch.nodelay (833 Bytes)
On Tue, Jan 27, 2009 at 11:25:09AM +0200, Yo'av Moshe wrote:

> Sure, here my Cherokee's strace:
> http://pastebin.com/m13cd213d
>
> What you are saying about tcpdump's logs is strange, but I have no idea
> what's causing it.

I see only two difference with nginx:

1) it uses level triggered epoll (nginx uses edge triggered epoll),
2) and it uses TCP_NODELAY for first request (nginx set TCP_NODELAY
   only just before going to keep-alive).

The attached patch sets TCP_NODELAY from very start.
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-27 11:10
(Received via mailing list)
Applied the patch, still the same.
The other only difference I can think about is that I'm running Cherokee
on
a different port. Maybe my VPS is handling them differently.
I'll try testing it on the same port this evening and let you know.

Thank you.

Yo'av
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-27 11:18
(Received via mailing list)
On Tue, Jan 27, 2009 at 11:58:58AM +0200, Yo'av Moshe wrote:

> Applied the patch, still the same.
> The other only difference I can think about is that I'm running Cherokee on
> a different port. Maybe my VPS is handling them differently.
> I'll try testing it on the same port this evening and let you know.

Could you run another nginx on this port ?
Or even the same nginx:

        server {
            listen  80;
            listen  PORT;

            ...
        }
5628a6f4d3ab86b48aef276247790766?d=identicon&s=25 Yo'av Moshe (Guest)
on 2009-01-27 11:35
(Received via mailing list)
Right, it's working on the other port.
I'll check it again with my VPS.

Sorry for all that.

Yo'av.
Fda08117336cfde6562315df04b976e8?d=identicon&s=25 Dave Cheney (Guest)
on 2009-01-27 12:14
(Received via mailing list)
Hi Yo'av,

If you get an answer from your VPS host would you please post it too
the list. There were others who mentioned earlier on in the thread
that they had occasionally experienced your symptoms, your feedback
may help them.

Cheers

Dave
5640e332954fc0006aea97a155ce0afd?d=identicon&s=25 Igor Sysoev (Guest)
on 2009-01-27 12:30
(Received via mailing list)
On Tue, Jan 27, 2009 at 12:24:50PM +0200, Yo'av Moshe wrote:

> Right, it's working on the other port.
> I'll check it again with my VPS.
>
> Sorry for all that.

OK. Please, when issue will be resolved, post the cause in the list.
15af4b7e3a7593e5c4adfd041de4101a?d=identicon&s=25 Jure Pečar (Guest)
on 2009-01-28 23:00
(Received via mailing list)
On Tue, 27 Jan 2009 22:03:43 +1100
Dave Cheney <dave@cheney.net> wrote:

> Hi Yo'av,
>
> If you get an answer from your VPS host would you please post it too
> the list. There were others who mentioned earlier on in the thread
> that they had occasionally experienced your symptoms, your feedback
> may help them.

What I've expirienced on Parallels Virtuozzo VPSes is that if tcp send
and/or receive buffer gets full, stuf like this happens. I was able to
workaround that issue by making those buffers larger.

I never investigated the issue in detail, but looking at Igor's comment
on
the tcp dump, it makes sense that a missing ack for a packet makes the
whole tcp session stall.

If we get some more hard evidence, I can bring this issue to Parallels
support and see what they say.


--

Jure Pečar
http://jure.pecar.org
http://f5j.eu
This topic is locked and can not be replied to.