Client abort detection w/ upstream broken on anything but kqueue ()?

Hello,

I’m investigating why client aborts are successfully detected
by nginx when using kqueue () on BSD systems and not when using
poll () on BSD & Linux or epoll () on Linux systems.

I’ve run some debugging sessions and narrowed the problem down to
the ngx_http_upstream_check_broken_connection function, line 917
of http/ngx_http_upstream.c in release 0.8.53.

From my tests on a Darwin 10.4.0 using poll () and on a Linux
2.6.32 using epoll () a client that closes its socket makes the
recv () on line 994 return 1, with errno set to EWOULDBLOCK on
Darwin/poll () and to EAGAIN on Linux/epoll ().

I’ve tried to change the condition at line 1016 from n > 0 to
(n > 0 && err != NGX_EAGAIN && err != NGX_EWOULDBLOCK) but I’m
sure it doesn’t make much sense.

Is this a known issue? From where should I start to solve it?

Thank you,

~Marcello

~ http://www.linkedin.com/in/marcellobarnaba
~ http://sindro.me/

Hello!

On Tue, Nov 02, 2010 at 10:14:34PM +0100, Marcello B. wrote:

> From my tests on a Darwin 10.4.0 using poll () and on a Linux
> 2.6.32 using epoll () a client that closes its socket makes the
> recv () on line 994 return 1, with errno set to EWOULDBLOCK on
> Darwin/poll () and to EAGAIN on Linux/epoll ().

Looking at errno when the function did not return an error is
meaningless[1].

[1] errno
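
The point about errno can be sketched in C as follows. This is a minimal illustration, not nginx code: the names peek_state, PEEK_DATA, PEEK_EOF, PEEK_IDLE and PEEK_ERROR are hypothetical, and MSG_DONTWAIT is added here only so that the sketch does not depend on the socket already being non-blocking. errno is consulted only on the branch where recv() returned -1.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names for illustration; these are not nginx identifiers. */
enum peek_state { PEEK_DATA, PEEK_EOF, PEEK_IDLE, PEEK_ERROR };

/* Classify a socket by peeking one byte without blocking. */
enum peek_state peek_state(int fd)
{
    char    byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n > 0) {
        return PEEK_DATA;    /* success: errno is stale, do not look at it */
    }
    if (n == 0) {
        return PEEK_EOF;     /* orderly shutdown by the peer */
    }
    /* n == -1: only now does errno describe this call */
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return PEEK_IDLE;    /* nothing to read, peer not known to be gone */
    }
    return PEEK_ERROR;
}

/* Small self-check on a local socket pair (a stand-in for a TCP client). */
void peek_state_demo(void)
{
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

    assert(rc == 0);
    (void) rc;
    assert(peek_state(sv[0]) == PEEK_IDLE);   /* no data, peer still open */
    close(sv[1]);
    assert(peek_state(sv[0]) == PEEK_EOF);    /* peer closed, nothing pending */
    close(sv[0]);
}
```

Calling peek_state_demo() from a main() exercises the idle and EOF cases; the PEEK_DATA case is the one discussed in the rest of this message.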

> I’ve tried to change the condition at line 1016 from n > 0 to
> (n > 0 && err != NGX_EAGAIN && err != NGX_EWOULDBLOCK) but I’m
> sure it doesn’t make much sense.
>
> Is this a known issue? From where should I start to solve it?

If recv() returned 1, it means that there is outstanding data on
the connection (e.g. a pipelined request), and it’s impossible to
detect whether the connection was closed without reading this data
or writing something to the connection (or getting some
out-of-band hint from the OS, as kqueue provides).

At the point in question it’s not possible to read all the data or
to write something, so a premature connection close by a client
with outstanding data is basically undetectable with the classic
socket interface.
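
A minimal sketch of this effect, assuming a local AF_UNIX socket pair behaves like the TCP connection for this purpose (the helper names peek_one and closed_with_pending_data are hypothetical, and MSG_DONTWAIT is used only to keep the sketch from blocking): the peer writes a few bytes and closes, and a one-byte MSG_PEEK keeps returning 1 until the pending data has been read.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Peek one byte without consuming it; returns recv()'s result. */
ssize_t peek_one(int fd)
{
    char byte;

    return recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
}

/*
 * Simulate a client that leaves unread data behind and then closes.
 * *before receives the peek result while data is pending,
 * *after receives the peek result once the data has been drained.
 */
void closed_with_pending_data(ssize_t *before, ssize_t *after)
{
    int     sv[2];
    char    buf[16];
    int     rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    ssize_t w;

    assert(rc == 0);
    (void) rc;

    /* "client" side: send 5 bytes that nobody reads, then close */
    w = write(sv[1], "hello", 5);
    assert(w == 5);
    (void) w;
    close(sv[1]);

    *before = peek_one(sv[0]);    /* pending data hides the close */

    /* drain everything that is buffered */
    while (recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT) > 0) {
        /* discard */
    }

    *after = peek_one(sv[0]);     /* only now is the EOF visible */
    close(sv[0]);
}
```

In this sketch the first peek reports data and the second reports end of file, which matches the description above: the close becomes visible only after the outstanding data has been consumed.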

This shouldn’t be a major issue though:

a) normally connections don’t have outstanding data when the user
cancels a request (closes the browser window, hits “stop” and so
on), and the detection works well even without kqueue;

b) this is only an optimization anyway.

Maxim D.

Hello,

On Nov 3, 2010, at 1:11 AM, Maxim D. wrote:

> Looking at errno when the function did not return an error is
> meaningless[1].
>
> [1] errno

Right

> > I’ve tried to change the condition at line 1016 from n > 0 to
> > (n > 0 && err != NGX_EAGAIN && err != NGX_EWOULDBLOCK) but I’m
> > sure it doesn’t make much sense.
> >
> > Is this a known issue? From where should I start to solve it?

> If recv() returned 1, it means that there is outstanding data on
> the connection (e.g. a pipelined request), and it’s impossible to
> detect whether the connection was closed without reading this data
> or writing something to the connection (or getting some
> out-of-band hint from the OS, as kqueue provides).

I’ve tried reading the whole data while in GDB, and this is the
result (Darwin/poll()):

(gdb) l
989
990 #endif
991
992 ngx_debug_point();
993
994 n = recv(c->fd, buf, 1, MSG_PEEK);
995
996 err = ngx_socket_errno;
997
998 ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ev->log, err,

(gdb) p buf = malloc(10000)
$2 = ""

(gdb) p (int)recv(c->fd, buf, 10000, 0)
$3 = 37

(gdb) x/37b buf
0x7fff5fbff06f: 0x15 0x03 0x01 0x00 0x20 0x5f 0x01 0x61
0x7fff5fbff077: 0x9d 0x83 0xa8 0x74 0xaa 0xcc 0xf6 0x78
0x7fff5fbff07f: 0x81 0x42 0x98 0x20 0x08 0xe3 0x66 0x21
0x7fff5fbff087: 0x52 0x3a 0xca 0xe2 0x08 0xac 0x98 0xcf
0x7fff5fbff08f: 0x74 0x5c 0xa4 0x06 0xd5

Does this binary data make any sense?
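
One possible reading, offered as an assumption rather than something the thread confirms: the first five bytes of the dump have the shape of a TLS record header (one byte of content type, two bytes of protocol version, two bytes of big-endian payload length). Under that reading, 0x15 is content type 21 (alert), 0x03 0x01 is TLS 1.0, and 0x00 0x20 is a 32-byte payload, which together with the 5-byte header accounts for the 37 bytes received. Whether SSL was in use on this connection is not stated anywhere in the thread. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the 5-byte header that precedes a TLS record. */
struct tls_record_header {
    uint8_t  content_type;    /* 20 change_cipher_spec, 21 alert,
                                 22 handshake, 23 application_data */
    uint8_t  version_major;
    uint8_t  version_minor;
    uint16_t length;          /* payload bytes following the header */
};

/* Returns 0 on success, -1 if fewer than 5 bytes are available. */
int parse_tls_record_header(const uint8_t *p, size_t len,
                            struct tls_record_header *h)
{
    if (len < 5) {
        return -1;
    }

    h->content_type  = p[0];
    h->version_major = p[1];
    h->version_minor = p[2];
    h->length        = (uint16_t) ((p[3] << 8) | p[4]);   /* big-endian */

    return 0;
}

/* Apply the parser to the first five bytes of the dump shown above. */
void check_dump_header(void)
{
    const uint8_t            head[5] = { 0x15, 0x03, 0x01, 0x00, 0x20 };
    struct tls_record_header h;
    int                      rc = parse_tls_record_header(head, sizeof(head), &h);

    assert(rc == 0);
    (void) rc;
    assert(h.content_type == 21);     /* alert */
    assert(h.length + 5 == 37);       /* header + payload = bytes received */
}
```

If this reading is right, the pending bytes would be an encrypted alert sent by the client just before it closed, which would explain why the one-byte peek returns 1 even though the browser sent no further HTTP data.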

> At the point in question it’s not possible to read all the data or
> to write something, so a premature connection close by a client
> with outstanding data is basically undetectable with the classic
> socket interface.
>
> This shouldn’t be a major issue though:
>
> a) normally connections don’t have outstanding data when the user
> cancels a request (closes the browser window, hits “stop” and so
> on), and the detection works well even without kqueue;
>
> b) this is only an optimization anyway.

In my case, the client connection is initiated by XHR and it’s a
long-polling request towards an Erlang server that implements a
web-based chat system: the recv () return value is very strange,
because there’s really no outstanding data to receive from the
client. From the nginx logs I see that the only data sent by the
browser is forwarded to the upstream.

Moreover, I’m using proxy_read_timeout 300, to reduce the number
of long-polling requests, and I need to know when a connection
is closed to kick off users from chat rooms after a timeout has
passed. But:

  • nginx maintains duplicate open connections to the server even
    when the client has closed the browser window or navigated to
    a page without the client javascript;

  • my ‘client disconnection code’ in the server doesn’t work at all
    until the nginx read timeout expires;

  • javascript-initiated “logoff” commands on e.g. window.onUnload
    aren’t as portable and reliable as (I thought) socket events :)

Thank you for your answer,

~Marcello


Hello!

On Wed, Nov 03, 2010 at 12:21:47PM +0100, Marcello B. wrote:

[…]

> (gdb) x/37b buf
> 0x7fff5fbff06f: 0x15 0x03 0x01 0x00 0x20 0x5f 0x01 0x61
> 0x7fff5fbff077: 0x9d 0x83 0xa8 0x74 0xaa 0xcc 0xf6 0x78
> 0x7fff5fbff07f: 0x81 0x42 0x98 0x20 0x08 0xe3 0x66 0x21
> 0x7fff5fbff087: 0x52 0x3a 0xca 0xe2 0x08 0xac 0x98 0xcf
> 0x7fff5fbff08f: 0x74 0x5c 0xa4 0x06 0xd5
>
> Does this binary data make any sense?

I see nothing familiar.

[…]

> In my case, the client connection is initiated by XHR and it’s a
> long-polling request towards an Erlang server that implements a
> web-based chat system: the recv () return value is very strange,
> because there’s really no outstanding data to receive from the
> client. From the nginx logs I see that the only data sent by the
> browser is forwarded to the upstream.

A debug log, as well as a tcpdump of what actually happens on the
wire and some netstat -An and fstat output, may be helpful (not
sure if netstat/fstat will do the right thing on Darwin, though).

See here for some more details:

Maxim D.