Websocket tunnel broken with existing SSL session

musicdenotation · January 14, 2014, 10:56pm

We’ve been debugging this issue for 3 days now and even though we have a
temporary fix, we’re still puzzled about it.

There is an iOS app, which opens a websocket connection to our server
over
SSL. Our server runs SmartOS and has nginx 1.5.0 (also happens on 1.4.1)
proxying to a backend server running in NodeJS.

To reproduce, I start my app, a websocket connection is established and
works well, then I put the app to sleep for awhile until nginx kills the
connection. When I reopen the app, the following happens:

App notices that the connection is dead and reconnects.
Behind the scenes, iOS reuses the SSL session from before and quickly
opens a new socket.
A HTTP upgrade request and response flow across with no problems.
With a successful web-socket established on both sides, the client
starts sending frames. However, none of these gets delivered to the
backend
server.
After a minute, nginx kills the connection even though the client is
sending periodic pings.
Back to 1.

I haven’t managed to reduce the test case or reproduce it in another
environment yet. This only happens when using SSL. In wireshark I see
the
websocket frames being sent from the iPhone client and TCP acked
properly.

What currently fixes the problem is to disable SSL session reuse in
nginx.
Then every websocket connection works like it should.

Here is the config before the fix:

server {
    ### Server port and name ###
    listen          80 default_server;
    listen          443 default_server ssl;
    server_name     test.mydomain.com;

    ### SSL cert files ###
    ssl_certificate

/opt/local/etc/nginx/ssl/certificate.crt;
ssl_certificate_key
/opt/local/etc/nginx/ssl/certificate.key;

    ### SSL specific settings ###
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;
    ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers RC4:HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    keepalive_timeout 60;
    client_max_body_size 10m;

    location / {
        access_log off;
        proxy_pass http://localhost:3003;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support (nginx 1.4)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Best regards,
Eirikur Nilsson

EirS_kur_Nilsson · January 14, 2014, 11:54pm

Hello!

On Tue, Jan 14, 2014 at 04:22:44PM +0000, Eiríkur Nilsson wrote:

    ### Server port and name ###

[…]

Which event method is used? If eventport, try switching to
/dev/poll instead (which is expected to be used by default on
SmartOS and other Solaris variants), it should fix the issue. The
eventport event method is known to have problems when proxying and
this may cause symptoms you see, it needs attention.

–
Maxim D.
http://nginx.org/

EirS_kur_Nilsson · January 15, 2014, 1:17pm

Thanks for the suggestion.

We’re not specifying a event method currently, though I can’t see if
eventport was the default. If I explicitly use /dev/poll, and turn
ssl_session_cache back on, the issue comes back.

I have verified that both the HTTP request and response are proxied
properly. It seems to me that when the upgrade is finished nginx enters
direct tunneling mode for the websocket data, which doesn’t work for
some
sockets, at least these recovered SSL sessions from iOS clients.

The event method issue would have explained why I can’t reproduce the
issue
on mac (with self signed cert). I also haven’t reproduced it with an
Android client, although I did not verify with tcpdump if my android
test
reused the SSL session.

Any other ideas?

Eirikur

EirS_kur_Nilsson · January 15, 2014, 5:16pm

Hello!

On Wed, Jan 15, 2014 at 12:16:18PM +0000, Eiríkur Nilsson wrote:

The event method issue would have explained why I can’t reproduce the issue
on mac (with self signed cert). I also haven’t reproduced it with an
Android client, although I did not verify with tcpdump if my android test
reused the SSL session.

Any other ideas?

It might be helpfull to see debug log and a tcpdump. See also
Debugging | NGINX for hints.

–
Maxim D.
http://nginx.org/

EirS_kur_Nilsson · January 15, 2014, 8:31pm

Hey!

On Wed, Jan 15, 2014 at 4:15 PM, Maxim D. [email protected]
wrote:

It might be helpfull to see debug log and a tcpdump. See also
Debugging | NGINX for hints.

Debug log: error
tcpdump: capture (only contains traffic between iOS
and
nginx)

This reproduction has nginx 1.5.8 running with SSL on port 4443 using
/dev/poll. It’s running on a new-ish smartmachine instance from Joyent.
it
gets two websocket connections:

The first at 18:43:52 is a new SSL session and works correctly, with
traffic visible in tcpdump and debug log.
Second at 18:45:30 reuses the other SSL session but doesn’t work,
traffic
can be seen in tcpdump but not in debug log.

I find it very weird that there isn’t a single debug message from nginx
after it switches protocols for the second connection until I stop the
nginx after the test is finished. I’m no closer.

Thanks so much for your interest.

Eirikur

EirS_kur_Nilsson · January 17, 2014, 1:16pm

Hey Maxim

On Thu, Jan 16, 2014 at 10:01 PM, Maxim D. [email protected]
wrote:

Please try the following patch:

Wow, everything seems to work correctly with this patch, session reuse
and
everything. I thought this patch would close the new connection, causing
a
fresh reconnect, but I don’t see that happening in the capture. Is nginx
holding on to - and trying to use - a handle to the old connection,
which
needs cleanup?

Do you think this patch might get applied into the mainline?

Best regards,
Eirikur

EirS_kur_Nilsson · January 16, 2014, 11:01pm

Hello!

On Wed, Jan 15, 2014 at 07:30:20PM +0000, Eiríkur Nilsson wrote:

tcpdump: capture (only contains traffic between iOS and

I find it very weird that there isn’t a single debug message from nginx
after it switches protocols for the second connection until I stop the
nginx after the test is finished. I’m no closer.

Please try the following patch:

— a/src/http/ngx_http_upstream.c
+++ b/src/http/ngx_http_upstream.c
@@ -2565,6 +2565,16 @@ ngx_http_upstream_upgrade(ngx_http_reque
{
ngx_http_upstream_process_upgraded(r, 0, 1);
}
+

if (ngx_handle_read_event(c->read, 0) != NGX_OK) {

   ngx_http_upstream_finalize_request(r, u, NGX_ERROR);

```
   return;
```
}
if (ngx_handle_read_event(u->peer.connection->read, 0) != NGX_OK) {

   ngx_http_upstream_finalize_request(r, u, NGX_ERROR);

```
   return;
```
}
}

–
Maxim D.
http://nginx.org/

EirS_kur_Nilsson · January 17, 2014, 6:02pm

Hello!

On Fri, Jan 17, 2014 at 12:15:25PM +0000, Eiríkur Nilsson wrote:

fresh reconnect, but I don’t see that happening in the capture. Is nginx
holding on to - and trying to use - a handle to the old connection, which
needs cleanup?

The problem is that read event was removed after ssl handshake,
and never added back - so new data on client connection were not
reported by the kernel. The ngx_handle_read_event() call ensures
that an appropriate event is added again.

Do you think this patch might get applied into the mainline?

Yes, after some more testing and review.

–
Maxim D.
http://nginx.org/

EirS_kur_Nilsson · January 17, 2014, 6:24pm

On Fri, Jan 17, 2014 at 5:01 PM, Maxim D. [email protected]
wrote:

The problem is that read event was removed after ssl handshake,
and never added back - so new data on client connection were not
reported by the kernel. The ngx_handle_read_event() call ensures
that an appropriate event is added again.

Do you think this patch might get applied into the mainline?

Yes, after some more testing and review.

Ahh, that makes sense. Silly me and my poor C reading skills. I didn’t
even
look at ngx_handle_read_event since I’m not used to seeing side-effects
in
a conditional statement, but I guess that’s a standard pattern in C to
handle errors.

I only wonder why I couldn’t see anyone else experiencing this, with me
reproducing it on so many versions with a basic configuration. I guess
this
only happens on Solaris?

Thank you so much for your help.