SSL negotiation failures causing 408 error to be returned

While debugging for a client, we found that we’re seeing significant
numbers of 408s being generated for SSL connections. It seems to be
MOSTLY Chrome on a Windows 7 base with some MSIE 9.0 (also on Win 7) as
user agents go, and at this point I’m completely stumped. We’ve tried
disabling session cache, upping timeout values, enabling/disabling
keepalives, nothing seems to stem the steady flow of these. Below is a
sample debug log (I’ve redacted specific pieces for business reasons).
What I see a lot of is 6 simultaneous connections being setup, 2 being
used, and then 4 timing out 60s later. Invariably, if I enable full
debug, I see ‘client timed out (60: Operation timed out) while SSL
handshaking’ which makes no sense. If I turn on logging for session IDs,
I can clearly see the clients being sent 408s are being assigned session
IDs which says to me they’re completing the SSL negotiation and then
just… sitting idle.

2012/04/14 14:51:41 [debug] 99433#0: *3313 accept: fd:100
2012/04/14 14:51:41 [debug] 99433#0: *3313 post event 0000000801A026F0
2012/04/14 14:51:41 [debug] 99433#0: *3313 delete posted event
0000000801A026F0
2012/04/14 14:51:41 [debug] 99433#0: *3313 malloc:
0000000801EEF000:1256
2012/04/14 14:51:41 [debug] 99433#0: *3313 posix_memalign:
0000000801EE8300:256 @16
2012/04/14 14:51:41 [debug] 99433#0: *3313 malloc:
0000000800DE5000:4096
2012/04/14 14:51:41 [debug] 99433#0: *3313 posix_memalign:
0000000800DFE000:4096 @16
2012/04/14 14:51:41 [debug] 99433#0: *3313 http check ssl handshake
2012/04/14 14:51:41 [debug] 99433#0: *3313 https ssl handshake: 0x16
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL server name: “”
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL_do_handshake: -1
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL_get_error: 2
2012/04/14 14:51:41 [debug] 99433#0: *3313 kevent set event: 100: ft:-1
fl:0025
2012/04/14 14:51:41 [debug] 99433#0: *3313 event timer add: 100:
60000:1334440361120
2012/04/14 14:51:41 [debug] 99433#0: *3313 post event 0000000801A026F0
2012/04/14 14:51:41 [debug] 99433#0: *3313 delete posted event
0000000801A026F0
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL handshake handler: 0
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL_do_handshake: 1
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL: TLSv1, cipher: “RC4-SHA
SSLv3 Kx=RSA Au=RSA Enc=RC4(128) Mac=SHA1”
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL reused session
2012/04/14 14:51:41 [debug] 99433#0: *3313 http process request line
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL_read: -1
2012/04/14 14:51:41 [debug] 99433#0: *3313 SSL_get_error: 2
2012/04/14 14:52:41 [debug] 99433#0: *3313 event timer del: 100:
1334440361120
2012/04/14 14:52:41 [debug] 99433#0: *3313 http process request line
2012/04/14 14:52:41 [info] 99433#0: *3313 client timed out (60:
Operation timed out) while SSL handshaking, client: ,
server:
2012/04/14 14:52:41 [debug] 99433#0: *3313 http request count:1 blk:0
2012/04/14 14:52:41 [debug] 99433#0: *3313 http close request
2012/04/14 14:52:41 [debug] 99433#0: *3313 http log handler
2012/04/14 14:52:41 [debug] 99433#0: *3313 free: 0000000800DFE000,
unused: 2426
2012/04/14 14:52:41 [debug] 99433#0: *3313 close http connection: 100
2012/04/14 14:52:41 [debug] 99433#0: *3313 SSL_shutdown: 1
2012/04/14 14:52:41 [debug] 99433#0: *3313 reusable connection: 0
2012/04/14 14:52:41 [debug] 99433#0: *3313 free: 0000000800DE5000
2012/04/14 14:52:41 [debug] 99433#0: *3313 free: 0000000801EEF000
2012/04/14 14:52:41 [debug] 99433#0: *3313 free: 0000000801EE8200,
unused: 8
2012/04/14 14:52:41 [debug] 99433#0: *3313 free: 0000000801EE8300,
unused: 72

The following 5 UAs are by far the most common

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like
Gecko) Chrome/18.0.1025.152 Safari/535.19
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko)
Chrome/18.0.1025.162 Safari/535.19
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko)
Chrome/18.0.1025.162 Safari/535.19
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like
Gecko) Chrome/18.0.1025.162 Safari/535.19
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

This appears to implicate Windows in some way, but I’ve yet to figure it
out and am, quite frankly, stumped.

Posted at Nginx Forum:

I should note, I’ve also tried various revisions of nginx, the debug was
taken using 1.1.19 w/ the following configure options

–with-http_ssl_module --with-http_stub_status_module --with-ipv6
–with-pcre --with-debug --with-openssl=/usr/local/src/openssl-1.0.1

Posted at Nginx Forum:

Hello!

On Sun, Apr 15, 2012 at 11:34:48PM -0400, bpiraeus wrote:

handshaking’ which makes no sense. If I turn on logging for session IDs,
I can clearly see the clients being sent 408s are being assigned session
IDs which says to me they’re completing the SSL negotiation and then
just… sitting idle.

This is because Chrome opens many connections to a server from
very start, trying to reduce latence in case of these connections
will be useful. If they aren’t - they are just kept idle and
either closed by Chrome (in this case you’ll likely see 400 in
logs) or timed out (in this case you’ll see 408).

[…]

2012/04/14 14:52:41 [debug] 99433#0: *3313 http process request line
2012/04/14 14:52:41 [info] 99433#0: *3313 client timed out (60:
Operation timed out) while SSL handshaking, client: ,
server:

The “while SSL handshaking” part of the message is a bit
misleading. Probably something like this will make things a bit
better:

— a/src/http/ngx_http_request.c
+++ b/src/http/ngx_http_request.c
@@ -612,6 +612,8 @@ ngx_http_ssl_handshake_handler(ngx_conne

     c->ssl->no_wait_shutdown = 1;
  •    c->log->action = "reading client request line";
    
  •    c->read->handler = ngx_http_process_request_line;
       /* STUB: epoll edge */ c->write->handler = 
    

ngx_http_empty_handler;

Maxim D.

The problem with that diagnosis, is that in this case, it’s actually
causing failures on the client side. The client in question has a flash
game which pulls some amount of data over https, when these sessions
time out and hand out a 408, it causes the game itself to barf up a
lung. Additionally after pulling 12 hours of logs and doing UA matching,
while Chrome on Win7 base seems to be the major player, we find these as
well:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/5.0; SLCC1;
.NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; yie9)
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; WOW64; Trident/5.0;
yie9)
Mozilla/5.0 (Windows NT 6.0; rv:11.0) Gecko/20100101 Firefox/11.0
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0;
SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media
Center PC 6.0; CPNTDF; .NET4.0C)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; GTB6; SLCC1; .NET CLR
2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618;
WinNT-PAI 26.09.2009; .NET4.0C)
Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0;
NP06)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0;
SlimBrowser)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Trident/5.0; SLCC1;
.NET CLR 2.0.50727; Media Center PC 5.0; MS-RTC EA 2; .NET CLR
3.5.30729; .NET CLR 3.0.30729; .NET4.0C; BRI/2)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64;
Trident/5.0)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:11.0) Gecko/20100101
Firefox/11.0
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_4_11; en)
AppleWebKit/533.19.4 (KHTML, like Gecko) Version/4.1.3 Safari/533.19.4
Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0;
FunWebProducts)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0;
yie9)
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101
Firefox/11.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2
(KHTML, like Gecko) Version/5.0.6 Safari/533.22.3
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; WOW64; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

The interesting bit is that the significant amount of these (when we
strip out Chrome) are MSIE on a Win7 base, which makes me inclined to
suspect something funny with Win 7, but there’s at least 2 instances of
OS X in there, and 2 instances (FF 10.0.02 and FF 11) on NT 5.1 (XP
base)

If we throw out the OS X and XP ones as outliers due to something else
(people monkeying with settings, who knows), that leaves us with a
pretty clear indication that -something- in Win 7 has got some kind of
naughty behavior with SSL, but I’ll be damned if I can come up with
anything. I spent half of last night trawling google for Win 7+SSL
issues and just didn’t find anything of relevance.

Posted at Nginx Forum:

I stumbled across this this morning after talking with some co-workers
about what we’re seeing

http://code.google.com/p/chromium/issues/detail?id=85229

I’m wondering if someone @ MS decided IE9 would benefit from the same
speculative pre-connect sockets. If we strip out anything that’s not
IE9/Chrome we’re left with a pretty small subset of ‘other browsers’
which I’m leaning towards considering outliers just by virtue of being <
.5% of the overall error count.

Posted at Nginx Forum:

Maxim D. Wrote:

Hello!

Similar errors might as well be a result of e.g.
network connectivity
problems. It’s hard to say anything without
detailed inspection
of a failure on the client side.

We’ve been able to replicate once in a VM running Win 7 w/ Chrome.

There is at least several Vista user-agents in
your list (Windows
NT 6.0), and I would rather suggest to compare the
distribution to
your site’s OS/UA stats before concluding
anything. And you need
to trace real client failures, not 408 errors in
general.

We’ve done numerous packet traces, at this point we’re fairly well
stumped and are leaning towards it being some strange interaction with
SSL+Flash. The 408s, while not specifically problem, may indirectly
contribute to some behavioral issue that we’re still trying to track
down. I would be hard pressed to put this on Nginx, I was, ultimately,
hoping someone had perhaps seen this behvior previously and had some
insight into the issue.

Posted at Nginx Forum:

Hello!

On Mon, Apr 16, 2012 at 10:43:03AM -0400, bpiraeus wrote:

The problem with that diagnosis, is that in this case, it’s actually
causing failures on the client side. The client in question has a flash
game which pulls some amount of data over https, when these sessions
time out and hand out a 408, it causes the game itself to barf up a
lung. Additionally after pulling 12 hours of logs and doing UA matching,
while Chrome on Win7 base seems to be the major player, we find these as
well:

Similar errors might as well be a result of e.g. network connectivity
problems. It’s hard to say anything without detailed inspection
of a failure on the client side.

[…]

The interesting bit is that the significant amount of these (when we
strip out Chrome) are MSIE on a Win7 base, which makes me inclined to
suspect something funny with Win 7, but there’s at least 2 instances of
OS X in there, and 2 instances (FF 10.0.02 and FF 11) on NT 5.1 (XP
base)

There is at least several Vista user-agents in your list (Windows
NT 6.0), and I would rather suggest to compare the distribution to
your site’s OS/UA stats before concluding anything. And you need
to trace real client failures, not 408 errors in general.

Maxim D.