Intermittent SSL Handshake Errors

addis_a · January 31, 2015, 7:06pm

Hi,

We are using round-robin DNS to distribute requests to three servers
all running identically configured nginx. Connections then go upstream
to HAProxy and then to our Rails app.

About two weeks ago, users began to experience intermittent SSL
handshake errors. Users reported that these appeared as
“ssl_error_no_cypher_overlap” in the browser. Most of our reports have
come from Firefox users, although we have seen reports from Safari and
stock Android browser users as well. In our nginx error logs, we began
to see consistent errors across all three servers. They started at
around the same time and no recent modifications were made to hardware
or software:

…
2015/01/13 12:22:59 [crit] 11871#0: 140260577 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/13 12:23:09 [crit] 11874#0: 140266246 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/13 12:23:54 [crit] 11862#0: 140293705 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/13 12:23:54 [crit] 11862#0: 140293708 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/13 12:25:18 [crit] 11870#0: 140342155 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ...*, server: 0.0.0.0:443
…

Suspecting that this may be related to our SSL configuration in nginx
and a recent update to a major browser, I decided to get us up to
date. Previously we were on CentOS5 and could only use an older
version of OpenSSL with the latest security patches. This meant we
could only support TLSv1.0 and a few of the secure recommended
ciphers. After upgrading to CentOS6 and implementing Mozilla’s
recommended configurations for TLSv1.0, TLSv1.1, and TLSv1.2 support,
I am confident that we are following best practices for SSL browser
compatibility and security. Unfortunately this did not fix the issue.
Users began to report a new error in their browser:
“ssl_error_inappropriate_fallback_alert”, and this is currently
reflected in our nginx error logs across all three servers:

…

2015/01/31 03:24:33 [crit] 30658#0: 57298755 SSL_do_handshake()
failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/31 03:24:35 [crit] 30661#0: 57299105 SSL_do_handshake()
failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/31 03:24:41 [crit] 30657#0: 57300774 SSL_do_handshake()
failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/31 03:24:41 [crit] 30657#0: 57300783 SSL_do_handshake()
failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking, client: ..., server: 0.0.0.0:443
2015/01/31 03:24:41 [crit] 30661#0: 57300785 SSL_do_handshake()
failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking, client: ...*, server: 0.0.0.0:443
…
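
To see how the two failure modes are distributed over time or across servers, the entries can be tallied by OpenSSL error code and reason. The following is a minimal sketch, not part of the original setup: the function name and the last sample line are hypothetical, and the other sample entries are the excerpts above joined onto one line each, which is how nginx writes them.

```python
import re
from collections import Counter

# Pulls the OpenSSL error code, function, and reason out of an nginx
# error-log entry such as:
#   ... SSL_do_handshake() failed (SSL: error:140A1175:SSL
#   routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while ...
SSL_ERROR = re.compile(
    r"SSL_do_handshake\(\) failed \(SSL: error:(?P<code>[0-9A-Fa-f]+):"
    r"SSL routines:(?P<func>\w+):(?P<reason>[^)]+)\)"
)


def tally_handshake_errors(lines):
    """Count handshake failures per (error code, reason) pair."""
    counts = Counter()
    for line in lines:
        match = SSL_ERROR.search(line)
        if match:
            counts[(match.group("code"), match.group("reason"))] += 1
    return counts


# Entries taken from the excerpts above (unwrapped), plus one
# hypothetical unrelated line that the pattern should ignore.
sample_log = [
    "2015/01/13 12:22:59 [crit] 11871#0: 140260577 SSL_do_handshake() "
    "failed (SSL: error:1408A0D7:SSL routines:SSL3_GET_CLIENT_HELLO:"
    "required cipher missing) while SSL handshaking, client: ..., "
    "server: 0.0.0.0:443",
    "2015/01/31 03:24:33 [crit] 30658#0: 57298755 SSL_do_handshake() "
    "failed (SSL: error:140A1175:SSL routines:SSL_BYTES_TO_CIPHER_LIST:"
    "inappropriate fallback) while SSL handshaking, client: ..., "
    "server: 0.0.0.0:443",
    "2015/01/31 03:24:35 [crit] 30661#0: 57299105 SSL_do_handshake() "
    "failed (SSL: error:140A1175:SSL routines:SSL_BYTES_TO_CIPHER_LIST:"
    "inappropriate fallback) while SSL handshaking, client: ..., "
    "server: 0.0.0.0:443",
    "2015/01/31 03:24:50 [error] 30657#0: hypothetical unrelated entry",
]

counts = tally_handshake_errors(sample_log)
for (code, reason), count in counts.most_common():
    print(f"{count:6d}  {code}  {reason}")
```

To run it against a real log, pass an open file instead of the sample list, for example `tally_handshake_errors(open("/var/log/nginx/error.log"))`.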

Thinking that I had ruled out a faulty SSL stack or nginx
configuration, I focused on monitoring the network connections on
these servers. ESTABLISHED connections are currently at 13k and
TIME_WAIT is at 94k on one server, if that gives any indication of the
type of connections we are dealing with. The other two have very
similar stats. This is typical for peak hours of traffic. I tried
tuning kernel params: lowering tcp_fin_timeout, increasing
tcp_max_syn_backlog, increasing the range of ip_local_port_range,
turning on tcp_tw_reuse, and other popular tuning practices. Nothing
has helped so far and more users continue to contact us about issues
using our site.
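
For reference, connection counts like the ESTABLISHED and TIME_WAIT figures above can be collected by counting the State column of `netstat -tan` output. This is only a sketch: the function name is hypothetical and the sample addresses come from documentation ranges, not from these servers.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP sockets per state from `netstat -tan` text output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like:
        #   tcp  0  0  local:port  remote:port  STATE
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


# Hypothetical sample in the Linux `netstat -tan` layout.
sample_output = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:443          198.51.100.7:51234      ESTABLISHED
tcp        0      0 192.0.2.10:443          198.51.100.8:40022      TIME_WAIT
tcp        0      0 192.0.2.10:443          203.0.113.9:60311       TIME_WAIT
"""

states = count_tcp_states(sample_output)
for state, count in states.most_common():
    print(f"{state:12s} {count}")
```

Sampling this periodically on each server would show whether the ratio of TIME_WAIT to ESTABLISHED shifts when the handshake errors spike.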

I’ve exhausted my ideas and I’m not quite sure what’s gone wrong. I
would be extremely appreciative of any guidance list members could
provide. Below are more technical details about our installation and
configuration of nginx.

nginx -V output:

nginx version: nginx/1.6.2
built by gcc 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx
--conf-path=/etc/nginx/nginx.conf
--error-log-path=/var/log/nginx/error.log
--http-log-path=/var/log/nginx/access.log
--pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock
--http-client-body-temp-path=/var/cache/nginx/client_temp
--http-proxy-temp-path=/var/cache/nginx/proxy_temp
--http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp
--http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp
--http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx
--group=nginx --with-http_ssl_module --with-http_realip_module
--with-http_addition_module --with-http_sub_module
--with-http_dav_module --with-http_flv_module --with-http_mp4_module
--with-http_gunzip_module --with-http_gzip_static_module
--with-http_random_index_module --with-http_secure_link_module
--with-http_stub_status_module --with-http_auth_request_module
--with-mail --with-mail_ssl_module --with-file-aio --with-ipv6
--with-http_spdy_module --with-cc-opt='-O2 -g -pipe
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m64 -mtune=generic'

nginx config files:

--- /etc/nginx/nginx.conf ---
user nginx;
worker_processes 12;

error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;

events {
worker_connections 50000;
}

http {
include /etc/nginx/mime.types;
default_type application/octet-stream;

log_format with_cookie '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" "$cookie_FL"';

access_log /var/log/nginx/access.log;

sendfile on;
tcp_nopush on;
tcp_nodelay on;

keepalive_timeout 65;

gzip on;
gzip_http_version 1.0;
gzip_comp_level 2;
gzip_proxied any;
gzip_types text/plain text/html text/css application/x-javascript
text/xml application/xml application/xml+rss text/javascript
application/json;
gzip_vary on;

server_names_hash_bucket_size 64;

set_real_ip_from ...;
real_ip_header X-Forwarded-For;

include /etc/nginx/upstreams.conf;
include /etc/nginx/sites-enabled/*;
}

--- /etc/nginx/sites-enabled/fl-ssl.conf ---

server {
root /var/www/fl/current/public;

listen 443;
ssl on;
ssl_certificate /etc/nginx/ssl/wildcard.fl.pem;
ssl_certificate_key /etc/nginx/ssl/wildcard.fl.key;
ssl_session_timeout 5m;
ssl_session_cache shared:SSL:50m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers
'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';
ssl_prefer_server_ciphers on;

server_name **********.com;

access_log /var/log/nginx/fl.ssl.access.log with_cookie;
client_max_body_size 400M;
index index.html index.htm;

if (-f $document_root/system/maintenance.html) {
return 503;
}

# Google Analytics

if ($request_filename ~* ga.js$) {
rewrite .* http://www.google-analytics.com/ga.js permanent;
break;
}

if ($request_filename ~* /adgear.js/current/adgear_standard.js) {
rewrite .* http://**********.com/adgear/adgear_standard.js
permanent;
break;
}

if ($request_filename ~* /adgear.js/current/adgear.js) {
rewrite .* http://**********.com/adgear/adgear_standard.js
permanent;
break;
}

if ($request_filename ~* __utm.gif$) {
rewrite .* http://www.google-analytics.com/__utm.gif permanent;
break;
}

if ($host ~* "www") {
rewrite ^(.)$ http://********.com$1 permanent;
break;
}

location / {
location ~* .(eot|ttf|woff)$ {
add_header Access-Control-Allow-Origin *;
}

if ($request_uri ~* ".(ico|css|js|gif|jpe?g|png)\?[0-9]+$") {
  expires max;
  break;
}

# needed to forward user's IP address to rails
proxy_set_header  X-Real-IP  $remote_addr;

# needed for HTTPS
proxy_set_header  X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header Host $http_host;
proxy_set_header  X-FORWARDED_PROTO https;
proxy_redirect off;
proxy_max_temp_file_size 0;

if ($request_uri ~* /polling) {
  proxy_pass http://ssl_polling_upstream;
  break;
}

if ($request_uri = /upload) {
  proxy_pass http://rest_stop_upstream;
  break;
}

if ($request_uri = /crossdomain.xml) {
  proxy_pass http://rest_stop_upstream;
  break;
}

if (-f $request_filename/index.html) {
  rewrite (.*) $1/index.html break;
}

# Rails 3 is for old testing stuff... We don't need this anymore
#if ($http_cookie ~ "rails3=true") {
#  set $request_type '3';
#}

if ($request_uri ~* /polling) {
  set $request_type '${request_type}P';
}


if ($request_type = '3P') {
  proxy_pass http://rails3_upstream;
  break;
}

if ($request_uri ~* /polling) {
  set $request_type '${request_type}P';
}

if ($request_type = '3P') {
  proxy_pass http://rails3_upstream;
  break;
}

if ($request_type = 'P') {
  proxy_pass http://ssl_polling_upstream;
  break;
}

if (!-f $request_filename) {
  set $request_type '${request_type}D';
}

if ($request_type = 'D') {
  proxy_pass http://ssl_fl_upstream;
  break;
}

if ($request_type = '3D') {
  proxy_pass http://rails3_upstream;
  break;
}

}

error_page 500 502 503 504 /50x.html;
location = /50x.html {
root html;
}
}

erubin · January 31, 2015, 8:02pm

…
2015/01/13 12:22:59 [crit] 11871#0: 140260577 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ...*, server: 0.0.0.0:443

According to the OpenSSL code, this occurs when a client attempts to
resume a session that had made use of previously-enabled ciphers. If
you’re changing your allowed ciphers frequently, this could be why;
otherwise, a full restart of nginx to empty out the session cache seems
like it should resolve this.

erubin · January 31, 2015, 9:25pm

Hi

On Jan 31, 2015 at 20:02, “Richard S.” [email protected] wrote:

…
2015/01/13 12:22:59 [crit] 11871#0: 140260577 SSL_do_handshake()
failed (SSL: error:1408A0D7:SSL
routines:SSL3_GET_CLIENT_HELLO:required cipher missing) while SSL
handshaking, client: ...*, server: 0.0.0.0:443

According to the OpenSSL code, this occurs when a client attempts to
resume a session that had made use of previously-enabled ciphers. If
you’re changing your allowed ciphers frequently, this could be why;
otherwise, a full restart of nginx to empty out the session cache seems
like it should resolve this.

Reading Richard’s reply: maybe the client is trying to resume the
session on a different server? (If you can, check the logs to see which
server the client was on before the error.)

erubin · February 2, 2015, 8:57pm

Prior to this issue starting, we had not changed our ciphers in several
months. I have tried changing them once since. We have also tried
restarting nginx several times on each server to clear the cache, but
it has not helped.

Posted at Nginx Forum:

erubin · February 2, 2015, 9:26pm

My first question is do these
I have been fighting similar SSL handshake errors for the past few
days. After reboots and upgrades for GHOST, we started seeing errors
like this in our error logs constantly:

*579 SSL_do_handshake() failed (SSL: error:140A1175:SSL
routines:SSL_BYTES_TO_CIPHER_LIST:inappropriate fallback) while SSL
handshaking,

in conjunction with an elevated error rate in client requests to nginx
in the initial connection phase. To be honest, I’m not completely sure
the two issues are correlated; I’m still troubleshooting.

I am on a Debian Wheezy system. The errors started with the libssl
package 1.0.1e-2+deb7u13 and continue with u14. As soon as I rolled
libssl back to u12 and restarted nginx, the errors stopped being logged.
I then tested SSL to make sure we weren’t vulnerable to POODLE or
Heartbleed, and it’s all clear. I would recommend going back a few
versions in libssl, restarting nginx, and seeing if that helps, while
making sure you’re not leaving yourself open to the major
vulnerabilities.

erubin · February 3, 2015, 8:05pm

Eric,
Did you try to downgrade your libssl to the previous version I mentioned
earlier? Would love to hear if your issues go away.

erubin · February 3, 2015, 7:19pm

I just finished running an experiment that has shed some light on the
issue. It has not yet been solved, though.

I set up another nginx server with the same configuration and an
upstream app that always responds with HTTP 200. I included JS on each
page load in production to make a single request to this server.

I ran tcpdump on the test server and what I found was very interesting.
Client connections producing the above “inappropriate fallback” error
on the test server all appear to do some form of the following:

(Client and Server successfully complete 3-way handshake)
Client: Client Hello TLSv1.2
Server: RST
Client: ACK
Server: RST
(Client and Server successfully complete 3-way handshake)
Client: Client Hello TLSv1.1
Server: RST
Client: ACK
Server: RST
(Client and Server successfully complete 3-way handshake)
Client: Client Hello TLSv1.0
Server: Encrypted Alert (Content Type: Alert (21))
(Client sends RST, which the server acknowledges, and the connection
ends)

I don’t know what the alert is, but I can only assume it’s related to
TLS_FALLBACK_SCSV since the client closes the connection right after.

What’s interesting here is that there is little consistency to these
RSTs. Sometimes a client downgrades to TLSv1.1 before getting the
Encrypted Alert (Content Type: Alert (21)). Sometimes a client tries
the same version over and over again, each time getting an RST from the
server, and eventually gives up. Later, many of these IP addresses are
observed establishing successful connections.
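
The two patterns described above (stepping down through versions versus retrying the same version) can be separated mechanically once the offered version of each Client Hello has been pulled out of the capture. The sketch below assumes that extraction has already been done by some other tool; the function name, the input tuples, and the addresses are hypothetical.

```python
from collections import defaultdict

# Rank of each protocol version, lowest first.
TLS_RANK = {"TLSv1.0": 0, "TLSv1.1": 1, "TLSv1.2": 2}


def find_fallback_clients(hellos):
    """Return {client: [versions]} for clients whose offered version dropped.

    `hellos` is an iterable of (client_ip, offered_version) tuples in
    the order the Client Hello messages were seen on the wire.
    """
    by_client = defaultdict(list)
    for client, version in hellos:
        by_client[client].append(version)

    fallbacks = {}
    for client, versions in by_client.items():
        ranks = [TLS_RANK[v] for v in versions]
        # A fallback is any later hello offering a lower version
        # than the hello immediately before it.
        if any(later < earlier for earlier, later in zip(ranks, ranks[1:])):
            fallbacks[client] = versions
    return fallbacks


# Hypothetical observations: the first client steps down after each
# reset, the second keeps retrying the same version.
observed = [
    ("198.51.100.7", "TLSv1.2"),
    ("203.0.113.9", "TLSv1.2"),
    ("198.51.100.7", "TLSv1.1"),
    ("203.0.113.9", "TLSv1.2"),
    ("198.51.100.7", "TLSv1.0"),
]

fallbacks = find_fallback_clients(observed)
for client, versions in fallbacks.items():
    print(client, " -> ".join(versions))
```

Clients that only ever retry the same version never appear in the result, which matches the observation that not every reset leads to an “inappropriate fallback” entry.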

Am I correct to assume Nginx is sending these RST packets?

erubin · February 3, 2015, 9:41pm

(Client and Server successfully complete 3-way handshake)
Client: Client Hello TLSv1.0
Server: Encrypted Alert (Content Type: Alert (21))
(Client sends RST, which the server acknowledges, and the connection ends)

Can you reliably reproduce this with specific client software or
networks? Can you upload a pcap file of this failed handshake somewhere
for further inspection?

erubin · February 4, 2015, 3:43am

The errors went away, and now the only errors I see in our logs relating
to SSL are handshake timeouts when I turn debug logs on.

Now that I think about it, though, isn’t this to be expected? The errors
immediately went away as soon as I downgraded far enough back to a
version of OpenSSL that didn’t support TLS_FALLBACK_SCSV. That doesn’t
address why the connections are getting reset and clients are
downgrading in the first place, though.

erubin · February 6, 2015, 7:50pm

We’ve been unable to reproduce it with any one browser or IP address. It
really is very intermittent. Fortunately, I believe we’ve gotten to the
bottom of this. It looks like our data center switched us over to an
anti-DDoS route. This means all of our traffic has been passing through
hardware that performs heavy packet filtering. The resulting packet loss
was causing a lot of confusion for both server and clients. The TLS
version fallback that some browsers do upon an unsuccessful handshake
made it all the more confusing, since these errors get logged as SSL
errors in nginx logs.

erubin · February 4, 2015, 3:48am

You are absolutely correct, but I figured you would want a working
environment while we work with nginx/openssl on figuring out how to fix
this bug. Knowing that it worked for you also increases my own comfort
that the issue is mitigated on my side and I won’t have performance
issues at my next peak time.

Thank you so much for the pcap stuff, I’m sure the information you will
provide to Lukas will be invaluable! Way to lead the charge!

erubin · February 7, 2015, 12:30am

We’ve been unable to reproduce it with any one browser or IP address. It
really is very intermittent. Fortunately, I believe we’ve gotten to the
bottom of this. It looks like our data center switched us over to anti-DDoS
route. This means all of our traffic has been passing through hardware that
performs heavy packet filtering. The packet loss was causing a lot of
confusion for both server and clients. The TLS version fallback that some
browsers do upon an unsuccessful handshake made it all the more confusing,
since these errors get logged as SSL errors in nginx logs.

So a MITM security device basically did a TLS downgrade attack here,
which the new fallback protection (TLS_FALLBACK_SCSV) successfully
prevented.

That’s a good thing; it means it works.
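
For readers unfamiliar with the mechanism, the server-side rule from RFC 7507 can be modelled in a few lines. This is a simplified illustration of the rule in the RFC, not OpenSSL's or nginx's actual code; the function name and the example cipher suite lists are hypothetical.

```python
# Values defined by RFC 7507.
TLS_FALLBACK_SCSV = 0x5600      # signalling cipher suite value
INAPPROPRIATE_FALLBACK = 86     # fatal alert description

# Protocol versions as (major, minor) wire values.
TLS_1_0 = (3, 1)
TLS_1_1 = (3, 2)
TLS_1_2 = (3, 3)


def check_client_hello(client_version, cipher_suites, server_max_version):
    """Return the alert to send, or None if the hello may proceed.

    A client adds TLS_FALLBACK_SCSV only when retrying with a lower
    version after an earlier attempt failed. If the server supports
    something newer than what the client now offers, the earlier
    failure was not a real version mismatch, so the server aborts.
    """
    if TLS_FALLBACK_SCSV in cipher_suites and client_version < server_max_version:
        return INAPPROPRIATE_FALLBACK
    return None


# First attempt at TLS 1.2: no SCSV, proceeds normally.
first_try = check_client_hello(TLS_1_2, [0xC02F, 0x002F], TLS_1_2)

# Retry at TLS 1.0 after the first attempt was dropped in transit:
# the client marks it as a fallback and the server refuses it.
retry = check_client_hello(TLS_1_0, [0x002F, TLS_FALLBACK_SCSV], TLS_1_2)

# A genuinely old client that only speaks TLS 1.0 sends no SCSV.
old_client = check_client_hello(TLS_1_0, [0x002F], TLS_1_2)

print(first_try, retry, old_client)
```

This is why dropped packets on the path show up as “inappropriate fallback” in the nginx log: the first handshake never completes, the browser retries at a lower version with the SCSV set, and the server, which does support the higher version, rejects the retry.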

erubin · March 20, 2015, 7:17pm

I had to start looking at this issue again now that there is yet
another OpenSSL security issue. Now that I know I can go back to a
working setup just by downgrading libssl, I am able to gather more
information.

This morning, I updated the libssl libraries and restarted nginx, and
the errors started flooding back. This time, I took a packet capture to
see what was happening and what I could correlate. I run a set of
servers that handle API requests from a mobile phone application, and
every single client that produced this error was running iOS.

In the packet capture, we offer the same cipher that the clients always
use without a problem, but for some reason, some (not all) of our iPhone
clients have issues. I have been unable to discern a pattern, but it’s
always iPhones and doesn’t seem to have anything to do with the device
model or the OS version. I haven’t found a single Android device among
the IPs that show up in our error logs, and we have slightly more
Android devices than iOS devices.

We get the Client Hello which has a list of 37 potential ciphers for TLS
1.2. We send the server hello and offer the normal cipher. The client,
instead of continuing on, immediately sends a FIN, ACK. It then tries to
connect again over TLS 1.0, gives the client hello, we send the ACK and
almost immediately, WE send a FIN, ACK to the client.

Since it’s an API and there are multiple requests being made from the
client, not every one will fail. Some negotiate SSL just fine, others
do not.

I’m still digging through the packet captures to try and figure out any
other patterns.

As soon as I downgrade libssl, everything works fine.

erubin · March 21, 2015, 3:54pm

Hello!

On Fri, Mar 20, 2015 at 02:15:42PM -0400, tempspace wrote:

In the packet capture, we offer the same cipher that the clients always use
connect again over TLS 1.0, gives the client hello, we send the ACK and
almost immediately, WE send a FIN, ACK to the client.

So it looks like the fallback prevention part works as it should:
the inappropriate fallback is prevented. The question now is why
fallback happens at all, that is, why the client sends a FIN. It
might be some specific cipher which causes the problem - you may
try switching ssl_prefer_server_ciphers to off (the default) to
see if it helps, and/or playing with ciphers supported (again,
default will be a good starting point).

--
Maxim D.
http://nginx.org/

erubin · March 20, 2015, 6:58pm

I am seeing a similar error as well. It is showing up for a lot of
people, and I am not sure why it is happening or whether the clients
hitting the error are actually able to browse the website. Can someone
please help me understand whether it is safe to downgrade to the
earlier version of libssl, and whether that solves the problem of
clients being unable to connect (if that is what happens) in this case?

erubin · March 21, 2015, 4:52pm

Maxim,
I have been playing with the ciphers as well, and it doesn’t appear to
be cipher related. It happens for every cipher I’ve tried. I tried
turning off ssl_prefer_server_ciphers, and clients negotiate the same
cipher as with it on. I then turned ssl_prefer_server_ciphers back on
and tailed our access logs, which show which cipher was used for the
communication. I then went through cipher by cipher, disabling each
cipher in our config and restarting nginx each time. None of them made
any difference; we’re still seeing lots of fallbacks exclusively from
our iOS clients.

I tried the following ciphers to no avail:

ECDHE-RSA-AES256-SHA384
ECDHE-RSA-AES128-SHA256
ECDHE-RSA-AES256-SHA
ECDHE-RSA-AES128-SHA
DHE-RSA-AES256-SHA256
DHE-RSA-AES256-SHA
DHE-RSA-AES128-SHA256
DHE-RSA-AES128-SHA

erubin · March 21, 2015, 5:00pm

I should specify that I agree with what is happening. We have clients
that are falling back under normal conditions, and the latest libssl,
which implemented fallback prevention for TLS, is stopping them. I have
downgraded our libssl and, looking in my logs, I see plenty of iOS 8
devices that auto-negotiate to TLS 1.2 yet end up with a TLS 1.0
session. When the new libssl is installed, these connections get
blocked.

Is there a way to turn off the fallback prevention for TLS on the server
side while we try to figure out what’s happening?

erubin · March 22, 2015, 2:14am

Hello!

On Sat, Mar 21, 2015 at 11:59:17AM -0400, tempspace wrote:

I should specify that I agree with what is happening. We have clients that
are falling back under normal conditions, and the latest libssl that
implemented fallback prevention for TLS is stopping. I have downgraded our
libssl and I’m looking in my logs, and I see plenty of iOS 8 devices that
auto-negotiate to TLS 1.2 that end up with a TLS 1.0 session. When the new
libssl is installed, these connections get blocked.

Is there a way to turn off the fallback prevention for TLS on the server
side while we try to figure out what’s happening?

Looking through the OpenSSL code, I don’t think it’s possible without
OpenSSL code changes. The changes would be trivial, though.

--
Maxim D.
http://nginx.org/

erubin · March 26, 2015, 7:43pm

That surely helps. So as of now, the only way to resolve the issue is
going back to the u12 version of libssl?

erubin · May 8, 2015, 4:50pm

First off, thanks to all who contributed to this thread. I must admit I
did not understand much of it. However, as someone plagued by this bug
(we have a bunch of cherrypy REST servers talking to iOS and Android
clients and have seen a lot of those fallback errors), I’m a bit at a
loss on how to proceed in the future.

Yes, I have downgraded my libssl to deb7u12. However, I wonder whether
the openssl team, Debian, or anyone else capable of fixing this issue
for good in future openssl releases is aware of what we found here. How
to proceed? Especially in light of a new Debian release (not sure
whether I can downgrade to deb7u12 on jessie…).

Best regards,

Michael Lauer.
