Optimizing NGINX TLS Time To First Byte (TTTFB)


Hi Adam,

FYI: Optimizing NGINX TLS Time To First Byte (TTTFB): https://www.igvita.com/2013/12/16/optimizing-nginx-tls-time-to-first-byte/

We started with a ~1800ms overhead for our TLS connection (nearly 5
extra RTTs); eliminated the extra certificate roundtrip after an nginx
upgrade; cut another RTT by forcing a smaller record size; dropped an
extra RTT from the TLS handshake thanks to TLS False Start. With all
said and done, our TTTFB is down to ~1560ms, which is exactly one
roundtrip higher than a regular HTTP connection. Now we’re talking!

Thanks, this is very helpful. Are you trying to upstream the record size
patch?

What I don’t get from your patch: it seems like you are hardcoding the
buffer to 16384 bytes during the handshake (line 570) and only later use a
1400-byte buffer (via NGX_SSL_BUFSIZE).

Am I misunderstanding the patch/code?

Thanks,

Lukas

On 17 December 2013 08:46, Lukas T. [email protected] wrote:

Hi Adam,

Thanks, this is very helpful. Are you trying to upstream the record size
patch?

What I don’t get from your patch: it seems like you are hardcoding the
buffer to 16384 bytes during the handshake (line 570) and only later use a
1400-byte buffer (via NGX_SSL_BUFSIZE).

Am I misunderstanding the patch/code?

I don’t think Adam wrote the article or patch; Ilya G. did.

J

Hi,

Am I misunderstanding the patch/code?

I don’t think Adam wrote the article or patch; Ilya G. did.

J

Oops, right. Looping in Ilya; perhaps he can comment.

Thanks,

Lukas

  • set maximum record size for application data to 1400 bytes. [1]

[…] Or does setting the buffer size to 1400 “just” reset it from 16KB to
4KB and that’s the improvement you see in your measurement?

Looking at the tcpdump after applying the patch does show ~1400 byte records:
http://cloudshark.org/captures/714cf2e0ca10?filter=tcp.stream%3D%3D2

Although now on closer inspection there seems to be another gotcha in there
that I overlooked: it’s emitting two packets, one is 1389 bytes, and the
second is ~31 extra bytes, which means the actual record is 1429 bytes.
Obviously, this should be a single packet… and 1400 bytes.

[…] latency at the beginning when TCP cwnd is low, and then decrease the
[…] a good compromise and ‘upstream-able’.

If you only distinguish pre and post TLS handshake then you’ll still
(likely) incur the extra RTT on the first app-data record – that’s what
we’re trying to avoid by reducing the default record size. For HTTP
traffic, I think you want 1400-byte records. Once we’re out of slow-start,
you can switch back to a larger record size.
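
To make the pre/post-handshake split concrete, here is a minimal sketch
(illustrative only, not the patch code; the constant names and the helper
are mine):

    #include <openssl/ssl.h>

    #define BUFSIZE_HANDSHAKE  16384  /* full-size records: a long
                                         certificate chain fits in one flight */
    #define BUFSIZE_APP        1400   /* app data: one record per TCP segment */

    /* Pick the output buffer size by handshake state. */
    static size_t
    effective_bufsize(const SSL *ssl)
    {
        return SSL_in_init(ssl) ? BUFSIZE_HANDSHAKE : BUFSIZE_APP;
    }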

Hi!

  • allow the handshake to set/use the maximum 16KB bufsize to avoid extra
    RTTs during tunnel negotiation.

Ok, what I read from the patch and your intent is the same :)

I was confused about the 16KB bufsize for the initial negotiation, but now
I’ve read the bug report [1] and the patch [2] about the extra RTT when
using long certificate chains, and I understand it.

But I don’t really get the openssl documentation about this [3]:

The initial buffer size is DEFAULT_BUFFER_SIZE, currently 4096. Any attempt
to reduce the buffer size below DEFAULT_BUFFER_SIZE is ignored.

In other words this would mean we cannot set the buffer size below 4096,
but you are doing exactly this by setting the buffer size to 1400 bytes.
Also, your measurements indicate success, so it looks like this statement
in the openssl documentation is wrong?

Or does setting the buffer size to 1400 “just” reset it from 16KB to 4KB
and that’s the improvement you see in your measurement?
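
My best guess, for what it’s worth: the BIO_f_buffer limit may simply never
come into play here, because nginx buffers output itself and hands at most
one buffer’s worth of plaintext to each SSL_write() call. A minimal
illustration of that assumption:

    #include <openssl/ssl.h>

    /* Assumption: each SSL_write() of n bytes (n <= 16384) yields one TLS
     * record carrying n bytes of payload, so capping n caps the record
     * size; no OpenSSL-side buffer tuning is involved. */
    static int
    send_one_record(SSL *ssl, const void *buf, int n)
    {
        return SSL_write(ssl, buf, n);  /* n = 1400 -> one ~1400-byte-payload record */
    }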

P.S. (b) would be much better, even if it takes a bit more work.

Well, I’m not sure (b) is so easy: nginx would need to understand whether
there is bulk or interactive traffic, and such heuristics may backfire in
more complex scenarios.

But setting an optimal buffer size for pre- and post-handshake seems to be
a good compromise and ‘upstream-able’.

I suspect that haproxy suffers from the same problem with an extra RTT
when using a small tune.ssl.maxrecord value. I will see if I can reproduce
this.

Thanks for clarifying,

Lukas

[1] https://trac.nginx.org/nginx/ticket/413 (Extra roundtrip during SSL handshake with long certificate chains)
[2] http://hg.nginx.org/nginx/rev/a720f0b0e083
[3] https://www.openssl.org/docs/manmaster/man3/BIO_f_buffer.html

Looking at the tcpdump after applying the patch does show ~1400 byte records:
http://cloudshark.org/captures/714cf2e0ca10?filter=tcp.stream%3D%3D2

Although now on closer inspection there seems to be another gotcha in there
that I overlooked: it’s emitting two packets, one is 1389 bytes, and the
second is ~31 extra bytes, which means the actual record is 1429 bytes.
Obviously, this should be a single packet… and 1400 bytes.

I did some empirical testing and with my configuration (given cipher
size, padding, and all), I came to 1370 bytes as being the optimal size
for avoiding TLS record fragmentation.

If you only distinguish pre and post TLS handshake then you’ll still (likely)
incur the extra RTT on first app-data record – that’s what we’re trying to avoid
by reducing the default record size. For HTTP traffic, I think you want 1400 bytes
records. Once we’re out of slow-start, you can switch back to larger record size.

Maybe I am wrong, but I was of the belief that you should always try to
fit TLS records into individual TCP segments. Hence you should always
try to keep TLS records at ~1400 bytes (or 1370 in my case), no matter
the TCP window.

Hello!

On Tue, Dec 17, 2013 at 04:03:27PM -0800, Ilya G. wrote:

[…]

Although now on closer inspection there seems to be another gotcha in there
that I overlooked: it’s emitting two packets, one is 1389 bytes, and second
is ~31 extra bytes, which means the actual record is 1429 bytes. Obviously,
this should be a single packet… and 1400 bytes.

We’ve discussed this a lot here a while ago, and it turns out that it’s a
very non-trivial task to fill exactly one packet - as space in packets may
vary depending on TCP options used, MTU, tunnels used on the way to the
client, etc.

On the other hand, it looks good enough to have records up to
initial CWND in size without any significant latency changes. And
with IW10 this basically means that anything up to about 14k
should be fine (with RFC3390, something like 4k should be ok).
It also reduces bandwidth costs associated with using multiple
records.
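
(Back-of-envelope, assuming an MSS of ~1460 bytes: IW10 allows 10 segments
in the initial flight, and 10 × 1460 ≈ 14.6k, hence “about 14k”; RFC3390
gives min(4×MSS, max(2×MSS, 4380)) = 4380 bytes, hence “something like 4k”.)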

Just in case, below is a patch to play with SSL buffer size:

# HG changeset patch
# User Maxim D. <[email protected]>
# Date 1387302972 -14400
#      Tue Dec 17 21:56:12 2013 +0400
# Node ID 090a57a2a599049152e87693369b6921efcd6bca
# Parent  e7d1a00f06731d7508ec120c1ac91c337d15c669
SSL: ssl_buffer_size directive.

diff --git a/src/event/ngx_event_openssl.c b/src/event/ngx_event_openssl.c
--- a/src/event/ngx_event_openssl.c
+++ b/src/event/ngx_event_openssl.c
@@ -190,6 +190,8 @@ ngx_ssl_create(ngx_ssl_t *ssl, ngx_uint_
         return NGX_ERROR;
     }

+    ssl->buffer_size = NGX_SSL_BUFSIZE;
+
     /* client side options */

     SSL_CTX_set_options(ssl->ctx, SSL_OP_MICROSOFT_SESS_ID_BUG);
@@ -726,6 +728,7 @@ ngx_ssl_create_connection(ngx_ssl_t *ssl
     }

     sc->buffer = ((flags & NGX_SSL_BUFFER) != 0);
+    sc->buffer_size = ssl->buffer_size;

     sc->connection = SSL_new(ssl->ctx);

@@ -1222,7 +1225,7 @@ ngx_ssl_send_chain(ngx_connection_t *c,
     buf = c->ssl->buf;

     if (buf == NULL) {
-        buf = ngx_create_temp_buf(c->pool, NGX_SSL_BUFSIZE);
+        buf = ngx_create_temp_buf(c->pool, c->ssl->buffer_size);
         if (buf == NULL) {
             return NGX_CHAIN_ERROR;
         }
@@ -1231,14 +1234,14 @@ ngx_ssl_send_chain(ngx_connection_t *c,
     }

     if (buf->start == NULL) {
-        buf->start = ngx_palloc(c->pool, NGX_SSL_BUFSIZE);
+        buf->start = ngx_palloc(c->pool, c->ssl->buffer_size);
         if (buf->start == NULL) {
             return NGX_CHAIN_ERROR;
         }

         buf->pos = buf->start;
         buf->last = buf->start;
-        buf->end = buf->start + NGX_SSL_BUFSIZE;
+        buf->end = buf->start + c->ssl->buffer_size;
     }

     send = buf->last - buf->pos;

diff --git a/src/event/ngx_event_openssl.h b/src/event/ngx_event_openssl.h
--- a/src/event/ngx_event_openssl.h
+++ b/src/event/ngx_event_openssl.h
@@ -29,6 +29,7 @@
 typedef struct {
     SSL_CTX                    *ctx;
     ngx_log_t                  *log;
+    size_t                      buffer_size;
 } ngx_ssl_t;

@@ -37,6 +38,7 @@ typedef struct {
     ngx_int_t                   last;
     ngx_buf_t                  *buf;
+    size_t                      buffer_size;

     ngx_connection_handler_pt   handler;

diff --git a/src/http/modules/ngx_http_ssl_module.c b/src/http/modules/ngx_http_ssl_module.c
--- a/src/http/modules/ngx_http_ssl_module.c
+++ b/src/http/modules/ngx_http_ssl_module.c
@@ -111,6 +111,13 @@ static ngx_command_t  ngx_http_ssl_comma
       offsetof(ngx_http_ssl_srv_conf_t, ciphers),
       NULL },

+    { ngx_string("ssl_buffer_size"),
+      NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_CONF_TAKE1,
+      ngx_conf_set_size_slot,
+      NGX_HTTP_SRV_CONF_OFFSET,
+      offsetof(ngx_http_ssl_srv_conf_t, buffer_size),
+      NULL },
+
     { ngx_string("ssl_verify_client"),
       NGX_HTTP_MAIN_CONF|NGX_HTTP_SRV_CONF|NGX_CONF_TAKE1,
       ngx_conf_set_enum_slot,
@@ -424,6 +431,7 @@ ngx_http_ssl_create_srv_conf(ngx_conf_t
     sscf->enable = NGX_CONF_UNSET;
     sscf->prefer_server_ciphers = NGX_CONF_UNSET;
+    sscf->buffer_size = NGX_CONF_UNSET_SIZE;
     sscf->verify = NGX_CONF_UNSET_UINT;
     sscf->verify_depth = NGX_CONF_UNSET_UINT;
     sscf->builtin_session_cache = NGX_CONF_UNSET;
@@ -465,6 +473,9 @@ ngx_http_ssl_merge_srv_conf(ngx_conf_t *
                          (NGX_CONF_BITMASK_SET|NGX_SSL_SSLv3|NGX_SSL_TLSv1
                           |NGX_SSL_TLSv1_1|NGX_SSL_TLSv1_2));

+    ngx_conf_merge_size_value(conf->buffer_size, prev->buffer_size,
+                              NGX_SSL_BUFSIZE);
+
     ngx_conf_merge_uint_value(conf->verify, prev->verify, 0);
     ngx_conf_merge_uint_value(conf->verify_depth, prev->verify_depth, 1);

@@ -572,6 +583,8 @@ ngx_http_ssl_merge_srv_conf(ngx_conf_t *
         return NGX_CONF_ERROR;
     }

+    conf->ssl.buffer_size = conf->buffer_size;
+
     if (conf->verify) {
         if (conf->client_certificate.len == 0 && conf->verify != 3) {

diff --git a/src/http/modules/ngx_http_ssl_module.h b/src/http/modules/ngx_http_ssl_module.h
--- a/src/http/modules/ngx_http_ssl_module.h
+++ b/src/http/modules/ngx_http_ssl_module.h
@@ -26,6 +26,8 @@ typedef struct {
     ngx_uint_t                      verify;
     ngx_uint_t                      verify_depth;

+    size_t                          buffer_size;
+
     ssize_t                         builtin_session_cache;

     time_t                          session_timeout;
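
With the patch applied, the buffer size becomes a per-server setting, e.g.
(illustrative values; the default remains NGX_SSL_BUFSIZE):

    server {
        listen              443 ssl;
        server_name         example.com;
        ssl_certificate     cert.pem;
        ssl_certificate_key cert.key;
        ssl_buffer_size     4k;    # smaller records for latency-sensitive sites
    }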


Maxim D.
http://nginx.org/

On 2013-12-19 01:04, Ilya G. wrote:

…and we’re looking at ~1360 bytes… Which is close to what you’re seeing in
your testing.

Yes, and I haven’t employed IPv6 yet; hence I could save 20 bytes.

[…] and minimizes impact of packet reordering and packet loss.

I remember reading (I believe it was in your (excellent) book! ;)) that
upon packet loss, the full TLS record has to be retransmitted. Not cool
if the TLS record is large and fragmented. So that’s indeed a good
reason to keep TLS records small and preferably within the size of a TCP
segment.

FWIW, for these exact reasons the Google frontend servers have been using TLS
record = TCP segment for a few years now… So there is good precedent for using
this as a default.

Yeah, about that. Google’s implementation looks very nice. I keep
looking at it in Wireshark and wonder if there is a way that I could
replicate their implementation with my limited knowledge. It probably
requires tuning of the underlying application as well? Google uses a
1470-byte frame size (14-byte header plus 1456-byte payload), with
the TLS record fixed at ~1411 bytes. Not sure if an MTU of 1470 / MSS
of 1430 is at all beneficial for TLS communication.

They optimized the stack to almost always exactly fit a TLS record
into the available space of a TCP segment. If I look at one of my sites,
https://www.zeitgeist.se, with standard MTU/MSS, and the TLS record size
fixed to 1370 bytes + overhead, Nginx would happily use the remaining
space in the TCP segment and add part of a second TLS record to it, the
rest of which then spills into a second TCP segment. I played around
with TCP_CORK (tcp_nopush), but it didn’t seem to make any difference.

That said, small records do incur overhead due to extra framing, plus more CPU
cycles (more MACs and framing processing). So, in some instances, if you’re
delivering large streams (e.g. video), you may want to use larger records…
Exposing record size as a configurable option would address this.

Absolutely. Earlier I said Google uses a 1470-byte frame size, but that
is not true when it comes to, for example, streaming from YouTube. There
they use the standard MTU, and also large, fragmenting TLS records. So
like you said, it’s important to look at the application you’re trying to
optimize. +1 for the configurable TLS record size option. To pick up
on the code Maxim just posted, perhaps the record size could even be
altered dynamically within location blocks (to specify different record
sizes for large and small streams).

On 12/19/13 04:50, Alex wrote:

I remember reading (I believe it was in your (excellent) book! ;)) that
upon packet loss, the full TLS record has to be retransmitted. Not cool
if the TLS record is large and fragmented. So that’s indeed a good
reason to keep TLS records small and preferably within the size of a TCP
segment.

Why isn’t a TCP retransmit of the single lost packet enough (in the kernel
TCP stack, which is unaware of TLS records)? The kernel on the receiver
side should wait for the lost packet to be retransmitted and return data
to the application in the same order it was sent.

A big TLS record can add some delay to the first byte (but not to the last
byte) of the decrypted page, but the browser can’t render from the first
byte of the page anyway; it needs at least some data.

On Tue, Dec 17, 2013 at 7:59 PM, Alex [email protected] wrote:

I did some empirical testing and with my configuration (given cipher
size, padding, and all), I came to 1370 bytes as being the optimal size
for avoiding TLS record fragmentation.

Ah, right, we’re not setting the “total” record size… Rather, we’re
setting the maximum payload size within the record. On top of that there
is the extra 5 bytes for the record header, plus MAC and padding (if a
block cipher is used) – so that’s 5 bytes + up to 32 extra bytes per
record. Add IP (40 bytes for IPv6), TCP header (20), and some room for
TCP options (40), and we’re looking at ~1360 bytes… which is close to
what you’re seeing in your testing.
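
Putting numbers on that (worst case, IPv6 plus a full set of TCP options):

    1500 (Ethernet MTU) - 40 (IPv6) - 20 (TCP) - 40 (TCP options) = 1400 bytes of TCP payload
    1400 - 5 (record header) - 32 (MAC + padding)                 = 1363 bytes of plaintext per record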

For interactive traffic I think that’s generally true, as it eliminates
the edge case of CWND overflows (extra RTT of buffering) and minimizes
the impact of packet reordering and packet loss. FWIW, for these exact
reasons the Google frontend servers have been using TLS record = TCP
segment for a few years now… So there is good precedent for using this
as a default.

That said, small records do incur overhead due to extra framing, plus
more CPU cycles (more MACs and framing processing). So, in some instances,
if you’re delivering large streams (e.g. video), you may want to use
larger records… Exposing record size as a configurable option would
address this.

On Wed, Dec 18, 2013 at 8:38 AM, Maxim D. [email protected]
wrote:

[…] it turns out that it’s a very non-trivial task to fill exactly one
packet - as space in packets may vary depending on TCP options used, MTU,
tunnels used on the way to the client, etc.

Yes, that’s a good point.

On the other hand, it looks good enough to have records up to
initial CWND in size without any significant latency changes. And
with IW10 this basically means that anything up to about 14k
should be fine (with RFC3390, something like 4k should be ok).
It also reduces bandwidth costs associated with using multiple
records.

In theory, I agree with you, but in practice even while trying to play
with this on my own server it appears to be more tricky than that: to
~reliably avoid the CWND overflow I have to set the record size <10k…
There are also differences in how the CWND is increased (byte based vs
packet based) across different platforms, and other edge cases I’m surely
overlooking. Also, while this addresses the CWND overflow during
slow-start, smaller records offer additional benefits as they help
minimize the impact of reordering and packet loss (not eliminate it, but
reduce its negative impact in some cases).

Just in case, below is a patch to play with SSL buffer size:

# HG changeset patch
# User Maxim D. <[email protected]>
# Date 1387302972 -14400
#      Tue Dec 17 21:56:12 2013 +0400
# Node ID 090a57a2a599049152e87693369b6921efcd6bca
# Parent  e7d1a00f06731d7508ec120c1ac91c337d15c669
SSL: ssl_buffer_size directive.

Just tried it on my local server, works as advertised. :)

Defaults matter and we should optimize for best performance out of the
box… Can we update NGX_SSL_BUFSIZE as part of this patch? My current
suggestion is 1360 bytes, as this guarantees the best possible case for
helping the browser start processing data as soon as possible: minimal
impact of reordering / packet loss / no CWND overflows.

Hello!

On Wed, Dec 18, 2013 at 04:04:59PM -0800, Ilya G. wrote:

[…] to ~reliably avoid the CWND overflow I have to set the record size
<10k… There are also differences in how the CWND is increased (byte based
vs packet based) across different platforms, and other edge cases I’m
surely overlooking. Also, while this addresses the CWND overflow during
slow-start, smaller records offer additional benefits as they help
minimize the impact of reordering and packet loss (not eliminate it, but
reduce its negative impact in some cases).

The problem is that there are even more edge cases with packet-sized
records. Also, in practice with packet-sized records there seems
to be a significant difference in throughput. In my limited testing,
packet-sized records resulted in a 2x slowdown on large responses.
Of course the overhead may be somewhat reduced by applying smaller
records deeper in the code, but a) even in theory, there is some
overhead, and b) it doesn’t look like a trivial task when using
OpenSSL. Additionally, there may be weird “Nagle vs. delayed ack”
related effects on fast connections; it needs additional
investigation.

As of now, I tend to think that a 4k (or 8k on systems with IW10)
buffer size is optimal for latency-sensitive workloads.

Just tried it on my local server, works as advertised. :)

Defaults matter and we should optimize for best performance out of the
box… Can we update NGX_SSL_BUFSIZE as part of this patch? My current
suggestion is 1360 bytes, as this guarantees the best possible case for
helping the browser start processing data as soon as possible: minimal
impact of reordering / packet loss / no CWND overflows.

I don’t think that changing the default is a good idea; it
may/will cause performance degradation with large requests, see
above. While reducing latency is important in some cases, it’s
certainly not the only thing to consider during performance
optimization.


Maxim D.
http://nginx.org/

On Thu, Dec 19, 2013 at 2:51 AM, Anton Y. [email protected]
wrote:

[…] (in the kernel TCP stack, which is unaware of TLS records)? The kernel
on the receiver side should wait for the lost packet to be retransmitted
and return data to the application in the same order it was sent.

Yep, no need to retransmit the record, just the lost packet… The entire
record is buffered on the client until all the packets are available,
after which the MAC is verified and the contents are decrypted and
finally passed to the application.
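
(Concretely: a full 16KB record spans ~11 segments at a ~1460-byte MSS, so
a single lost segment stalls decryption of the whole record until the
retransmit arrives; with one record per segment, a loss only delays that
segment’s worth of data.)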

On Wed, Dec 18, 2013 at 4:50 PM, Alex [email protected] wrote:

requires tuning of the underlying application as well? […] I played around
with TCP_CORK (tcp_nopush), but it didn’t seem to make any difference.

Right, I ran into the same issue when testing it on this end. The very
first record goes into the first packet, and then some extra (30~50)
bytes of the following record are padded into it… from there on, most
records span two packets. The difference with the GFEs is that they flush
the packet on each record boundary.

Perhaps some nginx gurus can help with this one? :)
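
Roughly what I have in mind, at the socket level (a Linux-only sketch under
my own assumptions, not nginx code; error handling omitted):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <openssl/ssl.h>

    /* One TLS record per TCP segment: cork, write one record's worth of
     * plaintext, uncork so the kernel flushes at the record boundary. */
    static void
    write_records(SSL *ssl, int fd, const unsigned char *p, size_t len)
    {
        int on = 1, off = 0;

        while (len > 0) {
            size_t n = len < 1370 ? len : 1370;     /* payload per record */

            setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
            SSL_write(ssl, p, (int) n);             /* emits one record */
            setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* flush */

            p += n;
            len -= n;
        }
    }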

That said, small records do incur overhead due to extra framing, plus
more CPU cycles (more MACs and framing processing). So, in some instances,
if you’re delivering large streams (e.g. video), you may want to use larger
records… Exposing record size as a configurable option would address this.

Absolutely. Earlier I said Google uses a 1470-byte frame size, but that
is not true when it comes to, for example, streaming from YouTube. There
they use the standard MTU, and also large, fragmenting TLS records.

Actually, it should be even smarter: the connection starts with small
record sizes to get fast time to first frame (exact same concerns as TTFB
for HTML), and then the record size is increased as the connection opens
up. Not sure if that’s been officially rolled out 100%, but I do know that
this was the plan. The benefit here is that there is no application
tweaking required. I’d love to see this in nginx as well.
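
A sketch of the policy I mean (hypothetical helper, not existing nginx
code; the thresholds are illustrative):

    #include <stddef.h>

    /* Start with one-segment records for fast first paint, then grow the
     * record size once the connection has opened up. */
    static size_t
    dynamic_record_size(size_t bytes_sent)
    {
        if (bytes_sent < 16 * 1024) {
            return 1370;              /* fits a single TCP segment */
        }

        if (bytes_sent < 256 * 1024) {
            return 4096;
        }

        return 16384;                 /* max TLS record, bulk throughput */
    }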

On Thu, Dec 19, 2013 at 5:15 AM, Maxim D. [email protected]
wrote:

[…] b) it doesn’t look like a trivial task when using OpenSSL.
Additionally, there may be weird “Nagle vs. delayed ack” related effects
on fast connections; it needs additional investigation.

As of now, I tend to think that a 4k (or 8k on systems with IW10)
buffer size is optimal for latency-sensitive workloads.

If we assume that new systems are using IW10 (which I think is
reasonable), then an 8K default is a good / simple middle ground.

Alternatively, what are your thoughts on making this adjustment
dynamically? Start the connection with a small record size, then bump it
to a higher limit? In theory, that would also avoid the extra config flag.

On Thu, Dec 19, 2013 at 4:21 PM, Maxim D. [email protected]
wrote:

Alternatively, what are your thoughts on making this adjustment
dynamically? Start the connection with a small record size, then bump it
to a higher limit? In theory, that would also avoid the extra config flag.

In theory, this may be interesting, and I thought about it too.
But I don’t think it will stop people from asking us to add a
configuration directive anyway, and if 4k/8k works fine, there
should be no need to add extra complexity here.

For others following the discussion… Followed up on:
http://mailman.nginx.org/pipermail/nginx-devel/2013-December/004703.html

Hello!

On Thu, Dec 19, 2013 at 03:55:05PM -0800, Ilya G. wrote:

As of now, I tend to think that a 4k (or 8k on systems with IW10)
buffer size is optimal for latency-sensitive workloads.

If we assume that new systems are using IW10 (which I think is
reasonable), then an 8K default is a good / simple middle ground.

Alternatively, what are your thoughts on making this adjustment
dynamically? Start the connection with a small record size, then bump it
to a higher limit? In theory, that would also avoid the extra config flag.

In theory, this may be interesting, and I thought about it too.
But I don’t think it will stop people from asking us to add a
configuration directive anyway, and if 4k/8k works fine, there
should be no need to add extra complexity here.


Maxim D.
http://nginx.org/