Forum: NGINX Strange $upstream_response_time latency spikes with reverse proxy

Posted by Jay Oster (Guest)
on 2013-03-15 03:07
(Received via mailing list)
Hi list!

I'm trying to debug an interesting problem where we randomly get a "high
latency" response time from our upstream servers. It appears to occur in
about 1.5% of all requests. Here's a description of the tests I've been
running to isolate the problem within nginx:

I'm using an endpoint on the upstream servers that operates extremely
quickly; a request which only responds back with the server's current 
local
UNIX timestamp. From the nginx server, I start ApacheBench with 5,000
concurrent connections directly to the upstream server (bypassing 
nginx).
Here is what a typical run of this test looks like:


Document Path:          /time/0
Document Length:        19 bytes

Concurrency Level:      5000
Time taken for tests:   0.756 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1110000 bytes
HTML transferred:       95000 bytes
Requests per second:    6617.33 [#/sec] (mean)
Time per request:       755.592 [ms] (mean)
Time per request:       0.151 [ms] (mean, across all concurrent 
requests)
Transfer rate:          1434.62 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        4   63  53.7     35     167
Processing:    22   44  19.1     38     249
Waiting:       17   35  18.8     30     243
Total:         55  107  64.4     73     401

Percentage of the requests served within a certain time (ms)
  50%     73
  66%     77
  75%     84
  80%    202
  90%    222
  95%    231
  98%    250
  99%    251
 100%    401 (longest request)



And here's the same test with the longest response times I've seen:


Document Path:          /time/0
Document Length:        19 bytes

Concurrency Level:      5000
Time taken for tests:   0.807 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1110000 bytes
HTML transferred:       95000 bytes
Requests per second:    6197.08 [#/sec] (mean)
Time per request:       806.831 [ms] (mean)
Time per request:       0.161 [ms] (mean, across all concurrent 
requests)
Transfer rate:          1343.51 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3   45  51.8     17     144
Processing:    10   29  24.4     23     623
Waiting:        9   25  24.4     18     623
Total:         26   75  67.4     39     626

Percentage of the requests served within a certain time (ms)
  50%     39
  66%     42
  75%     45
  80%    173
  90%    190
  95%    199
  98%    217
  99%    218
 100%    626 (longest request)



Not bad. Now, keep in mind, this is a SINGLE upstream server handling 
these
requests over the network. Once I change my test to point ab at the 
local
nginx, the strange latency issue rears its ugly head. I have 4 upstream
servers in my config.

Here's what the same test through nginx looks like:


Concurrency Level:      5000
Time taken for tests:   1.602 seconds
Complete requests:      5000
Failed requests:        0
Write errors:           0
Total transferred:      1170000 bytes
HTML transferred:       95000 bytes
Requests per second:    3121.08 [#/sec] (mean)
Time per request:       1602.012 [ms] (mean)
Time per request:       0.320 [ms] (mean, across all concurrent 
requests)
Transfer rate:          713.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      109  172  39.4    167     246
Processing:   106  505 143.3    530    1248
Waiting:      103  504 143.5    530    1248
Total:        344  677 108.6    696    1358

Percentage of the requests served within a certain time (ms)
  50%    696
  66%    723
  75%    741
  80%    752
  90%    768
  95%    779
  98%    786
  99%    788
 100%   1358 (longest request)



Ack! It's like nginx decides to drop an extra second on some requests 
for
no reason.

I've also recorded these test runs with nginx's access log. Here's the 
log
format, first:


    log_format main '$remote_addr - - '                 ## User's IP 
Address
                    '[$time_local] '                    ## DateTime
                    '"$request" '                       ## User's 
Request
URL
                    '$status '                          ## HTTP Code
                    '$body_bytes_sent '                 ## Bytes BODY 
ONLY
                    '"$http_referer" '                  ## User's 
Referer
                    '"$http_user_agent" '               ## User's Agent
                    '$request_time '                    ## NginX 
Response
                    '$upstream_response_time '          ## Upstream 
Response
                    '$bytes_sent '                      ## Bytes Sent 
(GZIP)
                    '$request_length';                  ## String Length



The access log has 10,000 lines total (i.e. two of these tests with 
5,000
concurrent connections), and when I sort by upstream_response_time, I 
get a
log with the first 140 lines having about 1s on the 
upstream_response_time,
and the remaining 9,860 lines show 700ms and less. Here's a snippet 
showing
the strangeness, starting with line numbers:


1:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
2:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
3:     127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 1.026 1.025 234 83
...
138:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 1.000 0.999 234 81
139:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
140:   127.0.0.1 - - [14/Mar/2013:17:57:18 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.999 0.999 234 81
141:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
142:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
143:   127.0.0.1 - - [14/Mar/2013:17:37:21 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
...
9998:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
9999:  127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.142 0.005 234 81
10000: 127.0.0.1 - - [14/Mar/2013:17:57:16 -0700] "GET /time/0 HTTP/1.0"
200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81



The upstream_response_time difference between line 140 and 141 is nearly
500ms! The total request_time also displays an interesting gap of almost
300ms. What's going on here?

The kernels have been tuned on all servers for a high number of open 
files,
and tcp buffers:

$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 119715
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 119715
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



$ cat /proc/sys/net/core/*mem_*
184217728
184217728
184217728
184217728
184217728



Also for reference, here is part of my nginx.conf which may be useful 
for
diagnosis:


worker_processes 7;
worker_rlimit_nofile 500000;

events {
    use epoll;
    worker_connections 500000;
    multi_accept on;
}

http {
    log_format main ...

    access_log ...

    ##
-----------------------------------------------------------------------
    ## TCP Tuning
    ##
-----------------------------------------------------------------------
    sendfile        off;
    tcp_nopush      off;
    tcp_nodelay     on;

    ##
-----------------------------------------------------------------------
    ## Max Data Size
    ##
-----------------------------------------------------------------------
    client_max_body_size         1k;
    client_body_buffer_size      1k;
    client_header_buffer_size    32k;
    large_client_header_buffers  200 32k;

    ##
-----------------------------------------------------------------------
    ## GZIP
    ##
-----------------------------------------------------------------------
    gzip                      on;
    gzip_min_length           1000;
    gzip_disable              msie6;
    gzip_proxied              any;
    gzip_buffers              100 64k;
    gzip_types                text/javascript;

    ##
-----------------------------------------------------------------------
    ## Proxy Load Distribution
    ##
-----------------------------------------------------------------------
    proxy_redirect            off;
    proxy_connect_timeout     5;
    proxy_send_timeout        5;
    proxy_read_timeout        8;
    proxy_next_upstream       error timeout invalid_header http_500
http_502 http_503 http_504;
    proxy_buffering           off;

    ##
-----------------------------------------------------------------------
    ## Hide 'Server: nginx' Server Header
    ##
-----------------------------------------------------------------------
    server_tokens             off;
    proxy_pass_header         Server;

    upstream ...

    server ...
}
Posted by Maxim Dounin (Guest)
on 2013-03-15 09:21
(Received via mailing list)
Hello!

On Thu, Mar 14, 2013 at 07:07:20PM -0700, Jay Oster wrote:

[...]

> 200 19 "-" "ApacheBench/2.3" 1.027 1.026 234 83
> 200 19 "-" "ApacheBench/2.3" 0.708 0.568 234 83
> 200 19 "-" "ApacheBench/2.3" 0.122 0.002 234 81
>
>
>
> The upstream_response_time difference between line 140 and 141 is nearly
> 500ms! The total request_time also displays an interesting gap of almost
> 300ms. What's going on here?

I would suggests there are packet loss and retransmits for some
reason.  Try tcpdump'ing traffic between nginx and backends to see
what goes on in details.

--
Maxim Dounin
http://nginx.org/en/donation.html
Posted by Jay Oster (Guest)
on 2013-03-16 09:37
(Received via mailing list)
Hi Maxim,

Thanks for the suggestion! It looks like packet drop is the culprit 
here.
The initial SYN packet doesn't receive a corresponding SYN-ACK from the
upstream servers, so after a 1-second timeout (TCP Retransmission 
TimeOut),
the packet is retransmitted. The question is still *why* this only 
occurs
through nginx.

To further narrow down the root cause, I moved my upstream server to the
same machine with nginx. The issue can still be replicated there. To
eliminate my upstream server as the cause (it's written in C with 
libevent,
by the way) I used the nodejs hello world demo; nodejs has trouble with 
the
5,000 concurrent connections (go figure) but the connections that are
successful (without nginx reverse proxying) all complete in less than 
one
second. When I place nginx between ApacheBench and nodejs, that 1-second
TCP RTO shows up again.

To reiterate, this is all happening on a single machine; the TCP stack 
is
involved, but not a physical network. The only common denominator is 
nginx.
Posted by Andrew Alexeev (Guest)
on 2013-03-16 16:05
(Received via mailing list)
Jay,

On Mar 16, 2013, at 12:37 PM, Jay Oster wrote:

> Hi Maxim,
>
> Thanks for the suggestion! It looks like packet drop is the culprit here. The 
initial SYN packet doesn't receive a corresponding SYN-ACK from the upstream 
servers, so after a 1-second timeout (TCP Retransmission TimeOut), the packet is 
retransmitted. The question is still *why* this only occurs through nginx.
>
> To further narrow down the root cause, I moved my upstream server to the same 
machine with nginx. The issue can still be replicated there. To eliminate my 
upstream server as the cause (it's written in C with libevent, by the way) I used 
the nodejs hello world demo; nodejs has trouble with the 5,000 concurrent 
connections (go figure) but the connections that are successful (without nginx 
reverse proxying) all complete in less than one second. When I place nginx between 
ApacheBench and nodejs, that 1-second TCP RTO shows up again.

You mean you keep seeing SYN-ACK loss through loopback?

That might sound funny, but what's the OS and the overall environment of 
that strangely behaving machine with nginx? Is it a virtualized one? Is 
the other machine any different? The more details you can provide, the 
better :)

> To reiterate, this is all happening on a single machine; the TCP stack is 
involved, but not a physical network. The only common denominator is nginx.

Can you try the same tests on the other machine, where you originally 
didn't have any problems with your application? That is, can you repeat 
nginx+app on the other machine and see if the above strange behavior 
persists?
Posted by Maxim Dounin (Guest)
on 2013-03-17 00:39
(Received via mailing list)
Hello!

On Sat, Mar 16, 2013 at 01:37:22AM -0700, Jay Oster wrote:

> Hi Maxim,
>
> Thanks for the suggestion! It looks like packet drop is the culprit here.
> The initial SYN packet doesn't receive a corresponding SYN-ACK from the
> upstream servers, so after a 1-second timeout (TCP Retransmission TimeOut),
> the packet is retransmitted. The question is still *why* this only occurs
> through nginx.

Have you tried looking on tcpdump on both backend and nginx host?
This might help to further narrow down the problem.

I could see two possible causes here:

1) A trivial one.  Listen queue of your backend service is
exhausted, and the SYN packet is dropped due to this.  This can be
easily fixed by using bigger listen queue, and also easy enough to
track as there are listen queue overflow counters available in
most OSes.

2) Some other queue in the network stack is exhausted.  This might
be nontrivial to track (but usually possible too).

> involved, but not a physical network. The only common denominator is nginx.
Use of nginx may result in another distribution of connection
attempts to a backend, resulting in bigger SYN packet bursts
(especially if you use settings like multi_accept).

--
Maxim Dounin
http://nginx.org/en/donation.html
Posted by Jason Oster (Guest)
on 2013-03-17 10:18
(Received via mailing list)
Hello Andrew,

On Mar 16, 2013, at 8:05 AM, Andrew Alexeev <andrew@nginx.com> wrote:

> You mean you keep seeing SYN-ACK loss through loopback?
That appears to be the case, yes. I've captured packets with tcpdump, 
and load them into Wireshark for easier visualization. I can see a very 
clear gap where no packets are transmitting for over 500ms, then a burst 
of ~10 SYN packets. When I look at the TCP stream flow on these SYN 
bursts, it shows an initial SYN packet almost exactly 1 second earlier 
without a corresponding SYN-ACK. I'm taking the 1-second delay to be the 
RTO. I can provide some pieces of the tcpdump capture log on Monday, to 
help illustrate.

> That might sound funny, but what's the OS and the overall environment of that 
strangely behaving machine with nginx? Is it a virtualized one? Is the other 
machine any different? The more details you can provide, the better :)

It's a 64-bit Ubuntu 12.04 VM, running on an AWS m3.xlarge. Both VMs are 
configured the same.

>> To reiterate, this is all happening on a single machine; the TCP stack is 
involved, but not a physical network. The only common denominator is nginx.
>
> Can you try the same tests on the other machine, where you originally didn't 
have any problems with your application? That is, can you repeat nginx+app on the 
other machine and see if the above strange behavior persists?

Same configuration. I'm investigating this issue because it is common 
across literally dozens of servers we have running in AWS. It occurs in 
all regions, and on all instance types. This "single server" test is the 
first time the software has been run with nginx load balancing to 
upstream processes on the same machine.
Posted by Jason Oster (Guest)
on 2013-03-17 10:24
(Received via mailing list)
Hi again, Maxim!

On Mar 16, 2013, at 4:39 PM, Maxim Dounin <mdounin@mdounin.ru> wrote:

>> through nginx.
>
> Have you tried looking on tcpdump on both backend and nginx host?
> This might help to further narrow down the problem.

I haven't yet, but I will restart my investigation there on Monday. 
Capturing packets on both sides during the same time frame may reveal 
something I haven't seen yet.

> I could see two possible causes here:
>
> 1) A trivial one.  Listen queue of your backend service is
> exhausted, and the SYN packet is dropped due to this.  This can be
> easily fixed by using bigger listen queue, and also easy enough to
> track as there are listen queue overflow counters available in
> most OSes.

Overflow queue is configured to 1024 on these hosts, though nothing 
changes when I increase it. I can however make the delay much longer by 
making the queue smaller.

> 2) Some other queue in the network stack is exhausted.  This might
> be nontrivial to track (but usually possible too).

This is interesting, and could very well be it! Do you have any 
suggestions on where to start looking?

>> To reiterate, this is all happening on a single machine; the TCP stack is
>> involved, but not a physical network. The only common denominator is nginx.
>
> Use of nginx may result in another distribution of connection
> attempts to a backend, resulting in bigger SYN packet bursts
> (especially if you use settings like multi_accept).

Got it. I don't think multi_accept is being used (it's not in the nginx 
config).

Thank you.
Posted by Maxim Dounin (Guest)
on 2013-03-17 12:43
(Received via mailing list)
Hello!

On Sun, Mar 17, 2013 at 02:23:20AM -0700, Jason Oster wrote:

[...]

> > 1) A trivial one.  Listen queue of your backend service is
> > exhausted, and the SYN packet is dropped due to this.  This
> > can be easily fixed by using bigger listen queue, and also
> > easy enough to track as there are listen queue overflow
> > counters available in most OSes.
>
> Overflow queue is configured to 1024 on these hosts, though
> nothing changes when I increase it. I can however make the delay
> much longer by making the queue smaller.

On "these hosts"?  Note that listen queue aka backlog size is
configured in _applications_ which call listen().  At a host level
you may only configure somaxconn, which is maximum allowed listen
queue size (but an application may still use anything lower, even
just 1).

Make sure to check actual listen queue sizes used on listen
sockets involved.  On Linux (you are using Linux, right?) this
should be possible with "ss -nlt" (or "netstat -nlt").

> > 2) Some other queue in the network stack is exhausted.  This
> > might be nontrivial to track (but usually possible too).
>
> This is interesting, and could very well be it! Do you have any
> suggestions on where to start looking?

I'm not a Linux expert, but quick search suggests it should be
possible with dropwatch, see e.g. here:

http://prefetch.net/blog/index.php/2011/07/11/usin...

--
Maxim Dounin
http://nginx.org/en/donation.html
Posted by Jay Oster (Guest)
on 2013-03-18 22:20
(Received via mailing list)
Hi Maxim,

On Sun, Mar 17, 2013 at 4:42 AM, Maxim Dounin <mdounin@mdounin.ru> 
wrote:

> Hello!
>
> On "these hosts"?  Note that listen queue aka backlog size is
> configured in _applications_ which call listen().  At a host level
> you may only configure somaxconn, which is maximum allowed listen
> queue size (but an application may still use anything lower, even
> just 1).
>

"These hosts" means we have a lot of servers in production right now, 
and
they all exhibit the same issue. It hasn't been a showstopper, but it's
been occurring for as long as anyone can remember. The total number of
upstream servers on a typical day is 6 machines (each running 3 service
processes), and hosts running nginx account for another 4 machines. All 
of
these are Ubuntu 12.04 64-bit VMs running on AWS EC2 m3.xlarge instance
types.

I was under the impression that /proc/sys/net/ipv4/tcp_max_syn_backlog 
was
for configuring the maximum queue size on the host. It's set to 1024, 
here,
and increasing the number doesn't change the frequency of the missed
packets.

/proc/sys/net/core/somaxconn is set to 500,000

Make sure to check actual listen queue sizes used on listen
> sockets involved.  On Linux (you are using Linux, right?) this
> should be possible with "ss -nlt" (or "netstat -nlt").


According to `ss -nlt`, send-q on these ports is set to 128. And recv-q 
on
all ports is 0. I don't know what this means for recv-q, use default? 
And
would default be 1024?

But according to `netstat -nlt` both queues are 0?


> 
http://prefetch.net/blog/index.php/2011/07/11/usin...
Thanks for the tip! I'll take some time to explore this some more. And
before anyone asks, I'm not using iptables or netfilter. That appears to 
be
a common cause for TCP overhead when investigating similar issues.

Jay
Posted by Jay Oster (Guest)
on 2013-03-18 23:09
(Received via mailing list)
_______________________________________________
nginx mailing list
nginx@nginx.org
http://mailman.nginx.org/mailman/listinfo/nginx
Posted by Andrei Belov (Guest)
on 2013-03-19 10:50
(Received via mailing list)
Hello Jay,

On Mar 19, 2013, at 2:09 , Jay Oster <jay@kodewerx.org> wrote:

> That appears to be the case, yes. I've captured packets with tcpdump, and load 
them into Wireshark for easier visualization. I can see a very clear gap where no 
packets are transmitting for over 500ms, then a burst of ~10 SYN packets. When I 
look at the TCP stream flow on these SYN bursts, it shows an initial SYN packet 
almost exactly 1 second earlier without a corresponding SYN-ACK. I'm taking the 
1-second delay to be the RTO. I can provide some pieces of the tcpdump capture log 
on Monday, to help illustrate.
>
> I double-checked, and the packet loss does *not* occur on loopback interface. It 
does occur when hitting the network with a machine's own external IP address, 
however. This is within Amazon's datacenter; the packets bounce through their 
firewall before returning to the VM.

If I understand you right, issue can be repeated in the following cases:

1) client and server are on different EC2 instances, public IPs are 
used;
2) client and server are on different EC2 instances, private IPs are 
used;
3) client and server are on a single EC2 instance, public IP is used.

And there are no problems when:

1) client and server are on a single EC2 instance, either loopback or 
private IP is used.

Please correct me if I'm wrong.

What about EC2 security group - do the client and the server use the 
same group?
How many rules are present in this group? Have you tried to either 
decrease
a number of rules used, or create "pass any to any" simple 
configuration?

And just to clarify the things - under "external IP address" do you mean 
EC2
instance's public IP, or maybe Elastic IP?
Posted by Maxim Dounin (Guest)
on 2013-03-19 15:19
(Received via mailing list)
Hello!

On Mon, Mar 18, 2013 at 02:19:26PM -0700, Jay Oster wrote:

> they all exhibit the same issue. It hasn't been a showstopper, but it's
>
> /proc/sys/net/core/somaxconn is set to 500,000

As far as I understand, tcp_max_syn_backlog configures global
cumulative limit for all listening sockets, while somaxconn limits
one listening socket backlog.  If any of the two is too small -
you'll see SYN packets dropped.

> > Make sure to check actual listen queue sizes used on listen
> > sockets involved.  On Linux (you are using Linux, right?) this
> > should be possible with "ss -nlt" (or "netstat -nlt").
>
>
> According to `ss -nlt`, send-q on these ports is set to 128. And recv-q on
> all ports is 0. I don't know what this means for recv-q, use default? And
> would default be 1024?

In "ss -nlt" output send-q column is used to display listen queue
size for listen sockets.  Number 128 here means you have listen
queue for 128 connections only.  You should tune your backends to
use bigger listen queues, 128 is certanly too small for concurency
5000 you use in your tests.

(The recv-q column should indicate current number of connections
in listen queue.)

> But according to `netstat -nlt` both queues are 0?

This means that netstat isn't showing listen queue sizes on your
host.  It looks like many linux systems still always display 0 for
listen sockets.

--
Maxim Dounin
http://nginx.org/en/donation.html
Posted by Jay Oster (Guest)
on 2013-03-19 23:17
(Received via mailing list)
Hi Maxim,

On Tue, Mar 19, 2013 at 7:19 AM, Maxim Dounin <mdounin@mdounin.ru> 
wrote:

> >
> 5000 you use in your tests.
>
> (The recv-q column should indicate current number of connections
> in listen queue.)


This is an excellent tip, thank you! Regardless of whether it fully
resolves this issue, I will see about tuning the individual listen 
socket
queues. The server is using libevent's asynchronous HTTP server module, 
so
I'm not sure how much control I have over the socket options. But I will
investigate.


> > But according to `netstat -nlt` both queues are 0?
>
> This means that netstat isn't showing listen queue sizes on your
> host.  It looks like many linux systems still always display 0 for
> listen sockets.


Pretty strange. Oh well. `ss` works for me.
Posted by Jay Oster (Guest)
on 2013-03-19 23:43
(Received via mailing list)
Hi Andrei!

On Tue, Mar 19, 2013 at 2:49 AM, Andrei Belov <defan@nginx.com> wrote:

> 1) client and server are on a single EC2 instance, either loopback or
> private IP is used.
>
> Please correct me if I'm wrong.
>

If by "client" you mean nginx, and by "server" you mean our upstream 
HTTP
service ... That is exactly correct. You could also throw in another
permutation by changing where ApacheBench is run, but it doesn't change 
the
occurrence of dropped packets; only increases average latency when AB 
and
nginx are on separate EC2 instances.


> What about EC2 security group - do the client and the server use the same
> group?
> How many rules are present in this group? Have you tried to either decrease
> a number of rules used, or create "pass any to any" simple configuration?
>

That's a great point! We have been struggling with the number of 
firewall
rules as a separate matter, in fact. There may be some relation here. 
Thank
you for reminding me.


> And just to clarify the things - under "external IP address" do you mean
> EC2
> instance's public IP, or maybe Elastic IP?


I'm talking about the instance public IPs. Elastic IPs are only used for
client access to nginx. And specifically only for managing DNS. Between
nginx and upstream servers, the public IPs are used.
Posted by Jay Oster (Guest)
on 2013-03-23 03:12
(Received via mailing list)
Hi again everyone!

Just posting a status update (because I hate coming across old threads 
with
reports of a problem I'm experiencing, and there is no answer!) What 
I've
found so far is starting to look like a Linux kernel bug that was fixed 
for
ipv6, but still remains for ipv4! Here's the relevant discussion:
https://groups.google.com/forum/?fromgroups=#!topi...

And thanks for making nginx awesome! :)
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.