High response times at high concurrency

I have a news site served by 4 Tornado instances, with nginx as a reverse
proxy in front of them.

Pages are rendered and cached in memcached, so the response time is
generally less than 3 ms according to the Tornado logs:

[I 130918 18:35:37 web:1462] 200 GET / (...) 2.43ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 3.41ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 1.96ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 2.48ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 4.09ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 2.43ms
[I 130918 18:35:37 web:1462] 200 GET / (...) 2.49ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 2.25ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 2.39ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.93ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.70ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 2.08ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.72ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 2.02ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.70ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.74ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.85ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.60ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 1.83ms
[I 130918 18:35:38 web:1462] 200 GET / (...) 2.65ms
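
For context, each page handler is roughly shaped like the sketch below. This is a minimal illustration only: the handler name, cache key, port and the use of pymemcache are my assumptions, not the real code.

import tornado.ioloop
import tornado.web
from pymemcache.client.base import Client  # assumed memcached client; any client with get/set works

cache = Client(("127.0.0.1", 11211))

class FrontPageHandler(tornado.web.RequestHandler):
    def get(self):
        # Serve the pre-rendered page straight from memcached and only
        # re-render (and re-cache) on a miss, which keeps responses at a few ms.
        page = cache.get("frontpage")
        if page is None:
            page = "<html><body>freshly rendered page</body></html>"
            cache.set("frontpage", page, expire=60)
        self.write(page)

app = tornado.web.Application([(r"/", FrontPageHandler)])
app.listen(8001)  # one of the 4 instances that nginx proxies to
tornado.ioloop.IOLoop.current().start()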

When I test this site with ab at a concurrency level of 1000, I get response
times of around 0.8 seconds. Here is the benchmark result:

Document Length: 12036 bytes

Concurrency Level: 1000
Time taken for tests: 7.974 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 10000
Total transferred: 122339941 bytes
HTML transferred: 120549941 bytes
Requests per second: 1254.07 [#/sec] (mean)
Time per request: 797.407 [ms] (mean)
Time per request: 0.797 [ms] (mean, across all concurrent requests)
Transfer rate: 14982.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    7   20.8      0      86
Processing:    57  508  473.9    315    7014
Waiting:       57  508  473.9    315    7014
Total:        143  515  471.5    321    7014

Percentage of the requests served within a certain time (ms)
50% 321
66% 371
75% 455
80% 497
90% 1306
95% 1354
98% 1405
99% 3009
100% 7014 (longest request)

I can handle ~1200 requests/second with 1000 concurrent connections, and
when I run the same benchmark with 100 concurrent connections I can again
handle around 1200 requests/second, but the response time drops to ~80 ms.

In real life, with 1000 concurrent connections users will face a
0.8 second response time, which I think is a bad value.

My question is: why do response times increase when the concurrency level
is increased?

Posted at Nginx Forum:

Hello!

On Sun, Sep 29, 2013 at 3:50 PM, laltin wrote:

I can handle ~1200 requests/second with 1000 concurrent connections, and
when I run the same benchmark with 100 concurrent connections I can again
handle around 1200 requests/second, but the response time drops to ~80 ms.

What you’re seeing is a very common phenomenon.

Tools like ab always try to load the target server to its extreme
throughput, i.e., the 1200 req/sec you’re seeing is already the
throughput limit for that service on your server. Because you’re
already at the throughput limit, a higher concurrency level has to be
paid for with higher response latency; otherwise you’d get even higher
throughput, which is impossible (remember, you’re already at the
limit ;)).
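
Put differently, once the server is saturated, mean latency simply scales with the concurrency level. A quick sanity check in Python, using the numbers from the ab runs quoted above:

# Little's law: mean latency ~= concurrency / throughput at saturation.
throughput = 1254.07  # req/sec reported by ab at the saturation point

for concurrency in (100, 1000):
    latency_ms = concurrency / throughput * 1000.0
    print(f"{concurrency:5d} concurrent -> ~{latency_ms:.0f} ms mean latency")

# Prints roughly 80 ms for 100 connections and 797 ms for 1000 connections,
# matching the two benchmark runs.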

Regards,
-agentzh

My question is: why do response times increase when the concurrency level
is increased?

What are your worker_processes and worker_connections values?

P.S. Also, just to be sure: has the OS open-files limit been raised from the
typical default of 1024?
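
For reference, the relevant pieces of nginx.conf look roughly like this (the numbers below are placeholders for illustration, not recommendations):

worker_processes  4;             # often set to the number of CPU cores
worker_rlimit_nofile  8192;      # per-worker limit on open file descriptors

events {
    worker_connections  4096;    # max simultaneous connections per worker
}

The OS-level default mentioned above is the one reported by ulimit -n.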

rr

But looking at the Tornado logs I expect around 2000 reqs/sec. Assuming
each request is handled in 2 ms, one instance can handle 500 reqs/sec, so
with 4 instances it should be 2000 reqs/sec. Yet it is stuck at
1200 reqs/sec, and I wonder why it is stuck at that point.
Does increasing the number of instances change the result?
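
Restating that expectation as a quick calculation (same numbers as above):

# Expected capacity from the Tornado logs: each request takes ~2 ms,
# so one instance should do ~500 req/sec and 4 instances ~2000 req/sec.
instances = 4
service_time_s = 0.002
expected_throughput = instances * (1.0 / service_time_s)
print(expected_throughput)  # 2000.0 req/sec, versus the ~1254 req/sec measured by ab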

Posted at Nginx Forum:

Hello!

On Mon, Sep 30, 2013 at 7:25 AM, laltin wrote:

But looking at the Tornado logs I expect around 2000 reqs/sec. Assuming
each request is handled in 2 ms, one instance can handle 500 reqs/sec, so
with 4 instances it should be 2000 reqs/sec. Yet it is stuck at
1200 reqs/sec, and I wonder why it is stuck at that point.
Does increasing the number of instances change the result?

It depends on where the bottleneck really is :)

You can use the on-CPU flamegraph and off-CPU flamegraph tools to
check the on-CPU time and off-CPU time usage by sampling your related
processes under load. In particular, using the following tools on
Linux:

https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-bt
https://github.com/agentzh/nginx-systemtap-toolkit#ngx-sample-bt-off-cpu

These tools are very general and not specific to Nginx processes, BTW.

Best regards,
-agentzh

At that level of concurrency you are only testing the throughput of ab,
not nginx. ApacheBench is a single process and is too slow to test anything
other than a single Apache instance.

In summary, ab is a piece of crap.

Posted at Nginx Forum:

It sounds like ab is already succeeding in generating a load high enough to
impact performance. You need to figure out where the bottlenecks are.

My message is intended as a helpful hint that you should not trust the stats
from ab. You need to be measuring performance on the nginx and backend
servers. Start by looking at the elapsed time in nginx and the response
times of the backends.
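
One way to get both numbers is an access-log format that records them side by side. A sketch (the format name "timing" is made up; $request_time and $upstream_response_time are standard nginx variables):

# In the http {} block: log the total time nginx spent on each request and
# the time the upstream (Tornado) took to answer it.
log_format timing '$remote_addr "$request" '
                  'request_time=$request_time '
                  'upstream_time=$upstream_response_time';

access_log /var/log/nginx/timing.log timing;

A large gap between the two values would point at nginx (or connection queueing) rather than the backends.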

Posted at Nginx Forum:

But while benchmarking with ab, I also test the site with a browser, and I
experience high response times there too. What do you recommend for testing?

Posted at Nginx Forum:

Hello.

Perhaps this page can be useful for you to choose tools for testing:

Greetings,

Oscar

My backend logs show response times generally less than 2 ms, and with 4
instances the backends can handle 2000 reqs/sec.
But when I look at the nginx logs I see response times around 200 ms, and I
think nginx is the main problem in my situation.

Posted at Nginx Forum:

Hello!

On Fri, Oct 4, 2013 at 7:48 AM, laltin wrote:

My backend logs show response times generally less than 2 ms, and with 4
instances the backends can handle 2000 reqs/sec.
But when I look at the nginx logs I see response times around 200 ms, and I
think nginx is the main problem in my situation.

As I’ve suggested, use the off-CPU and on-CPU flamegraph tools to check
your loaded Nginx worker processes. It could easily be a configuration
issue or something. Please do that.

See

https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt

and

https://github.com/agentzh/nginx-systemtap-toolkit#sample-bt-off-cpu

Also, ensure Nginx is not flooding its error log file, which is very
expensive. If it is, you will also see that clearly in the flame graphs
anyway.

Stop guessing, start tracing! :)

Best regards,
-agentzh