Quick performance deterioration when No. of clients increases


#1

Hello,

I am trying to migrate a Joomla 2.5.8 website from Apache to NGINX 1.4.2
with php-fpm 5.3.3 (and MySQL 5.5.34) on a CentOS 6.4 x86_64 Virtual
Machine (running under KVM).

The goal is to achieve better peak performance: This site has occasional
high peaks; while the normal traffic is ~10 req/sec, it may reach > 3000
req/sec for periods of a few hours (due to the type of services the site
provides - it is a non-profit, real-time seismicity-related site - so
php caching should not be more than 10 seconds).

The new VM (using Nginx) currently is in testing mode and it only has
1-core CPU / 3 GB of RAM. We tested performance with loadimpact and the
results are attached.

You can see in the load graph that as the load approaches 250 clients,
the response time rises sharply and is already unacceptable (this
happens consistently). I expected better performance, esp. since caching
is enabled. Despite many efforts, I cannot find the cause of the
bottleneck or how to deal with it. We would like to achieve better
scaling, esp. since NGINX is famous for its scaling capabilities. Having
very little experience with Nginx, I would like to ask for your
assistance with a better configuration.

When this performance deterioration occurs, we don’t see very high CPU
load (Unix load peaks at 2.5), nor RAM exhaustion (system RAM usage
appears to be below 30%). [Monitoring is through Nagios.]

Can you please guide me on how to correct this issue? Any and all
suggestions will be appreciated.

Current configuration, based on info available on the Internet, is as
follows (replaced true domain/website name and public IP address(es)):

=================== Nginx.conf ===================

user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

worker_rlimit_nofile 200000;

events {
worker_connections 8192;
multi_accept on;
use epoll;
}

http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
server_names_hash_bucket_size 64;

 log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $body_bytes_sent "$http_referer" '
                   '"$http_user_agent" "$http_x_forwarded_for"';

 log_format cache  '$remote_addr - $remote_user [$time_local] "$request" '
                   '$status $upstream_cache_status $body_bytes_sent "$http_referer" '
                   '"$http_user_agent" "$http_x_forwarded_for"';

 fastcgi_cache_path /var/cache/nginx levels=1:2 keys_zone=microcache:5m max_size=1000m;

 access_log  /var/log/nginx/access.log  main;

 sendfile           on;

 tcp_nopush         on;
 tcp_nodelay        on;
 keepalive_timeout  2;

 types_hash_max_size 2048;
 server_tokens off;

 keepalive_requests 30;

 open_file_cache max=5000 inactive=20s;
 open_file_cache_valid 30s;
 open_file_cache_min_uses 2;
 open_file_cache_errors on;

 gzip on;
 gzip_static on;
 gzip_disable "msie6";
 gzip_http_version 1.1;
 gzip_vary on;
 gzip_comp_level 6;
 gzip_proxied any;
 gzip_types text/plain text/css application/json
            application/x-javascript text/xml application/xml application/xml+rss
            text/javascript application/javascript text/x-js;
 gzip_buffers 16 8k;

 include /etc/nginx/conf.d/*.conf;

}

==================================================

================ website config ==================

server {
listen 80;
server_name www.example.com;
access_log /var/webs/wwwexample/log/access_log main;
error_log /var/webs/wwwexample/log/error_log warn;
root /var/webs/wwwexample/www/;

 index  index.php index.html index.htm index.cgi default.html default.htm default.php;

location / {
try_files $uri $uri/ /index.php?$args;
}

 location /nginx_status {
    stub_status on;
    access_log   off;
    allow 10.10.10.0/24;
    deny all;
 }

 location ~* /(images|cache|media|logs|tmp)/.*.(php|pl|py|jsp|asp|sh|cgi)$ {
     return 403;
     error_page 403 /403_error.html;
 }

 location ~ /\.ht {
     deny  all;
 }

 location /administrator {
     allow 10.10.10.0/24;
     deny all;
 }

 location ~ \.php$ {

     # Setup var defaults
     set $no_cache "";
     # If non GET/HEAD, don't cache & mark user as uncacheable for 1
     # second via cookie
     if ($request_method !~ ^(GET|HEAD)$) {
         set $no_cache "1";
     }
     # Drop no cache cookie if need be
     # (for some reason, add_header fails if included in prior if-block)
     if ($no_cache = "1") {
         add_header Set-Cookie "_mcnc=1; Max-Age=2; Path=/";
         add_header X-Microcachable "0";
     }
     # Bypass cache if no-cache cookie is set
     if ($http_cookie ~* "_mcnc") {
         set $no_cache "1";
     }
     # Bypass cache if flag is set
     fastcgi_no_cache $no_cache;
     fastcgi_cache_bypass $no_cache;
     fastcgi_cache microcache;
     fastcgi_cache_key $scheme$host$request_uri$request_method;
     fastcgi_cache_valid 200 301 302 10s;
     fastcgi_cache_use_stale updating error timeout invalid_header http_500;
     fastcgi_pass_header Set-Cookie;
     fastcgi_pass_header Cookie;
     fastcgi_ignore_headers Cache-Control Expires Set-Cookie;

     try_files $uri =404;
     include /etc/nginx/fastcgi_params;
     fastcgi_param PATH_INFO $fastcgi_script_name;
     fastcgi_intercept_errors on;

     fastcgi_buffer_size 128k;
     fastcgi_buffers 256 16k;
     fastcgi_busy_buffers_size 256k;
     fastcgi_temp_file_write_size 256k;
     fastcgi_read_timeout 240;

     fastcgi_pass unix:/tmp/php-fpm.sock;

     fastcgi_index index.php;
     include /etc/nginx/fastcgi_params;
     fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

 }

 location ~* \.(ico|pdf|flv)$ {
     expires 1d;
 }

 location ~* \.(js|css|png|jpg|jpeg|gif|swf|xml|txt)$ {
     expires 1d;
 }

}

================= php-fpm.conf ===================
include=/etc/php-fpm.d/*.conf
[global]
pid = /var/run/php-fpm/php-fpm.pid
error_log = /var/log/php-fpm/error.log

daemonize = no

============== php-fpm.d/www.conf ================

[www]
listen = /tmp/php-fpm.sock
listen.allowed_clients = 127.0.0.1
user = nginx
group = nginx

pm = dynamic
pm.max_children = 1024
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 35

slowlog = /var/log/php-fpm/www-slow.log

php_flag[display_errors] = off
php_admin_value[error_log] = /var/log/php-fpm/www-error.log
php_admin_flag[log_errors] = on
php_admin_value[memory_limit] = 128M

php_value[session.save_handler] = files
php_value[session.save_path] = /var/lib/php/session

==================================================

================ mysql my.cnf ====================

[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
symbolic-links=0
user=mysql

query_cache_limit = 2M
query_cache_size = 200M
query_cache_type=1
thread_cache_size=128
key_buffer = 100M
join_buffer = 2M
table_cache= 150M
sort_buffer= 2M
read_rnd_buffer_size=10M
tmp_table_size=200M
max_heap_table_size=200M

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

==================================================

=============== mysqltuner report ================

MySQLTuner 1.2.0 - Major H. removed_email_address@domain.invalid

-------- General Statistics

[--] Skipped version check for MySQLTuner script
[OK] Currently running supported MySQL version 5.5.34
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics

[--] Status: +Archive -BDB -Federated +InnoDB -ISAM -NDBCluster
[--] Data in MyISAM tables: 9M (Tables: 80)
[--] Data in InnoDB tables: 1M (Tables: 65)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[--] Data in MEMORY tables: 0B (Tables: 4)
[!!] Total fragmented tables: 66

-------- Security Recommendations

[OK] All database users have passwords assigned

-------- Performance Metrics

[--] Up for: 12h 51m 16s (21K q [0.471 qps], 1K conn, TX: 10M, RX: 1M)
[--] Reads / Writes: 55% / 45%
[--] Total buffers: 694.0M global + 21.4M per thread (151 max threads)
[!!] Maximum possible memory usage: 3.8G (135% of installed RAM)
[OK] Slow queries: 0% (0/21K)
[OK] Highest usage of available connections: 23% (36/151)
[OK] Key buffer size / total MyISAM indexes: 150.0M/5.1M
[OK] Key buffer hit rate: 99.3% (51K cached / 358 reads)
[OK] Query cache efficiency: 80.9% (10K cached / 13K selects)
[OK] Query cache prunes per day: 0
[OK] Sorts requiring temporary tables: 0% (0 temp sorts / 55 sorts)
[OK] Temporary tables created on disk: 8% (5 on disk / 60 total)
[OK] Thread cache hit rate: 98% (36 created / 1K connections)
[OK] Table cache hit rate: 20% (192 open / 937 opened)
[OK] Open file limit used: 0% (210/200K)
[OK] Table locks acquired immediately: 99% (4K immediate / 4K locks)
[!!] Connections aborted: 8%
[OK] InnoDB data size / buffer pool: 1.1M/128.0M

==================================================

Please advise.

Thanks and Regards,
Nick


#2

The ultimate bottleneck in any setup like this is usually raw cpu
power. A single virtual core doesn’t look like it’ll hack it. You’ve
got 35 php processes serving 250 users, and I think it’s just spread a
bit thin.
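
On the number of PHP processes: a rough back-of-the-envelope check of the pool posted above, as a minimal sketch only. It multiplies pm.max_children (1024) by memory_limit (128M) to get a theoretical ceiling. memory_limit is just a per-script cap, so real usage is normally far lower, and the function name is made up for illustration.

```python
def worst_case_pool_memory_mib(max_children: int, memory_limit_mib: int) -> int:
    """Upper bound if every child hit its memory_limit at the same time."""
    return max_children * memory_limit_mib


# Values taken from the posted php-fpm.d/www.conf.
ceiling_mib = worst_case_pool_memory_mib(max_children=1024, memory_limit_mib=128)
vm_ram_mib = 3 * 1024  # the test VM has 3 GB of RAM

print(f"theoretical ceiling: {ceiling_mib} MiB")
print(f"VM RAM:              {vm_ram_mib} MiB")
print(f"ratio:               {ceiling_mib / vm_ram_mib:.1f}x")
```

The point is only that pm.max_children, not pm.max_spare_servers, is what bounds how many PHP processes can pile up under load; it is not a claim that the pool would really use this much memory.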

Apart from adding cores, there are 2 things I’d suggest looking at

  • are you using an opcode cacher? APC ( install via pecl to get the
    latest ) works really well with php in fpm… allocate plenty of memory
    to it too
  • check the bandwidth at the network interface. The usual 100Mbit
    connection can easily get swamped by a graphics rich site - especially
    with 250 concurrent users. If this is a problem, then look at using a
    CDN to ease things.

hth,

Steve


#3

On 11.10.2013 10:18, Steve H. wrote:

The ultimate bottleneck in any setup like this is usually raw cpu
power. A single virtual core doesn’t look like it’ll hack it. You’ve
got 35 php processes serving 250 users, and I think it’s just spread a
bit thin.

Apart from adding cores, there are 2 things I’d suggest looking at

  • are you using an opcode cacher? APC ( install via pecl to get the
    latest ) works really well with php in fpm… allocate plenty of memory
    to it too

APC is sort of deprecated though (at least the opcode cache part) in
favor of zend-opcache which is integrated in php 5.5.

Regards,
Dennis


#4

On 11/10/2013 11:18 πμ, Steve H. wrote:

Apart from adding cores, there are 2 things I’d suggest looking at

  • are you using an opcode cacher? APC ( install via pecl to get the
    latest ) works really well with php in fpm… allocate plenty of
    memory to it too
  • check the bandwidth at the network interface. The usual 100Mbit
    connection can easily get swamped by a graphics rich site - especially
    with 250 concurrent users. If this is a problem, then look at using a
    CDN to ease things.

Thanks for the hints.

The strange thing is that unix load does not seem to be over-strained
when this performance deterioration occurs.

APCu seems to be enabled:

extension = apcu.so
apc.enabled=1
apc.mmap_file_mask=/tmp/apc.XXXXXX

All other params are default.

The network interface is Gigabit and should not be a problem.

We’ll add virtual RAM and cores. Any other suggestions?

I wish there were a tool that could benchmark and analyze the box and its
running services and produce suggestions for the whole LEMP stack config:
mysqld, php, php-fpm, apc, nginx! Some magic would help!!

Thanks,
Nick


#5

Hi Nick,

On Sat, Oct 12, 2013 at 04:47:50PM +0300, Nikolaos M. wrote:

We’ll add virtual RAM and cores. Any other suggestions?

did you investigate disk I/O?

I found this to be the limiting factor. If you have shell access and if
it is a Linux machine, you can run ‘top’, ‘dstat’ and ‘htop’ to get an
idea about what is happening. ‘dstat’ gives you disk I/O and network
I/O.

Kind regards,
--Toni++


#6

On 14/10/2013 5:47 μμ, Toni M. wrote:

did you investigate disk I/O?

Hi again,

Thanks for your suggestions (see below on that).

In the meantime, we have increased CPU power to 4 cores and the behavior
of the server is much better.

I found that the server was hitting a bottleneck in php-fpm because the
microcache was NOT being used: most pages were returning codes 303 and
502, and these return codes were not included in my fastcgi_cache_valid
setting. When I set:

fastcgi_cache_valid 200 301 302 303 502 3s;

I saw immediate performance gains, and the Unix load dropped to almost 0
(from 100, not a typo) during load.
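
One way to spot this kind of problem is to tally HTTP status against cache status in the access log. The following is a minimal sketch, assuming the access_log is switched to the "cache" log_format defined in nginx.conf (the posted vhost uses "main", which does not record $upstream_cache_status). The function name and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches: "<request>" <status> <upstream_cache_status> <body_bytes_sent>
LINE_RE = re.compile(r'"[^"]*" (\d{3}) (\S+) \d+ ')


def tally_cache_status(lines):
    """Count (HTTP status, cache status) pairs in 'cache'-format log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Hypothetical sample lines, not taken from the real server.
sample = [
    '10.10.10.5 - - [16/Oct/2013:10:00:00 +0300] "GET / HTTP/1.1" 303 MISS 0 "-" "tsung" "-"',
    '10.10.10.5 - - [16/Oct/2013:10:00:01 +0300] "GET / HTTP/1.1" 303 MISS 0 "-" "tsung" "-"',
    '10.10.10.5 - - [16/Oct/2013:10:00:01 +0300] "GET /index.php HTTP/1.1" 200 HIT 5120 "-" "tsung" "-"',
]

for (status, cache), n in sorted(tally_cache_status(sample).items()):
    print(status, cache, n)
```

A status code that shows up almost only as MISS is a hint that it is not covered by fastcgi_cache_valid, so every such request goes through to php-fpm.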

I used iostat during a load test and I didn’t see any serious stress on
I/O. The worst (max load) recorded entry is:

==========================================================================================================
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          85.43    0.00   12.96    0.38    0.00    1.23

Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz  await  svctm  %util
vda        0.00  136.50   0.00   21.20    0.00  1260.00    59.43     1.15  54.25   3.92   8.30
dm-0       0.00    0.00   0.00  157.50    0.00  1260.00     8.00    13.39  85.04   0.53   8.29
dm-1       0.00    0.00   0.00    0.00    0.00     0.00     0.00     0.00   0.00   0.00   0.00

Can you see a serious problem here? (I am not an expert, but, judging
from what I’ve read on the Internet, it should not be bad.)

Now my problem is that there seems to be a limit of performance to
around 1200 req/sec (which is not too bad, anyway), although CPU and
memory are ample throughout the test. Increasing the stress load beyond
that (I am using tsung for load testing) only results in more
“error_connect_emfile” errors.

See results of a test attached. (100 users arriving per second for 5
minutes (with max 10000 users), each of them hitting the homepage 100
times. Details of the test at the bottom of this mail.)

My research showed that this should be a result of file descriptor
exhaustion, however I could not find the root cause. The following seem
OK:

# cat /proc/sys/fs/file-max
592940

# ulimit -n
200000

# ulimit -Hn
200000

# ulimit -Sn
200000

# grep nofile /etc/security/limits.conf
* - nofile 200000
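
Note that "ulimit -n" reports the limit of the current shell, which is not necessarily the limit of an already-running daemon. The following is a minimal sketch (Linux only) that reads the effective limit of a given process from /proc/<pid>/limits; the function names are made up for illustration, and reading another user's process may require root.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            soft, hard = line[len(prefix):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def open_file_limits(pid="self"):
    """Read the open-file limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())


# Show the limits of this Python process when /proc is available.
if os.path.exists("/proc/self/limits"):
    print("this process:", open_file_limits())
```

Calling open_file_limits() with the PID of an nginx worker, a php-fpm child, or the load generator's process (on whichever machine it runs) shows which side of the test is actually short of descriptors.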

Could you please guide me on how to resolve this issue? What is the real
bottleneck here, and how can it be overcome?

My config remains as was initially posted (it can also be seen here:
https://www.ruby-forum.com/topic/4417776), with the difference of:
“worker_processes 4” (since we now have 4 CPU cores).

Please advise.

============================= tsung.xml

<?xml version="1.0"?>

============================== tsung.xml

Thanks and Regards,
Nick


#7

On Oct 16, 2013, at 10:07 AM, Nikolaos M. removed_email_address@domain.invalid wrote:

  1. If not, how can we safeguard the web server by setting a suitable
    limit which cannot be surpassed to cause performance deterioration?

Have you considered not having vastly more worker processes than you
have cores? (IIRC, you have configured things that way…)


Scott R.
removed_email_address@domain.invalid
http://www.elevated-dev.com/
(303) 722-0567 voice


#8

On 16/10/2013 1:32 μμ, Nikolaos M. wrote:

Now my problem is that there seems to be a limit of performance…

Increasing stress load more than that (I am using tsung for load
testing), results only to increasing “error_connect_emfile” errors.

I have been trying to resolve this behavior and I increased file
descriptors to 400,000:

# ulimit -n
400000

since:

# cat /proc/sys/fs/file-max
592940

Now, I am running the following test: X number of users per sec visit
the homepage and each one of them refreshes the page 4 times (at random
intervals).

The test scales OK up to 500 users per sec, but beyond that the
“error_connect_emfile” errors start again and performance deteriorates.
See the attached comparative chart.

So, I have two questions:

  1. Is there a way we can tweak settings to make the web server scale
    gracefully up to the limit of its resources (and not deteriorate
    performance) as load increases? Can we leverage additional RAM (the
    box always uses up to 3.5 GB RAM, despite the load, and despite the
    fact that the VM now has 6 GB)?
  2. If not, how can we safeguard the web server by setting a suitable
    limit which cannot be surpassed to cause performance deterioration?

Please advise.

Thanks and regards,
Nick


#9

On Oct 16, 2013, at 10:16 AM, Nikolaos M. removed_email_address@domain.invalid wrote:

I have (4 CPU cores and):

worker_processes 4;
worker_rlimit_nofile 400000;

events {
worker_connections 8192;
multi_accept on;
use epoll;
}

Then I have confused this thread with a different one. Sorry for the
noise.


Scott R.
removed_email_address@domain.invalid
http://www.elevated-dev.com/
(303) 722-0567 voice


#10

On 16/10/2013 7:07 μμ, Nikolaos M. wrote:

Although the test scales OK until 500 users per sec, then
“error_connect_emfile” errors start again and performance
deteriorates. See the attached comparative chart.

I resolved the “error_connect_emfile” errors by increasing the file
descriptors on the tsung machine. However, the behavior remains the same
(although no errors occur). I suspect that the problem may not be on the
nginx side but on the tsung box side: the latter may be unable to
generate a higher number of requests and handle the load.

So, I think this case might be considered “closed” until further testing
confirms findings (or rejects them).

Regards,
Nick


#11

On 16/10/2013 7:10 μμ, Scott R. wrote:

Have you considered not having vastly more worker processes than you have cores?
(IIRC, you have configured things that way…)

I have (4 CPU cores and):

worker_processes 4;
worker_rlimit_nofile 400000;

events {
worker_connections 8192;
multi_accept on;
use epoll;
}

Any ideas will be appreciated!

Nick


#12

Hi Nikolaos,

just a small follow-up on this. In your initial mail you stated

The new VM (using Nginx) currently is in testing mode and it only has
1-core CPU

as well as

When this performance deterioration occurs, we don’t see very high CPU
load (Unix load peaks at 2.5)

These numbers already tell you that your initial tests were CPU bound. A
simple way to describe the situation would be that you loaded your
system with 2.5 times as much work as it was able to handle
“simultaneously”. On average, 1.5 processes were in the run queue of the
scheduler just “waiting” for a slice of CPU time.
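
The arithmetic above can be written down as a minimal sketch. The function names are made up for illustration, and the model is simplified: on Linux the load average also counts tasks in uninterruptible (usually I/O) wait, so it is not a pure CPU measure.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Load average normalised by the number of CPU cores."""
    return load_avg / cores


def avg_waiting(load_avg: float, cores: int) -> float:
    """Average number of runnable processes left waiting for a core."""
    return max(0.0, load_avg - cores)


# Numbers from the first post: load peaked at 2.5 on a 1-core VM.
print("load per core:", load_per_core(2.5, 1))
print("avg. waiting: ", avg_waiting(2.5, 1))

# The same check against the live machine (Unix only).
if hasattr(os, "getloadavg"):
    one_min, _, _ = os.getloadavg()
    print("live load per core:", load_per_core(one_min, os.cpu_count() or 1))
```

A value above 1.0 per core means that, on average, more work is queued than the cores can run at once.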

In this configuration, you observed

You can see in the load graph that as the load approaches 250 clients,
the response time rises sharply and is already unacceptable

Later on, you wrote

In the meantime, we have increased CPU power to 4 cores and the behavior
of the server is much better.

and

Now my problem is that there seems to be a limit of performance to
around 1200 req/sec

Do you see that the rate increased by about a factor of 4? That is no
coincidence; I think these numbers clarify where the major bottleneck was
in your initial setup.

Also, there was this part of the discussion:

On 16/10/2013 7:10 μμ, Scott R. wrote:

Have you considered not having vastly more worker processes than you
have cores? (IIRC, you have configured things that way…)

I have (4 CPU cores and):

worker_processes 4;

Obviously, here you also need to consider the PHP-FPM and possibly other
processes involved in your web stack.

Ultimately, what you want at all times is a load average below
the actual number of cores in your machine (N), because you want your
machine to stay responsive, at least to internal events.

If you run more processes than N that potentially create huge CPU load,
the load average is easily pushed beyond this limit. Via a large request
rate, your users can then drive your machine to its knees. If you don’t
spawn more than N worker processes in the first place, this helps
already a lot in preventing such a user-driven lockup situation.

Cheers,

Jan-Philip