Worker CPU balance

Hi all,

during the testing with the 10 Gbps network cards that Myricom donated to
the haproxy project (http://haproxy.1wt.eu/) I asked the author of this
nice piece of software whether he would be so kind as to run a test with
nginx instead of Tux. He was ;-))

Here is the description of his tests with haproxy, and below is what he
sent me back from his tests with nginx.


It works fast. Since it uses sendfile, it is as fast as Tux on large
files (>= 1MB), and saturates 10 Gbps with 10% of CPU with 1MB files.

However, it does not scale on multiple CPUs, whatever the number of
worker_processes. I’ve tried 1, 2, 8, … The processes are quite there,
but something’s preventing them from sharing a resource since the
machine never goes beyond 50% CPU used (it’s a dual core). Sometimes,
“top” looks like this :

Tasks: 189 total, 3 running, 186 sleeping, 0 stopped, 0 zombie
Cpu0 : 40.3% user, 55.2% system, 0.0% nice, 4.5% idle, 0.0% IO-wait
Cpu1 : 2.7% user, 1.3% system, 0.0% nice, 96.0% idle, 0.0% IO-wait
Mem: 2072968k total, 92576k used, 1980392k free, 11604k buffers
Swap: 0k total, 0k used, 0k free, 25656k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ Command
1984 nobody 20 0 2980 996 492 S 34.7 0.0 0:49.85 nginx.bin
1986 nobody 20 0 2980 992 488 S 34.7 0.0 0:51.91 nginx.bin
1980 nobody 20 0 2980 996 492 S 25.8 0.0 0:47.29 nginx.bin
1983 nobody 20 0 2980 996 492 S 2.0 0.0 0:48.07 nginx.bin
1988 nobody 20 0 2980 996 492 R 2.0 0.0 0:45.75 nginx.bin

Sometimes it looks like this :

Tasks: 188 total, 2 running, 186 sleeping, 0 stopped, 0 zombie
Cpu0 : 12.7% user, 12.7% system, 0.0% nice, 74.6% idle, 0.0% IO-wait
Cpu1 : 32.4% user, 39.4% system, 0.0% nice, 28.2% idle, 0.0% IO-wait
Mem: 2072968k total, 92820k used, 1980148k free, 11604k buffers
Swap: 0k total, 0k used, 0k free, 25660k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ Command
1985 nobody 20 0 2980 996 492 R 53.7 0.0 0:48.14 nginx.bin
1982 nobody 20 0 2980 992 488 S 31.8 0.0 0:39.40 nginx.bin
1986 nobody 20 0 2980 992 488 S 8.0 0.0 0:54.71 nginx.bin
1988 nobody 20 0 2980 996 492 S 5.0 0.0 0:48.79 nginx.bin
1983 nobody 20 0 2980 996 492 S 2.0 0.0 0:52.20 nginx.bin

Rather strange.

I have seen the same behaviour.

Here is a description of my setup:

cat /proc/cpuinfo of both machines:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core™2 CPU 6600 @ 2.40GHz
stepping : 6
cpu MHz : 2400.075
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4802.73
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core™2 CPU 6600 @ 2.40GHz
stepping : 6
cpu MHz : 2400.075
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4800.13
clflush size : 64

free -m (no swap usage)
             total       used       free     shared    buffers     cached
Mem:          2027       1132        895          0        155        828

When I run ab I get the following:

close:
ab -n 40000 -c 2500 http://192.168.1.17:8080/10k

Document Length: 10240 bytes
Concurrency Level: 2500
Complete requests: 40000

Server Software: lighttpd/1.5.0
Time taken for tests: 7.807093 seconds
Requests per second: 5123.55 [#/sec] (mean)

Server Software: nginx/0.6.29
Time taken for tests: 8.96004 seconds
Requests per second: 4940.71 [#/sec] (mean)

lighttpd uses both workers with similar usage.

nginx uses both workers, but one runs at ~70-80% and the other at only
~20-30%.

keep alive:
ab -n 40000 -c 2500 -k http://192.168.1.17:8080/10k

Server Software: lighttpd/1.5.0
Time taken for tests: 6.625588 seconds
Requests per second: 6037.20 [#/sec] (mean)

Server Software: nginx/0.6.29
Time taken for tests: 6.732870 seconds
Requests per second: 5941.00 [#/sec] (mean)

lighttpd uses both workers, but not evenly.
nginx uses only one worker.

Well, with a 10k file it's easy, but what is the behaviour with a 1M file?

ab -n 4000 -c 250 -k http://192.168.1.17:8080/1M

Document Length: 1048576 bytes
Concurrency Level: 250
Complete requests: 4000

Server Software: lighttpd/1.5.0
Time taken for tests: 59.870157 seconds
Keep-Alive requests: 3909
Requests per second: 66.81 [#/sec] (mean)

Server Software: nginx/0.6.29
Time taken for tests: 59.899784 seconds
Keep-Alive requests: 4000
Requests per second: 66.78 [#/sec] (mean)

lighttpd uses both workers with similar usage.
nginx again uses only the first worker.

A different picture shows up when I use inject (http://1wt.eu/tools/inject/):

--- nginx
Clients : 5499
Hits    : 178202 + 0 aborted
Bytes   : 3863054678
Duration: 61014 ms
Throughput: 63314 kB/s
Response: 2920 hits/s
Errors  : 0
Timeouts: 0
Average time per hit: 1729965.8 ms
Average time per complete page: 9067.0 ms
Start date: 1208046930 (13 Apr 2008 - 2:35:30)
Command line: ./inject29 -l -n 40000 -p 2 -o 8 -u 2500 -s 20 -G
192.168.1.17:8080/1M -d 60

--- lighty
Clients : 5000
Hits    : 57310 + 0 aborted
Bytes   : 4022907712
Duration: 61010 ms
Throughput: 65938 kB/s
Response: 939 hits/s
Errors  : 0
Timeouts: 0
Average time per hit: 0.0 ms
Average time per complete page: 0.0 ms
Start date: 1208047028 (13 Apr 2008 - 2:37:08)
Command line: ./inject29 -l -n 40000 -p 2 -o 8 -u 2500 -s 20 -G
192.168.1.17:8080/1M -d 60

With this testing tool both servers distribute the load across their
workers in a similar manner.

The first worker always gets the most 'load'.

Has anybody seen the same behaviour in the real world, or does this happen
only under test load?

You can get both config files from
http://none.at/lighttpd.conf
http://none.at/nginx.conf
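
(In case those config links stop resolving: the nginx side of such a
static-file test is fairly small. The following is only a rough sketch of
what a config like that might contain; every value here is an assumption,
not the actual file from the test.)

worker_processes  2;

events {
    worker_connections  8192;   # assumed; must cover the ab/inject concurrency
}

http {
    sendfile  on;               # the sendfile path is what the test exercises
    server {
        listen  8080;
        root    /var/www/test;  # assumed docroot holding the 10k and 1M files
    }
}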

BR

Aleks

Hi

However, it does not scale on multiple CPUs, whatever the number of
worker_processes. I’ve tried 1, 2, 8, … The processes are quite there,
but something’s preventing them from sharing a resource since the
machine never goes beyond 50% CPU used (it’s a dual core). Sometimes,
“top” looks like this :
nginx uses both workers, but one runs at ~70-80% and the other at only
~20-30%.

This is just a guess since I haven't looked at the code.

I read that nginx is event based, hence it probably is designed with a
single process reading and writing to all sockets (correct?). So in
your case perhaps the bottleneck is reading from the socket and this is
maxing out the single CPU? I would guess that the worker processes only
split the load of any processing downstream of the network socket
handling step…?

OK, that’s my guess, perhaps the author will now explain the real
problem…

Ed W
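
(One way to check a guess like this without reading the code is to watch
whether a single worker pins a single CPU while the others stay idle. A
quick sketch with standard Linux tools; the process name nginx.bin matches
the top output above, everything else is an assumption:)

mpstat -P ALL 1                       # per-CPU utilisation, incl. %soft (softirq) time
ps -o pid,psr,pcpu,comm -C nginx.bin  # psr = the CPU each worker last ran on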

Aleksandar L. wrote:


It works fast. Since it uses sendfile, it is as fast as Tux on large
files (>= 1MB), and saturates 10 Gbps with 10% of CPU with 1MB files.

However, it does not scale on multiple CPUs, whatever the number of
worker_processes. I’ve tried 1, 2, 8, … The processes are quite there,
but something’s preventing them from sharing a resource since the
machine never goes beyond 50% CPU used (it’s a dual core).

Was worker_cpu_affinity defined in the config file to ensure each worker
was on a particular CPU?
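
(For reference, on a dual-core box that would look roughly like this; the
bitmasks below are just the obvious guess for this machine and are not
taken from the config used in the thread:)

worker_processes     2;
worker_cpu_affinity  01 10;   # first worker on CPU0, second worker on CPU1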

On Sun 13.04.2008 10:19, Renaud Allard wrote:

Aleksandar L. wrote:

However, it does not scale on multiple CPUs, whatever the number of
worker_processes. I’ve tried 1, 2, 8, … The processes are quite
there, but something’s preventing them from sharing a resource since
the machine never goes beyond 50% CPU used (it’s a dual core).

Was worker_cpu_affinity defined in the config file to ensure each
worker was on a particular CPU?

No.

Aleks

On Sun, Apr 13, 2008 at 03:04:20AM +0200, Aleksandar L. wrote:

Tasks: 189 total, 3 running, 186 sleeping, 0 stopped, 0 zombie
1988 nobody 20 0 2980 996 492 R 2.0 0.0 0:45.75 nginx.bin
1985 nobody 20 0 2980 996 492 R 53.7 0.0 0:48.14 nginx.bin
1982 nobody 20 0 2980 992 488 S 31.8 0.0 0:39.40 nginx.bin
1986 nobody 20 0 2980 992 488 S 8.0 0.0 0:54.71 nginx.bin
1988 nobody 20 0 2980 996 492 S 5.0 0.0 0:48.79 nginx.bin
1983 nobody 20 0 2980 996 492 S 2.0 0.0 0:52.20 nginx.bin

Try

events {
    accept_mutex  off;
}
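
(That directive goes into the existing events block. A minimal sketch for
context; the worker_connections value is an assumption, and
accept_mutex_delay is only consulted while the mutex is on:)

events {
    worker_connections  8192;    # assumed value
    accept_mutex        off;     # let every worker call accept() itself
    # accept_mutex_delay 500ms;  # default, only used when accept_mutex is on
}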

Aleksandar L. wrote:

worker was on a particular CPU?

No.

So perhaps it is a good idea to run the test again with CPU affinity set,
to see whether the results are the same.
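
(Affinity can also be set or checked from outside nginx, which makes for a
quick A/B test without touching the config. A sketch, assuming the workers
show up as nginx.bin as in the top output above; the PIDs are placeholders:)

# show the allowed-CPU mask of every worker (taskset is part of util-linux)
for pid in $(pidof nginx.bin); do taskset -p $pid; done

# or pin two workers by hand: mask 0x1 = CPU0 only, 0x2 = CPU1 only
taskset -p 0x1 <pid_of_worker_1>
taskset -p 0x2 <pid_of_worker_2>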

On Sun 13.04.2008 10:42, Renaud Allard wrote:

So perhaps it is a good idea to run the test again with CPU affinity set,
to see whether the results are the same.

I don't think so, due to the fact that the second worker isn't used. What
do you think?

Cheers

Aleks

On Sun 13.04.2008 12:36, Igor S. wrote:

On Sun, Apr 13, 2008 at 03:04:20AM +0200, Aleksandar L. wrote:

It works fast. Since it uses sendfile, it is as fast as Tux on large
files (>= 1MB), and saturates 10 Gbps with 10% of CPU with 1MB files.

However, it does not scale on multiple CPUs, whatever the number of
worker_processes. I’ve tried 1, 2, 8, … The processes are quite
there, but something’s preventing them from sharing a resource since
the machine never goes beyond 50% CPU used (it’s a dual
core). Sometimes, “top” looks like this :
[snip]
Try

events {
    accept_mutex  off;
}

Thanks, but not much change ;-(. Any further tuning options?

With accept_mutex off (4925 hits/s):
Cpu0 : 1.0%us, 2.0%sy, 0.0%ni, 74.0%id, 0.0%wa, 0.0%hi, 23.0%si, 0.0%st
Cpu1 : 3.0%us, 2.0%sy, 0.0%ni, 70.0%id, 0.0%wa, 1.0%hi, 24.0%si, 0.0%st
Mem: 2075780k total, 2053376k used, 22404k free, 4420k buffers
Swap: 4096532k total, 192k used, 4096340k free, 1860656k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1969 al 15 0 3364 1564 540 S 12 0.1 0:20.82 nginx
1968 al 15 0 3364 1604 544 R 0 0.1 0:31.64 nginx

Without the accept_mutex setting (4976 hits/s):
Cpu0 : 0.4%us, 3.2%sy, 0.0%ni, 93.2%id, 0.3%wa, 0.2%hi, 2.8%si, 0.0%st
Cpu1 : 0.5%us, 4.6%sy, 0.0%ni, 91.6%id, 0.2%wa, 0.2%hi, 2.9%si, 0.0%st
Mem: 2075780k total, 2046476k used, 29304k free, 4464k buffers
Swap: 4096532k total, 192k used, 4096340k free, 1860656k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2016 al 15 0 3308 1532 540 S 23 0.1 0:10.15 nginx
2015 al 15 0 3308 1520 544 S 6 0.1 0:01.08 nginx

Could the scheduler be a point for optimization, or am I on the wrong
track?!

lighty (5144 hits/s):

Cpu0 : 7.0%us, 5.0%sy, 0.0%ni, 66.0%id, 0.0%wa, 3.0%hi, 19.0%si, 0.0%st
Cpu1 : 12.0%us, 5.0%sy, 0.0%ni, 62.0%id, 0.0%wa, 3.0%hi, 18.0%si, 0.0%st
Mem: 2075780k total, 2005220k used, 70560k free, 4484k buffers
Swap: 4096532k total, 192k used, 4096340k free, 1866996k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2032 root 15 0 36508 1768 900 S 24 0.1 0:02.36 lighttpd
2031 root 15 0 36512 1716 900 S 19 0.1 0:01.40 lighttpd

I'm ready to help improve nginx, but for now it looks to me like
lighty 1.5 is a little bit faster.

However, I'll stay with nginx, I like it more ;-))

Cheers

Aleks

On Sun 13.04.2008 20:40, Igor S. wrote:

On Sun, Apr 13, 2008 at 06:13:19PM +0200, Aleksandar L. wrote:

Thanks, but not much change ;-(. Any further tuning options?

[snip]

Could the scheduler be a point for optimization, or am I on the wrong
track?!

If accept_mutex is off, then the OS scheduler chooses the process that
accepts a new connection.

Hm, is there an easy way to find out whether the Linux scheduler is the
problem?!

I will try to run nginx and lighty on a Solaris x86 machine.

However, I'll stay with nginx, I like it more ;-))

In the last top shot lighttpd eats more CPU compared to nginx.
This may be the reason why the lighty processes are balanced better.

Hm, which tool can help us find the reason? Is top enough?
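
(Besides top itself, where pressing "1" gives the per-CPU split already
shown above, a couple of standard views might narrow it down; the worker
PID is a placeholder:)

vmstat 1                    # "in" = interrupts/s, "cs" = context switches/s
strace -c -p <worker_pid>   # rough syscall/time profile of one worker (adds overhead)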

It’s strange that lighty eats so much CPU although it has
server.network-backend = “linux-sendfile”

I usually see the equal load on 2CPU FreeBSD 7 (low weekend load):

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11224 nobody 1 4 -10 104M 101M kqread 0 336:12 10.79% nginx
11225 nobody 1 4 -10 108M 103M kqread 0 338:12 10.60% nginx

Hm, do you have some ideas on how I can tune nginx to reach the hits/s of
lighty?!

Cheers

Aleks
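
(For what it's worth, the knobs that usually come up for this kind of
static-file test are sketched below. These are common settings, not
something suggested in the thread, and whether they close the gap here is
untested:)

http {
    sendfile            on;
    tcp_nopush          on;    # with sendfile: send headers and file start in one packet
    tcp_nodelay         on;
    keepalive_requests  500;   # allow more requests per keep-alive connection
}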

Thanks, but not much change ;-(. Any further tuning options?

With accept_mutex off (4925 hits/s):
Cpu0 : 1.0%us, 2.0%sy, 0.0%ni, 74.0%id, 0.0%wa, 0.0%hi, 23.0%si, 0.0%st
Cpu1 : 3.0%us, 2.0%sy, 0.0%ni, 70.0%id, 0.0%wa, 1.0%hi, 24.0%si, 0.0%st
Mem: 2075780k total, 2053376k used, 22404k free, 4420k buffers
Swap: 4096532k total, 192k used, 4096340k free, 1860656k cached

The problem the haproxy guy was observing was that he couldn't get more
than 100% CPU usage on a multi-cpu machine (eg with 2 CPUs, you should be
able to get to 200%). The above data shows that the machine has over 100%
idle time, so you're not stressing it enough to prove that the change
hasn't helped.

So your tests are showing something completely different to the tests run
by the haproxy guy. His were able to saturate the CPU, but yours clearly
aren't saturating the CPU at all. Also your %si figure is quite high,
which means that the CPU is servicing a lot of software interrupts. I
wonder if that's actually the limitation here, some BKL issue because so
much time is in a serialised part of the kernel. I'm guessing you're not
using the best network card and driver for your tests.

Rob
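
(Rob's %si point can be checked directly: if all network interrupts land on
one core, that core ends up doing the softirq work for both workers. A
sketch; the IRQ number is a placeholder read off /proc/interrupts:)

cat /proc/interrupts                    # per-CPU interrupt counts; find the NIC's IRQ line
cat /proc/irq/<irq>/smp_affinity        # which CPUs may service that IRQ
echo 3 > /proc/irq/<irq>/smp_affinity   # allow CPU0 and CPU1 (hex mask 3); needs root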

On Sun, Apr 13, 2008 at 06:13:19PM +0200, Aleksandar L. wrote:

core). Sometimes, “top” looks like this :
Cpu0 : 1.0%us, 2.0%sy, 0.0%ni, 74.0%id, 0.0%wa, 0.0%hi, 23.0%si,

2015 al 15 0 3308 1520 544 S 6 0.1 0:01.08 nginx

Could the scheduler be a point for optimization, or am I on the wrong
track?!

If accept_mutex is off, then the OS scheduler chooses the process that
accepts a new connection.

2031 root 15 0 36512 1716 900 S 19 0.1 0:01.40 lighttpd

I'm ready to help improve nginx, but for now it looks to me like
lighty 1.5 is a little bit faster.

However, I'll stay with nginx, I like it more ;-))

In the last top shot lighttpd eats more CPU compared to nginx.
This may be the reason why the lighty processes are balanced better.
It’s strange that lighty eats so much CPU although it has
server.network-backend = “linux-sendfile”

I usually see the equal load on 2CPU FreeBSD 7 (low weekend load):

PID   USERNAME THR PRI NICE  SIZE   RES STATE  C   TIME   WCPU COMMAND
11224 nobody     1   4  -10  104M  101M kqread 0 336:12 10.79% nginx
11225 nobody     1   4  -10  108M  103M kqread 0 338:12 10.60% nginx

On Mon 14.04.2008 09:31, Rob M. wrote:

Thanks, but not much change ;-(. Any further tuning options?

The problem the haproxy guy was observing was that he couldn’t get
more than 100% CPU usage on a multi-cpu machine (eg with 2 CPUs, you
should be able to get to 200%). The above data shows that the machine
has over 100% idle time, so you’re not stressing it enough to prove
that the change hasn’t helped.

Yep, that's the reason why I try to use an older/slower machine.

So your tests are showing something completely different to the tests
run by the haproxy guy. His were able to saturate the CPU, but yours
clearly aren’t saturating the CPU at all. Also your %si figure is
quite high, which means that the CPU is servicing a lot of software
interrupts. I wonder if that’s actually the limitation here, some BKL
issue because so much time is in a serialised part of the kernel. I’m
guessing you’re not using the best network card and driver for your
tests.

Thanks for the input, I'll come back as soon as I have changed the machine.

Cheers

Aleks