Performance issue.. after a while

Bogdan_I · February 25, 2006, 9:34pm

Hello,

I have a project running on a dedicated server:
Debian, P4 CPU 3.00GHz, 1GB RAM,
ruby 1.8.4 (2005-12-24) [x86_64-linux],
rails (1.0.0), activerecord (1.13.2)
lighttpd-1.4.10 + fastcgi + mysql 5.0
7 dispatchers.

The project is a game, so a typical user would visit 100+ pages.
When the server is busiest, it gets 35-40k requests/hour.

For some mysterious reason, after a number of hours the whole thing
starts moving slower. Typically the server load goes up to 5-8, and I
know that I have to either start killing dispatch.fcgi processes or
simply restart the whole thing.
It is definitely not that the server cannot deal with the number of
requests. It appears that some of the dispatch.fcgi processes simply
bring the server to a semi-halt. Killing the culprit makes the load go
under 1% and the game itself several times faster. The problem is that
I never know which one is causing the problems.
I have attempted to find and fix memory leaks. I have removed rmagick
from file_column, since it was said that rmagick was causing leaks.
I have removed the unnecessary services and I am keeping the lighttpd
configuration to a minimum, yet I pretty much have to restart the
server daily.
Are there any special tricks needed to make the dispatchers behave?
And maybe to use less RAM?
Any suggestions are welcome.

USER      PID %CPU %MEM    VSZ    RSS TTY STAT START TIME COMMAND
avirtual 8177  0.1  0.2  20124   2232 ?   S    13:35 0:11 lighttpd -f /root/lighttpd.conf
avirtual 8178  2.2 11.2 147620 115436 ?   R    13:35 2:37 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8179  2.0 14.2 177640 145588 ?   S    13:35 2:22 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8180  2.1 13.6 172560 140140 ?   S    13:35 2:31 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8181  1.4  0.1 178156   1512 ?   S    13:35 1:43 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8182  2.2  9.1 131236  93564 ?   R    13:35 2:34 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8184  2.0 14.1 177920 145164 ?   S    13:35 2:23 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8186  2.7 13.8 173764 141844 ?   S    13:35 3:12 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi

Bogdan_I · February 25, 2006, 10:05pm

I would go through all your code and make sure there is no possibility
of an infinite loop. Every time I’ve had an app run fine for a while
and then suddenly start crawling, it was because of some infinite loop
that I hadn’t noticed. In a game I’m sure there are lots of
opportunities for things like this to happen.

Pat

Bogdan_I · February 25, 2006, 11:23pm

Bogdan-

I have had troubles with lighttpd 1.4.10, and I am currently running
my apps on either 1.4.8 or 1.4.9. 1.4.9 has been working great for me
on Debian specifically. I got something similar with 1.4.10, where I
got zombied fcgi’s. So try downgrading to 1.4.9 or 1.4.8 and see if
that solves your problem. Also, are you running your fcgi’s with unix
sockets or over IP:PORTNUM? You might want to run the fcgi’s each on
their own consecutive port numbers as standalone spawn-fcgi’s and let
lighty just load balance between them. This way you can reap them
easily with script/process/reaper. Or you can grep through the
ps awxx | grep dispatch.fcgi results, see which ones are zombied, and
kill and respawn them. You could do this in a script.
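
A rough sketch of what such a script could look like, in Ruby. ps output
alone does not say which dispatcher is hung, so this sketch swaps in
resident memory (the RSS column) as the selection criterion; that choice,
the RSS_LIMIT_KB threshold and the oversized_dispatchers helper are all
assumptions made for illustration, and the parsing assumes the column
layout of `ps aux`.

```ruby
# Sketch: pick out dispatch.fcgi processes whose resident memory is over a limit.
# RSS_LIMIT_KB is a hypothetical threshold; tune it for your own application.
RSS_LIMIT_KB = 120_000

# Takes `ps aux`-style text and returns [pid, rss_kb] pairs for every
# dispatch.fcgi process whose RSS exceeds limit_kb.
# Assumed columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND...
def oversized_dispatchers(ps_output, limit_kb = RSS_LIMIT_KB)
  found = []
  ps_output.each_line do |line|
    cols = line.split
    next if cols.length < 11                      # skip short or blank lines
    command = cols[10..-1].join(" ")
    next unless command.include?("dispatch.fcgi") # also skips the header row
    pid = cols[1].to_i
    rss = cols[5].to_i
    found << [pid, rss] if rss > limit_kb
  end
  found
end
```

Feeding it the text of a `ps aux` run gives a list of candidate PIDs. What
to do with them (for example Process.kill("TERM", pid) followed by a
respawn) depends on how the dispatchers were started, so that part is left
out here.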

Cheers-
-Ezra

Bogdan_I · February 25, 2006, 11:39pm

I can’t really blame 1.4.10 for the troubles. I upgraded to 1.4.10
only several days ago, to get a bug fix.
I have a basic lighttpd.conf which uses unix sockets.
I will look for some documentation/blogs regarding spawning fcgi’s on
different ports.
If you have a configuration file that I could look at, it would be
great.

Thanks,
Bogdan

Bogdan_I · February 26, 2006, 12:00am

Bogdan-

Here are a few links that might help:

http://jamis.jamisbuck.org/articles/2006/02/11/tip-textdrive-and-lighttpd
http://techno-weenie.net/svn/projects/misc/spawner/Rakefile
http://poocs.net/articles/2006/02/14/killing-me-softly-keeping-dispatchers-alive

-Ezra

Bogdan_I · February 26, 2006, 3:30am

On Feb 25, 2006, at 12:32 PM, Bogdan I. wrote:

I have a project running on a dedicated server:
Debian, P4 CPU 3.00GHz, 1GB RAM,
ruby 1.8.4 (2005-12-24) [x86_64-linux],
rails (1.0.0), activerecord (1.13.2)
lighttpd-1.4.10 + fastcgi + mysql 5.0
7 dispatchers.

The project is a game, so a typical user would visit 100+ pages.
When the server is busiest, it gets 35-40k requests/hour.

You’re using caching, right? Judging from your process run times
Rails doesn’t see most of those requests. Your processes should
accumulate several minutes of CPU time if you’re serving nearly a
million requests per day.

For some mysterious reason after a number of hours the whole thing
starts moving slower, typically the server load goes up to 5-8 and
I know that I have to either start killing dispatch.fcgi processes,
or simply restart the whole thing.

From your process sizes you’re probably spending most of your time
swapping.

avirtual 8184  2.0 14.1 177920 145164 ?   S    13:35 2:23 /usr/bin/ruby1.8
avirtual 8186  2.7 13.8 173764 141844 ?   S    13:35 3:12 /usr/bin/ruby1.8
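
The arithmetic behind the swapping suspicion can be spelled out with the
numbers from the first ps listing. This is only a sketch: it adds up the
RSS column of the seven dispatchers and compares it with the 1GB of RAM
mentioned in the original post, ignoring MySQL, lighttpd and the kernel,
which also need memory. The constant and method names are made up for
illustration.

```ruby
# RSS values (KB) of the seven dispatch.fcgi processes in the first ps listing.
DISPATCHER_RSS_KB = [115_436, 145_588, 140_140, 1_512, 93_564, 145_164, 141_844]
TOTAL_RAM_KB = 1024 * 1024 # 1GB of RAM, as stated in the original post

# Sum of resident memory over all dispatchers, in KB.
def resident_total_kb(rss_values)
  rss_values.inject(0) { |sum, kb| sum + kb }
end

# Fraction of physical RAM held by the dispatchers alone.
def share_of_ram(rss_values, total_ram_kb)
  resident_total_kb(rss_values).to_f / total_ram_kb
end

percent = (share_of_ram(DISPATCHER_RSS_KB, TOTAL_RAM_KB) * 100).round
puts "dispatchers resident: #{resident_total_kb(DISPATCHER_RSS_KB)} KB (#{percent}% of RAM)"
# prints "dispatchers resident: 783248 KB (75% of RAM)"
```

Process 8181 is also worth a look: 1512 KB resident against 178156 KB
virtual is consistent with a process that has been almost entirely paged
out.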

Judging from your process times I doubt you need seven fastcgi
processes. It looks like you sent this mail nine hours (at 22:32)
after starting these processes and they’ve each accumulated less than
three minutes of CPU time. Try running just four.

How big is your app when you start it? 130MB to 180MB virtual is
alarmingly large.

--
Eric H. - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com

Bogdan_I · February 26, 2006, 6:07pm

Changing the way the dispatchers are started seems to have generated
immediate results.

Lighttpd.conf before:
fastcgi.server = ( ".fcgi" =>
  ( "localhost" =>
    ( "min-procs" => 3,
      "max-procs" => 5,
      "socket" => "/home/avirtual/railsData/log/fcgi.socket",
      "bin-path" => "/home/avirtual/railsData/public/dispatch.fcgi",
      "bin-environment" => ( "RAILS_ENV" => "production" ) ) ) )

Lighttpd.conf now:
fastcgi.server = ( ".fcgi" => ( "localhost" =>
    ( "socket" => "/home/avirtual/railsData/tmp/railsData-0.socket" ),
    ( "socket" => "/home/avirtual/railsData/tmp/railsData-1.socket" ),
    ( "socket" => "/home/avirtual/railsData/tmp/railsData-2.socket" ),
    ( "socket" => "/home/avirtual/railsData/tmp/railsData-3.socket" ),
    ( "socket" => "/home/avirtual/railsData/tmp/railsData-4.socket" )
) )
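
For anyone repeating this with a different number of dispatchers, the
socket list could be generated instead of typed by hand. A minimal sketch
in Ruby, assuming the same directory and naming pattern as in the new
configuration; fastcgi_server_block is a hypothetical helper, not part of
Rails or lighttpd, and it only reproduces the shape of the block shown
here. The sockets themselves still have to be created by whatever spawns
the dispatchers.

```ruby
# Hypothetical helper: builds a fastcgi.server block with one unix socket
# per dispatcher, in the same shape as the hand-written configuration.
def fastcgi_server_block(socket_dir, app_name, count)
  entries = (0...count).map do |i|
    socket = File.join(socket_dir, "#{app_name}-#{i}.socket")
    "    ( \"socket\" => \"#{socket}\" )"
  end
  header = 'fastcgi.server = ( ".fcgi" => ( "localhost" =>'
  [header, entries.join(",\n"), ") )"].join("\n")
end

# Five dispatchers, numbered 0 through 4, as in this thread.
puts fastcgi_server_block("/home/avirtual/railsData/tmp", "railsData", 5)
```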

I’ve created a spawner as described in:
http://jamis.jamisbuck.org/articles/2006/02/11/tip-textdrive-and-lighttpd

Somehow the dispatchers use less RAM and they do not jump to 130M
instantly.
It is obvious that the requests are balanced in a logical way now, and
that the first dispatchers handle most requests.
avirtual 19450  7.9  5.9  89876  60584 ?   S    07:08 23:17 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 19452  1.3  5.6  87108  57848 ?   S    07:08  3:57 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 19454  0.2  3.7  67476  38204 ?   S    07:08  0:51 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 19456  0.0  3.9  70088  40756 ?   S    07:08  0:17 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 19458  0.0  3.9  70160  40824 ?   S    07:08  0:07 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi

The performance has been steady during the day. I will know more in a
couple of hours or tomorrow, but it seems that although I haven’t found
any ‘omygod what a stupid endless loop’ in the code, changing the
configuration helped more than I could have anticipated.

Now, maybe someone more experienced could try to explain why the
standard lighttpd configuration was so bad in my case.

Bogdan

Bogdan_I · February 26, 2006, 11:00pm

On Feb 26, 2006, at 9:04 AM, Bogdan I. wrote:

Bogdan-

There is nothing wrong with the standard Rails lighty conf that uses
min-procs/max-procs. But dynamic spawning was removed from lighty
sometime in the 1.3.x series, so min-procs doesn’t have any effect at
all and lighty will always spawn however many processes you set
max-procs to. For some unknown reason, the way this works can get a
little weird under heavier load with more fcgi’s. The load balancing
between fcgi’s doesn’t seem to work as well with the
min-procs/max-procs directives and sockets. So like you, I have had
much better luck with explicitly listing all fcgi listeners in lighty
and using spawn-fcgi to load the fcgi listeners standalone.

I have also had really good luck with using IP:PORTNUM listeners for
the fcgi’s instead of sockets. It seems to me that lighty has an
easier time load balancing between listeners when it doesn’t have to
think about it as much and the fcgi’s are each listed explicitly.

I’m glad it’s running for you. 200-250 MB of RAM for each fcgi seems
a bit excessive. My fcgi’s are usually between 25-80MB RAM each. But
you are running a game, so maybe they are each doing more work and
holding more in memory than I am.

Cheers-
-Ezra

Bogdan_I · February 26, 2006, 10:31am

On 2/26/06, Eric H. wrote:

You’re using caching, right? Judging from your process run times
Rails doesn’t see most of those requests. Your processes should
accumulate several minutes of CPU time if you’re serving nearly a
million requests per day.

The content is dynamic with no static pages. The ‘ps’ was taken about
1 hour after killing several processes.
This is how it looks after 15 hours
(note that in the past 4-5 hours, the server was more or less idle):

USER      PID %CPU %MEM    VSZ    RSS TTY STAT START  TIME COMMAND
avirtual 8179  1.7 10.3 209064 105580 ?   S    Feb25 15:24 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8181  1.5 13.8 174060 142088 ?   S    Feb25 13:19 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8182  1.8 15.5 192668 159224 ?   S    Feb25 15:59 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8631  1.6  0.1 180692   1520 ?   S    Feb25 12:02 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8669  1.8 14.6 183140 149996 ?   S    Feb25 13:42 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8918  0.0  0.1 129200   1472 ?   S    Feb25  0:02 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi
avirtual 8927  1.5 14.8 183840 151868 ?   S    Feb25 10:51 /usr/bin/ruby1.8 /home/avirtual/railsData/public/dispatch.fcgi

Judging from your process times I doubt you need seven fastcgi
processes. It looks like you sent this mail nine hours (at 22:32)
after starting these processes and they’ve each accumulated less than
three minutes of CPU time. Try running just four.

How big is your app when you start it? 130MB to 180MB virtual is
alarmingly large.

Lighttpd was started 2-3 hours before I took the ps. It was not a rush
hour.
Right after starting lighttpd, an idle dispatch.fcgi takes 53-63MB of
RAM.
The one or two that are active will quickly jump to 131MB.
After that, in a matter of hours, all of them will jump to 200-220MB.

In the meantime I’ve lowered the number of dispatchers to 5 (though it
seems the dispatchers are simply attempting to steal as much RAM as
possible) and compiled ruby on the server.
I am also going to try to spawn fcgi’s as separate processes and see
how it goes.
Bogdan

Bogdan_I · February 26, 2006, 11:15pm

It’s been 10 hours since I’ve started lighttpd with the new
configuration.
The top dispatch.fcgi now uses 100MB. The other 4 are at 88-90MB.
Plus, no lag at all and the fifth dispatcher barely gets used.
Also the system load is under 1%.
There is something magical in this configuration.

Bogdan_I · February 26, 2006, 11:27pm

On Feb 26, 2006, at 2:15 PM, Bogdan I. wrote:

It’s been 10 hours since I’ve started lighttpd with the new
configuration.
The top dispatch.fcgi now uses 100MB. The other 4 are at 88-90MB.
Plus, no lag at all and the fifth dispatcher barely gets used.
Also the system load is under 1%.
There is something magical in this configuration.

Cool, that’s how lighty should behave ;^)

-Ezra