Re: Surviving Digg?

open() “/var/www/html/images/imagefile.jpg” failed (24: Too many open files)
Running ulimit -n showed 1024, so set that to 32768 on all 3 servers.
Also raised limit in /etc/security/limits.conf.
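
For reference, the /etc/security/limits.conf entries for that usually look something like the lines below (the "nginx" user name is just an assumption here - use whatever user your workers actually run as):

nginx    soft    nofile    32768
nginx    hard    nofile    32768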

Congrats on the digg, I think :)

Seems like you got that part under control now as far as the file
descriptors. You may want to raise your worker_connections value. I
would also make sure nginx is actually seeing the 32768 FDs (you can set
the limit in your shell environment, but nginx may not have it in its
own environment) by running the error log at the notice level and
watching it as you start up or reload the nginx config. If you put
worker_connections up to something like 4096 (even if only temporarily),
it’ll output a warning to the log if it doesn’t have access to more than
1024 FDs; with the default worker_connections of 1024 it will not.
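
A rough sketch of the relevant nginx.conf pieces - the numbers and log path are placeholders to adjust for your setup, not recommendations:

worker_rlimit_nofile  32768;                    # main context: FD limit for each worker process
error_log  /var/log/nginx/error.log  notice;    # temporarily verbose so the FD warning shows up

events {
    worker_connections  4096;                   # per worker; up from the 1024 default
}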

Now, we started seeing the following:
upstream timed out (110: Connection timed out) while connecting to upstream
So, perhaps the 2 backend servers couldn’t handle the load? We were

That’s what it would seem to me. What is your proxy_connect_timeout set
to? If it’s not set, I think off the top of my head it defaults to 60s.
That is a long time for the backend servers to take just to accept a
connection, at any volume level.
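
If you do want it explicit, it’s just a directive alongside your proxy_pass - the 5s here is only an illustration, and "backend" stands in for whatever your upstream block is actually named:

location / {
    proxy_pass            http://backend;
    proxy_connect_timeout 5s;    # fail fast if a backend won't even accept the connection
}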

I would look hard at your upstream servers, as it seems nginx may have
been doing its job but the upstreams could not keep up. Perhaps there
is a network, db, or app level performance issue to be addressed. Or,
depending on the level of traffic, you may simply have needed 5x or 10x
the number of instances to handle the load.
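
At the nginx level, adding capacity is just more server lines in the upstream block - these addresses are made up, purely for illustration:

upstream backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    server 10.0.0.13:8080;    # keep adding app instances as traffic grows
}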

We ended up rebooting both of the backend servers, and these errors
stopped.

Well, that is interesting. Perhaps the backend servers ran out of
resources and were pushed past the point of no return. At any rate I’d
recommend using some stress testing tools, trying to reproduce the
problem, and watching what happens on the upstream boxes. It will
probably be quite revealing. HTH.
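
Even a quick ApacheBench run against a staging copy will tell you a lot - the URL and numbers below are only placeholders:

ab -n 20000 -c 500 http://staging.example.com/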


On Tue, Apr 29, 2008 at 2:26 PM, Rt Ibmer [email protected] wrote:

open() “/var/www/html/images/imagefile.jpg” failed (24: Too many open files)
Running ulimit -n showed 1024, so set that to 32768 on all 3 servers.
Also raised limit in /etc/security/limits.conf.

Congrats on the digg, I think :)

Seems like you got that part under control now as far as the file descriptors. You may want to raise your worker_connections value. I would also make sure nginx is actually seeing the 32768 FDs (you can set the limit in your shell environment, but nginx may not have it in its own environment) by running the error log at the notice level and watching it as you start up or reload the nginx config. If you put worker_connections up to something like 4096 (even if only temporarily), it’ll output a warning to the log if it doesn’t have access to more than 1024 FDs; with the default worker_connections of 1024 it will not.

Ok, tried that, no notice printed out. What is a good value for
worker_connections?

Now, we started seeing the following:
upstream timed out (110: Connection timed out) while connecting to upstream

So, perhaps the 2 backend servers couldn’t handle the load? We were

That’s what it would seem to me. What is your proxy_connect_timeout set to? If it’s not set, I think off the top of my head it defaults to 60s. That is a long time for the backend servers to take just to accept a connection, at any volume level.

proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;

Are these reasonable?

I would look hard at your upstream servers, as it seems nginx may have been doing its job but the upstreams could not keep up. Perhaps there is a network, db, or app level performance issue to be addressed. Or, depending on the level of traffic, you may simply have needed 5x or 10x the number of instances to handle the load.

Yup, just trying to figure out exactly what it is that we need! We’ve
hit the front page of digg before, though it wasn’t as big last time.

We ended up rebooting both of the backend servers, and these errors
stopped.

Well, that is interesting. Perhaps the backend servers ran out of resources and were pushed past the point of no return. At any rate I’d recommend using some stress testing tools, trying to reproduce the problem, and watching what happens on the upstream boxes. It will probably be quite revealing. HTH.

Thanks for the help!