We hit the front page of Digg the other night, and our servers didn't handle it well at all. Here's a little of what happened; perhaps someone has suggestions on what to tweak!

Basic setup: nginx 0.5.35 serving static image content and passing PHP requests to 2 backend servers running Apache, all on Red Hat EL4.

Looking at the nginx error log, we first saw a lot of entries like the following:

    socket() failed (24: Too many open files) while connecting to upstream
    accept() failed (24: Too many open files) while accepting new connection
    open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)

Running ulimit -n showed 1024, so we set that to 32768 on all 3 servers. We also raised the limit in /etc/security/limits.conf.

Then we started seeing the following:

    upstream timed out (110: Connection timed out) while connecting to upstream

So perhaps the 2 backend servers couldn't handle the load? We were serving the page mostly out of memcache at this point. In any case, we couldn't figure out why that wasn't sufficient, so we replaced the page with a static HTML one.

This seemed to help, but we were now seeing a lot of these:

    connect() failed (113: No route to host) while connecting to upstream
    no live upstreams while connecting to upstream

This wasn't on every request, but on a significant percentage. This we couldn't figure out: why couldn't nginx connect to the backend servers? We ended up rebooting both backend servers, and these errors stopped.

Does anyone have any thoughts or comments? Thanks!
on 29.04.2008 22:47
on 29.04.2008 23:18
Hi Neil,

On Tue 29.04.2008 13:38, Neil Sheth wrote:
>
>We hit the front page of digg the other night, and our servers didn't
>handle it well at all. Here's a little of what happened, and perhaps
>someone has some suggestions on what to tweak!
>
>Basic setup, nginx 0.5.35, serving up static image content, and then
>passing php requests to 2 backend servers running apache, all running
>red hat el4.

What were/are the network settings on the machines?

>Now, we started seeing the following:
> upstream timed out (110: Connection timed out) while connecting to
>upstream

What was the load on the backends?
What are the Apache settings?
Have you looked at

netstat -nt

to see how many connections are in FIN* states?

>So, perhaps the 2 backend servers couldn't handle the load? We were
>serving the page mostly out of memcache at this point. In any case,
>couldn't figure out why that wasn't sufficient, so we replaced the page
>with a static html one.
>
>This seemed to help, but we were now seeing a lot of these:
> connect() failed (113: No route to host) while connecting to upstream
> no live upstreams while connecting to upstream

Have you put names or IP addresses into the nginx config?

>This wasn't on every request, but a significant percentage. This, we
>couldn't figure out. Why couldn't it connect to the backend servers?
>We ended up rebooting both of the backend servers, and these errors
>stopped.

Again: load and netstat?!

Cheers

Aleks
on 30.04.2008 01:12
On Tue, Apr 29, 2008 at 2:07 PM, Aleksandar Lazic <al-nginx@none.at> wrote:
> > Basic setup, nginx 0.5.35, serving up static image content, and then
> > passing php requests to 2 backend servers running apache, all running
> > red hat el4.
>
> What was/is the network settings on the machines?

What specific settings are you asking about?

> Have you taken a look at
>
> netstat -nt
>
> how many FIN* things do you have?

Right now it shows about 60. I'm not sure what the count of FIN connections was at the time of the Digg traffic. I did run the following (found in a forum somewhere, to give connection counts by IP):

    netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr

This showed almost 1000 connections to each of the backend servers.

> > no live upstreams while connecting to upstream
>
> Have you put names or ip-addresses into the nginx config?

IP addresses.

> > This wasn't on every request, but a significant percentage. This, we
> > couldn't figure out. Why couldn't it connect to the backend servers?
> > We ended up rebooting both of the backend servers, and these errors
> > stopped.
>
> Again load and netstat?!

Load didn't actually look that bad, if I recall. It probably peaked around 4 while this was occurring, but was generally lower.

> Cheers
>
> Aleks

Thanks for the help!
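For the FIN* question, a tally per TCP state may be more telling than the per-IP count above. The following is only a minimal sketch and was not posted in the thread: `count_tcp_states` is a hypothetical helper name, and it assumes the usual Linux `netstat -nt` layout in which the connection state is the sixth column.

```shell
# Hypothetical helper (not from the thread): tally TCP connections by state.
# Reads `netstat -nt` output on stdin and assumes the state is in column 6.
count_tcp_states() {
    # Skip the header lines, count each state, print "count state" pairs.
    awk '$1 ~ /^tcp/ { n[$6]++ }
         END { for (s in n) print n[s], s }' | sort -rn
}

# Usage on a live machine:
#   netstat -nt | count_tcp_states
```

A large number of connections stuck in FIN_WAIT or TIME_WAIT states on the backends would point at connection handling rather than raw CPU load.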
on 30.04.2008 09:23
On Tue, Apr 29, 2008 at 01:38:13PM -0700, Neil Sheth wrote:
> First, we saw a lot of entries like the following:
> socket() failed (24: Too many open files) while connecting to upstream
> accept() failed (24: Too many open files) while accepting new connection
> open() "/var/www/html/images/imagefile.jpg" failed (24: Too many open files)
>
> Running ulimit -n showed 1024, so set that to 32768 on all 3 servers.
> Also raised limit in /etc/security/limits.conf.

You need to tune your OS to increase the number of files, sockets, etc. I can't speak for Linux, but here is my tuning for FreeBSD/amd64, 4G, for a large number of sockets etc.:

http://lists.freebsd.org/pipermail/freebsd-net/2008-April/017737.html

> Now, we started seeing the following:
> upstream timed out (110: Connection timed out) while connecting to upstream
>
> So, perhaps the 2 backend servers couldn't handle the load? We were
> serving the page mostly out of memcache at this point. In any case,
> couldn't figure out why that wasn't sufficient, so we replaced the
> page with a static html one.

Yes, it seems that your backends cannot handle the load.

> This seemed to help, but we were now seeing a lot of these:
> connect() failed (113: No route to host) while connecting to upstream
> no live upstreams while connecting to upstream
>
> This wasn't on every request, but a significant percentage. This, we
> couldn't figure out. Why couldn't it connect to the backend servers?
> We ended up rebooting both of the backend servers, and these errors
> stopped.
>
> Any thoughts / comments anyone has? Thanks!

The "113: No route to host" is a network error; it may have appeared while the backends were rebooting.
on 30.04.2008 11:34
If you are using Linux, put the following line (WITHOUT the quotes)

"* hard nofile 8024"

in /etc/security/limits.conf and reboot the server. (Of course, you can also do it without rebooting.)

Or put the following in the nginx init file (e.g. /etc/init.d/nginx), in the start function before the line that starts the daemon:

ulimit -n 8024

and just restart the nginx server.

That will solve the problem. But beware: your limit is now about 8000 open files per process. Google it and tweak it if needed.

Kind Regards,
Sasa Ugrenovic

On Wed, 30 Apr 2008 11:08:52 +0400
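To judge whether a given limit is high enough, it can help to see how many descriptors a process really holds and compare that with `ulimit -n`. A rough sketch, not from the thread: `open_fd_count` is a hypothetical helper, the optional second argument (an alternate proc root) exists only to make it easy to try out, and the pid file path in the usage comment is an assumption that varies by install.

```shell
# Hypothetical helper: count the open file descriptors of a process.
# $1 = pid, $2 = proc root (optional, defaults to /proc).
open_fd_count() {
    # Each entry in <proc root>/<pid>/fd is one open descriptor.
    # Prints 0 if the directory does not exist or is not readable.
    ls "${2:-/proc}/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Usage (as root or the process owner); the pid file path is an assumption:
#   open_fd_count "$(cat /var/run/nginx.pid)"
# The pid file names the master process; the worker processes hold the
# client and upstream connections, so their pids are worth checking too.
```

Note that `ulimit -n` typed in a login shell reports that shell's limit, which is not necessarily the limit the running daemon inherited when it was started.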
on 01.05.2008 22:38
On Tue 29.04.2008 16:01, Neil Sheth wrote:
>> > Basic setup, nginx 0.5.35, serving up static image content, and then
>> > passing php requests to 2 backend servers running apache, all running
>> > red hat el4.
>>
>> What was/is the network settings on the machines?
>
>What specific settings are you asking about?

sysctl net.ipv4.tcp_fin_timeout
sysctl net.ipv4.tcp_tw_recycle
on 02.05.2008 02:17
On Thu, May 1, 2008 at 1:25 PM, Aleksandar Lazic <al-nginx@none.at> wrote:
> > What specific settings are you asking about?
>
> sysctl net.ipv4.tcp_fin_timeout

60

> sysctl net.ipv4.tcp_tw_recycle

0

Are these unreasonable? Thanks!
on 03.05.2008 11:28
On Thu 01.05.2008 17:08, Neil Sheth wrote:
>On Thu, May 1, 2008 at 1:25 PM, Aleksandar Lazic <al-nginx@none.at> wrote:
>
>> sysctl net.ipv4.tcp_fin_timeout
>60
>
>> sysctl net.ipv4.tcp_tw_recycle
>0
>
>Are these unreasonable? Thanks!

Here are some tips from another list:

http://www.formilux.org/archives/haproxy/0711/0207.html

The main question is: do you use iptables with conntrack?

HTH

Aleks
on 06.05.2008 04:49
Thanks, going through this. To be honest, it's not something I know much about, but I'm learning.

Iptables with conntrack? Looking here:
http://www.kalamazoolinux.org/presentations/20010417/conntrack.html

I do have entries in my iptables with params like --state NEW . . .
on 06.05.2008 09:06
On Mon, May 05, 2008 at 07:39:56PM -0700, Neil Sheth wrote:
> Thanks, going through this. To be honest, not something I know much
> about., but learning.
>
> Iptables with conntrack? Looking here:
> http://www.kalamazoolinux.org/presentations/20010417/conntrack.html
>
> I do have entries in my iptables with params like --state NEW . . .

Disabling conntrack is especially useful when you want your router to survive a DDoS :)

If you have conntrack enabled (state, conn*, helper and probably many other matches; also _anything_ in the nat table), every connection eats a few bytes of precious (on 32-bit) kernel low memory. The amount of memory used is limited, but once the limit is reached, new connections are dropped.

If you only use --state NEW, for TCP the match '-p tcp --syn' should be equivalent.

Best regards,
Grzegorz Nosek
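Whether the connection tracking table is close to full can be checked by comparing its current entry count with its maximum. A small sketch under stated assumptions: `conntrack_pct` is a hypothetical helper, and the /proc paths in the usage comment are those of recent kernels (nf_conntrack). Older kernels such as the one in EL4 expose the equivalent values under ip_conntrack names, so the exact paths need to be verified on the machine.

```shell
# Hypothetical helper: percentage of the conntrack table currently in use.
# $1 = current entry count, $2 = table maximum (integer arithmetic).
conntrack_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Usage on a recent kernel (paths differ on older ip_conntrack kernels):
#   conntrack_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                 "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

When the table does fill up, the kernel normally logs a "table full, dropping packet" message, so the output of dmesg is worth a look as well.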
on 06.05.2008 18:56
On Mon 05.05.2008 19:39, Neil Sheth wrote:
>Thanks, going through this. To be honest, not something I know much
>about., but learning.
>
>Iptables with conntrack? Looking here:
>http://www.kalamazoolinux.org/presentations/20010417/conntrack.html
>
>I do have entries in my iptables with params like --state NEW . . .

OK, what happens when you deliver directly from disk instead of from memcached?

What do the memcached logs show, if there are any? I haven't used it for a long time.

Aleks
on 06.05.2008 19:39
On 5/5/08, Grzegorz Nosek <grzegorz.nosek@gmail.com> wrote:
> Disabling conntrack is especially useful when you want your router to
> survive a DDoS :)
>
> If you have conntrack enabled (state, conn*, helper and probably many
> other matches; also _anything_ in the nat table), every connection eats
> a few bytes of precious (on 32-bit) kernel low memory. The amount of
> memory used is limited but after it is reached, new connections are
> dropped.
>
> If you only use --state NEW, for TCP the match '-p tcp --syn' should be
> equivalent.

Not only that, but if you don't specifically disable connection tracking, things over the loopback get dumped into the state table by default. Ugh!

http://cactuswax.net/articles/ip_conntrack-loopback-blues/