Nginx reverse proxy crashes when DNS is unavailable

Hi,

I am currently planning to use nginx on several thousand devices as a
reverse-proxy caching system.

It currently works as expected (thanks Igor!) and caches files as they
are requested by the devices.

The only problem we hit is when nginx starts before the DNS system is
available on the units. Nginx will crash, saying it is unable to
connect to the remote host being proxied.

1739#0: host not found in upstream "content.dev.local" in
/usr/local/nginx/conf/nginx.conf:33

Starting nginx again works (as the DNS is now responding properly).

Any idea how to work around this, or should I file a bug report? (Nginx
shouldn't crash when the remote is not available; it should instead try
to reach it as requests come in.)

Nginx does not die when the remote drops and comes back (by pulling the
network cable, for example). It only crashes when nginx is launched while
the DNS system is not yet available.


Hello!

On Thu, Oct 22, 2009 at 02:02:14PM -0400, masom wrote:

1739#0: host not found in upstream "content.dev.local" in /usr/local/nginx/conf/nginx.conf:33
Crashing and refusing to start are quite different things. As you
have no DNS available during start, nginx just can't proceed any
further since it doesn't know what your config means. Once
started, it won't depend on DNS anymore.

To avoid such issues on start there are two basic options (sketched below):

  1. Use IP addresses in the config instead of host names.

  2. Make sure your OS resolving subsystem always returns meaningful
    results to nginx - either by launching nginx once DNS is available
    or by adding relevant entries to /etc/hosts.
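
For illustration, a minimal sketch of both options; the address 10.0.0.5
is just a made-up placeholder for the real content server:

# option 1: point proxy_pass at an ip address in nginx.conf
location / {
    proxy_pass http://10.0.0.5;
}

# option 2: keep the hostname in nginx.conf, but pin it in /etc/hosts
# so gethostbyname() succeeds even before DNS is up
10.0.0.5    content.dev.local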

Maxim D.

But shouldn't nginx start anyway if the endpoint is not responding, and
just try to reach it later?

I can't really see why it would need to stop or crash when either the
endpoint (Apache) or the DNS system is unavailable.

Yes, it should return 5xx errors saying the endpoint is unreachable (DNS
or server failure / not responding), but nginx should not "lock up" after
one bad answer.

Current problem:

unit starts
DHCP kicks in
nginx gets started before the DHCP process has completed
nginx realizes that content.dev.local is not reachable (DNS settings are
not yet set by DHCP)
nginx exits
Browser on the unit starts and says the address is unreachable (as nginx
did not start).

Shouldn't nginx just attempt to connect to the endpoint as requests are
coming in?

The solution we are considering is a hosts file entry that would always
point to a static IP for the content server, but that would be a bit of a
management problem, as we are deploying to several different locations
with different networks.


I've seen this as well. If my DNS server becomes unavailable, even
temporarily, at least one nginx worker will crash, using 100% CPU time
and never responding to requests again. It used to be that my DNS
server would be automatically restarted, and this problem would not
necessarily affect all workers per incident; i.e. if I have 8 workers,
maybe 1 or 2 would permanently lock up when the DNS server fails. The
rest would continue processing requests once DNS became available, and
the crashed workers would spin one CPU core at 100% until I restart
nginx.

It would be handy if this were solved. Also, it would make sense if
you could specify more than one resolver for nginx to use, or if
nginx defaulted to using the resolvers in /etc/resolv.conf.

Since we're on the subject, it seems the same behavior happens during
an upstream failure. The nginx children are working fine, then an upstream
such as Apache on the machine temporarily fails or is unresponsive, and
nginx stops working properly even after Apache rights itself; the
whole thing can only be solved by restarting nginx. In this case, the
nginx workers will either not respond to requests at all (sending
a blank page) or will respond with a 500 / bad gateway error, even though
connecting to the gateway directly on its own port works fine, and
restarting nginx solves the issue.

Nginx has definitely been great for me, and if these two things didn’t
tend to happen from time to time, I would consider it bulletproof.

Thanks,
Gabe

I just re-read your reply. Are you saying that if I don’t use a
resolver in the nginx.conf, it will use the system resolvers? If the
first system resolver doesn’t work, will nginx automatically fall back
to other resolvers in the /etc/resolv.conf file without encountering
this failure condition (obviously in this case there will be some
delay waiting for the first resolver to time out)?

I would say I'm seeing a similar issue. I don't know if it hard locks
after one failed request or if it takes several, but I do see a
continued lockup of an nginx worker if that worker encounters a
failure condition once. One would hope the worker would become
responsive again once the upstream service (Apache or DNS) becomes
responsive again, but it does not.

However, in your case, if nginx tries to start before it has an IP
address to bind to, yes, you can expect it not to work. At the very
least, nginx can only bind to the addresses on the system, so you'll
need to start / restart nginx after DHCP has completed.

Actually, in your case, if nginx is fully exiting, you can have a
script just check whether nginx is up and, if not, restart it. In my
case, nginx stays in a zombie state, considered up but not working
right.

Here's a script I use to check whether nginx and apache are up; if
nginx is down, it kills apache, starts nginx, and starts apache. If
apache is down and nginx is up, it just restarts apache.

Replace instances of "/usr/local/apache-php/bin/httpd" with wherever
you've installed apache.

#!/bin/bash

# look for a running httpd (php) process; field 11 of "ps aux" is the command path
this=$(ps aux | grep httpd | awk '{print $11}' | grep "/apache-php/bin/httpd" | head -1)
if [ "$this" = "/usr/local/apache-php/bin/httpd" ]; then
    echo "httpd (php) was found"
else
    echo "httpd (php) was not found"
    /usr/local/apache-php/bin/apachectl restart
    sleep 1
    /usr/local/apache-php/bin/apachectl start
fi

# look for a running nginx master process; field 14 is the binary path in
# the "nginx: master process /usr/local/nginx/sbin/nginx" line
thing=$(ps aux | grep nginx | awk '{print $14}' | grep nginx | head -1)
if [ "$thing" = "/usr/local/nginx/sbin/nginx" ]; then
    echo "nginx was found"
else
    echo "nginx not found"
    # stop apache first (forcefully if needed) so nginx can take its ports
    /usr/local/apache-php/bin/apachectl stop
    sleep 4
    killall /usr/local/apache-php/bin/httpd
    sleep 4
    killall -KILL /usr/local/apache-php/bin/httpd
    sleep 4
    # raise the open-file limit, then bring nginx and apache back up
    ulimit -HSn 8192
    /usr/local/nginx/sbin/nginx
    sleep 4
    /usr/local/apache-php/bin/apachectl restart
    sleep 1
    /usr/local/apache-php/bin/apachectl start
    /usr/local/spri/spri -q    # site-specific extra step from the original setup
fi

You can just put this in a cron every 2 minutes, and this would seem
to solve your issue.
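
For example, a crontab entry along these lines (the script path is just a
placeholder for wherever you save it):

# run the watchdog every 2 minutes
*/2 * * * * /usr/local/sbin/check-nginx-apache.sh >> /var/log/check-nginx.log 2>&1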

Hello!

On Thu, Oct 22, 2009 at 12:05:57PM -0700, Gabriel R. wrote:

I just re-read your reply. Are you saying that if I don’t use a
resolver in the nginx.conf, it will use the system resolvers? If the
first system resolver doesn’t work, will nginx automatically fall back
to other resolvers in the /etc/resolv.conf file without encountering
this failure condition (obviously in this case there will be some
delay waiting for the first resolver to time out)?

While parsing the configuration, nginx just uses the system resolver via
the gethostbyname() function. It's up to the system to handle this.
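
For reference, a typical /etc/resolv.conf with more than one nameserver
looks like the following (addresses are placeholders); the libc resolver
tries the entries in order, so fallback at this level is handled by the
system, not by nginx:

nameserver 192.168.0.1
nameserver 192.168.0.2
options timeout:2 attempts:2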

Maxim D.

Hello!

On Thu, Oct 22, 2009 at 12:04:26PM -0700, Gabriel R. wrote:

I’ve seen this as well. If my DNS server becomes unavailable, even
temporarily, at least one nginx worker will crash, using 100% cpu time
and never responding to requests again. It used to be that my DNS
server would be automatically restarted, and this problem would not
necessarily affect all workers per incident, i.e. if I have 8 workers,
maybe 1 or 2 would permanently lock up when the DNS server fails. The
rest would continue processing requests once DNS became available, and
the crashed workers would spin 1 core cpu at 100% until I restart
nginx.

Resolving via gethostbyname() during configuration parsing and
resolving via nginx's internal async resolver when using proxy_pass
with variables are quite different things as well. Using the second
one in production isn't a good idea unless really needed.

The internal async resolver is known to have problems, at least in the
stable branch. Several problems were fixed in 0.8.*, but I'm sure this
doesn't cover all of them.

Maxim D.

Thanks for the script.


Alright, that makes sense. I'm more referring to resolving names after
startup, in the process of handling reproxy requests. In that case, I
believe I have to set a resolver in nginx.conf, and if that resolver is
not working right, nginx will often hang and require a restart without
actually crashing; crashing outright would be much better, as my uptime
script would then notice and restart it on its own.

2009/10/22 Maxim D. [email protected]:

Hello!

On Thu, Oct 22, 2009 at 03:53:32PM -0400, masom wrote:

nginx get started before dhcp process is completed
nginx realize that content.dev.local is not reachable (dns settings are not yet set by dhcp)
nginx exits
Browser on unit starts, says address is unreachable (as nginx did not start).

Shouldn’t nginx just attempt to connect to the end point as requests are coming in?

Probably I didn't explain it well enough.

When nginx has something it may attempt to connect to, it will
happily work. But in the case of failed name resolution during
configuration parsing it just doesn't have an IP.

When you write in the config something like

location /pass-to-backend/ {
    proxy_pass http://backend;
}

the hostname "backend" is resolved during config parsing via the standard
gethostbyname() function. This function is blocking and therefore
can't be used during request processing in nginx workers, as it
would block all clients for an unknown period of time. So this
function is only used during config parsing: the hostname "backend"
is resolved to an IP address (or addresses), and later, during request
processing, this IP is used without further DNS lookups.

If "backend" can't be resolved during config parsing there are
basically two options:

  1. Work as is, always returning 502 when a user tries to access a uri
    that should be proxied. We have no IP to connect() to, remember?

  2. Refuse to start, assuming the administrator will fix the problem
    and start us normally.

Option (1) is probably better in situations where you have an
improperly configured system, without any reliability built in,
that has to start unattended at any cost and do at least
something.

But it's not really wise to do (1) in a normal situation. It will
basically start the service in a broken and almost undetectable state.
Consider it being part of a big cluster: a new node comes up and seems
to work, but for some requests it returns errors for no reason.
It's an administrative nightmare.

On the other hand, during reconfiguration, configuration testing,
binary upgrades and other attended operations the only sensible
thing to do is certainly (2). You wrote a hostname in the config that
can't be resolved - it's just a configuration error.

Note well: there is quite a different mode of proxy_pass
operation, proxy_pass with variables, which may use nginx's
internal async resolver. In this mode nginx won't try to
resolve hostnames during configuration parsing, and nginx will
start perfectly well even when DNS isn't available. But this

a) requires additional configuration (you have to configure the IP of
your DNS server via the resolver directive);

b) is much more resource consuming;

c) the internal nginx resolver is known to have problems, at least in
the stable branch.

Therefore I can’t recommend using it in production.
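
Just to make the difference concrete, a minimal sketch of that mode
(the resolver address and hostname are placeholders): because the
hostname comes from a variable, it is resolved at request time through
the resolver directive, not via gethostbyname() at startup.

resolver 192.168.0.1;

location /pass-to-backend/ {
    set $backend_host "backend.example.com";
    proxy_pass http://$backend_host;
}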

The solution we consider is the hosts file that would always point to a static ip for the content server, but would be a little management problem as we are deploying in several different location with different networks.

I don't really understand why not just impose correct
prerequisites before starting nginx. It's not really hard to wait
until the network comes up.
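
For example, something along these lines in the startup script would hold
nginx back until the proxied hostname resolves (the hostname and nginx path
follow the values mentioned in this thread; the 60-second timeout is an
arbitrary choice, and this is only a sketch, not a tested init script):

#!/bin/bash
# wait up to 60 seconds for DNS (i.e. for DHCP to push resolver settings),
# then start nginx regardless
for i in $(seq 1 60); do
    if getent hosts content.dev.local > /dev/null 2>&1; then
        break
    fi
    sleep 1
done
/usr/local/nginx/sbin/nginx

(getent may not exist on a minimal embedded libc; nslookup or host would
do the same job.)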

Maxim D.

Quote:

Note well: there is quite a different mode of proxy_pass
operation, proxy_pass with variables, which may use nginx's
internal async resolver. In this mode nginx won't try to
resolve hostnames during configuration parsing, and nginx will
start perfectly well even when DNS isn't available. But this

a) requires additional configuration (you have to configure the IP of
your DNS server via the resolver directive);

b) is much more resource consuming;

c) the internal nginx resolver is known to have problems, at least in
the stable branch.

Therefore I can't recommend using it in production.


Yes, this is exactly the problem that I am having. I am using nginx to
proxy videos from other video servers, YouTube and others. It works
great so long as resolution doesn't fail. Now, for YouTube in
particular, I normally don't use the resolver anymore: I resolve in my
PHP app using a system nslookup command, request the URL as an IP-based
URL, and set the Host header manually for the connection, such
that nginx connects to the upstream server with the appropriate Host
header. This fairly well emulates a standard HTTP request to the
domain name, without nginx doing resolution. I'm doing this for a
different reason than to work around the above problem, but it does
seem to work around it as a side benefit. I can't at all imagine,
though, that this is an efficient solution (running a shell
command from within PHP to resolve the names, and storing the result
in MySQL for future reference), so I'm only using it for YouTube where
I need to, and not for other video sites that I access.
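
On the nginx side this roughly corresponds to a snippet like the following
(the IP, hostname and location name are made-up placeholders; the real
setup presumably fills them in per request from the PHP app):

location /fetch-video/ {
    # connect by ip, but present the original hostname to the upstream
    proxy_set_header Host "video.example.com";
    proxy_pass http://10.0.0.5;
}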

I guess it makes sense to use just IP-based URLs when your proxy_pass
directive is accessing sites under your control, as you should know
what the IPs are ahead of time, but in my case I am not: I am
accessing arbitrary URLs this way. Are there plans to fix the locking
issue when the DNS server becomes temporarily unavailable, or whatever
the current problems are with the async resolver? Or could nginx at
least die outright rather than staying in a zombified state? As you
said, having some requests randomly fail when the server appears to be
up is an administrative nightmare. Sometimes only one worker will fail
and the others will be fine, making it difficult to notice that you
need to restart nginx. Or at least add support for more than one
resolver line in the config so it can fall back to another resolver if
one is not responding. I would be happy to sponsor any or all of these
developments if anyone is interested in doing the work.

Since we're on the subject of proxy_pass: when I'm doing something like
this and the proxied resource sends a 302, is there a way to have nginx
follow the 302 internally rather than sending the 302 to the user's
browser? I've had issues where I access a YouTube URL and need to
forward the video to the user via the proxy, but YouTube sends a 302,
and nginx passes the 302 to the user rather than passing the video
(located at the redirected address) to the user. I currently try to
follow all redirects in the PHP app before passing the URL off to nginx,
but this is complicated and doesn't work 100% of the time, so if there's
a way to configure nginx to internally follow these redirects, that
would be ideal.
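
For what it's worth, one approach that is sometimes suggested is to
intercept the upstream redirect with error_page and re-proxy to the
Location header; a rough, untested sketch (the upstream name and resolver
address are placeholders, and it relies on proxy_pass with a variable, so
it runs into the resolver caveats Maxim described above):

location /video/ {
    proxy_pass http://video_backend;
    proxy_intercept_errors on;
    error_page 301 302 307 = @follow_redirect;
}

location @follow_redirect {
    resolver 192.168.0.1;
    # re-proxy to wherever the upstream redirected us
    set $redirect_target $upstream_http_location;
    proxy_pass $redirect_target;
}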

-Gabe

Thanks Maxim.

I'll see what voodoo I can come up with for our deployment... or maybe
submit a patch that would allow nginx to start even if
gethostbyname() does not return an IP, and then attempt to resolve it
again later as requests come in.
