Introducing backend healthchecking plugin

I’ve written a plugin that can health check nginx backends, which
everyone is free to use. This is similar to the health-checking features
that Varnish and HAProxy support. Here’s a sample config [1], using the
upstream_hash plugin, just to give you an idea. You can get the
code here [2] and an example of how to patch upstream_hash here [3].
The plugin is actually an optional feature that other upstream plugins,
like upstream_fair or ip_hash for example, can plug into and use. To use
it, their code needs to be modified to also check the health of the
backend.

This plugin is super beta, so please be careful. Feedback/patches
welcome.

[1]
upstream test_upstreams {
  server localhost:11114;
  server localhost:11115;

  hash $filename;
  hash_again 10;
  healthcheck_enabled;
  healthcheck_delay 1000;
  healthcheck_timeout 1000;
  healthcheck_failcount 1;
  healthcheck_expected 'I_AM_ALIVE';
  healthcheck_send "GET /health HTTP/1.1" 'Host: www.mysite.com'
      'Connection: close';
}

[2]

[3]

On Feb 26, Jack Lindamood wrote:

This plugin is super beta, so please be careful. Feedback/patches
welcome.

I have been wanting to write something similar for a long time, so
thanks for getting started.

Does the health check compete with the existing logic to mark an
upstream as up/down? Here is a scenario:

The real traffic goes to this upstream URL "/service/login". My health
check URL is configured as "/hc". Now /hc is always available but
/service/login is throwing up a lot of errors like timeouts, 500s, etc.
for a given upstream server. What will the status eventually be
marked as?

It’s up to the logic in your upstream. The module just runs the health
check. Your upstream module will have to be changed to query the status
of the healthcheck and decide whether it should continue anyway or pick
another upstream server.
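
To illustrate, here is a rough sketch of what that query could look like
in an upstream module’s get-peer callback. The accessor
ngx_http_healthcheck_is_down() and the peer structs below are placeholders
for illustration, not necessarily the plugin’s actual API:

/*
 * Sketch only: an upstream module consulting the healthcheck module
 * before handing a peer back to nginx.
 */

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* assumed accessor exported by the healthcheck module */
ngx_int_t ngx_http_healthcheck_is_down(ngx_uint_t index, ngx_log_t *log);

typedef struct {
    struct sockaddr  *sockaddr;
    socklen_t         socklen;
    ngx_str_t         name;
    ngx_uint_t        health_index;  /* slot registered with the healthcheck module */
} my_peer_t;

typedef struct {
    ngx_uint_t        number;        /* number of servers in the upstream block */
    ngx_uint_t        current;       /* where the last pick left off */
    my_peer_t        *peer;
} my_peer_data_t;

static ngx_int_t
my_upstream_get_peer(ngx_peer_connection_t *pc, void *data)
{
    my_peer_data_t  *pd = data;
    my_peer_t       *peer;
    ngx_uint_t       i;

    for (i = 0; i < pd->number; i++) {
        peer = &pd->peer[(pd->current + i) % pd->number];

        /* skip peers the out-of-band healthcheck currently marks as down */
        if (ngx_http_healthcheck_is_down(peer->health_index, pc->log)) {
            continue;
        }

        pc->sockaddr = peer->sockaddr;
        pc->socklen  = peer->socklen;
        pc->name     = &peer->name;

        pd->current = (pd->current + i + 1) % pd->number;
        return NGX_OK;
    }

    /* every peer looks unhealthy; fail and let proxy_next_upstream decide */
    return NGX_BUSY;
}

If every peer is unhealthy the sketch returns NGX_BUSY, so the usual
error handling in nginx still applies.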


Did you look at ngx_supervisord[1]?

Do you mean use only ngx_supervisord? From what I can tell, it requires
a separate daemon running on the machine that does the healthcheck,
which then sends a call to nginx via supervisord to turn servers off or
on. Is that correct? The goal was to build the checking into nginx so
that we don’t have to monitor another process, making the system more
stable. If nginx is running, then you’re guaranteed the health checks
are running.

Or did you mean call into ngx_supervisord when a healthcheck fails? I’m
happy to integrate other APIs if you have some ideas.


On Tue, Mar 02, 2010 at 03:34:36AM -0500, cep221 wrote:

Did you look at ngx_supervisord[1]?

Do you mean use only ngx_supervisord? From what I can tell, it requires a separate daemon running on the machine that does the healthcheck, which then sends a call to nginx via supervisord to turn servers off or on. Is that correct? The goal was to build the checking into nginx so that we don’t have to monitor another process, making the system more stable. If nginx is running, then you’re guaranteed the health checks are running.

Or did you mean call into ngx_supervisord when a healthcheck fails? I’m happy to integrate other APIs if you have some ideas.

I mean using the same API, so that patching a balancer for
ngx_supervisord support would automatically make it support your
healthcheck module. This means that the healthcheck module would get
passed a callback which should be executed with the specified parameters
when it detects a failed backend.

See:

(grep for ngx_http_upstream_backend_monitor).
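
Very roughly, the shape of the hook is something like this (the names
below are made up for illustration; the real signatures are in the
sources mentioned above):

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* a balancer (or the supervisord glue) registers this with the healthcheck module */
typedef void (*backend_monitor_pt)(ngx_http_upstream_srv_conf_t *uscf,
    ngx_uint_t backend, ngx_uint_t new_status);

static backend_monitor_pt  monitor_cb;   /* set at init time */

/* called by the healthcheck module whenever a backend changes state */
static void
healthcheck_notify(ngx_http_upstream_srv_conf_t *uscf, ngx_uint_t backend,
    ngx_uint_t healthy)
{
    if (monitor_cb) {
        monitor_cb(uscf, backend, healthy);
    }
}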

I don’t expect you to use ngx_supervisord itself (or supervisord), that
would indeed be quite pointless :)

@Piotr: if the health check plugin adapts to ngx_supervisord API, I
guess the API spec could use s/_supervisord//ig, no? A bunch of #defines
should suffice.
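
i.e. something along these lines (the generic names are hypothetical):

/* hypothetical generic names mapped onto the current ngx_supervisord ones */
#define ngx_upstream_backend_execute     ngx_supervisord_execute
#define NGX_UPSTREAM_BACKEND_CMD_START   NGX_SUPERVISORD_CMD_START
#define NGX_UPSTREAM_BACKEND_CMD_STOP    NGX_SUPERVISORD_CMD_STOP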

Best regards,
Grzegorz N.

Do you mean use only ngx_supervisord? From what I can tell, it requires a
separate daemon running on the machine that does the healthcheck, which
then sends a call to nginx via supervisord to turn servers off or on. Is
that correct?

No, it’s not :P

Neither ngx_supervisord nor supervisord does any health-checking.
ngx_supervisord communicates with supervisord (a process manager) to
dynamically start or stop backend servers, depending on the load. Since
version 1.3 it also supports a supervisord-less configuration which
disables all communication with the supervisord daemon (so it basically
takes backend servers out of rotation without the need to reload nginx).

Or did you mean call into ngx_supervisord when a healthcheck fails? I’m
happy to integrate other APIs if you have some ideas.

Yeah, that’s probably what Grzegorz meant. You would just need to call
ngx_supervisord_execute(uscf, NGX_SUPERVISORD_CMD_STOP, backend_number,
NULL) and then all ngx_supervisord-aware load balancers (upstream_fair,
round_robin & ip_hash) would automagically stop using the failed backend
until you execute NGX_SUPERVISORD_CMD_START.
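
On the healthcheck module’s side that would boil down to something like
this (only the ngx_supervisord_execute() call is the actual API; the
surrounding function and the ngx_supervisord.h include are illustrative):

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>
#include <ngx_supervisord.h>

static void
healthcheck_state_changed(ngx_http_upstream_srv_conf_t *uscf,
    ngx_uint_t backend_number, ngx_uint_t healthy)
{
    if (healthy) {
        /* backend passes its checks again: put it back into rotation */
        (void) ngx_supervisord_execute(uscf, NGX_SUPERVISORD_CMD_START,
                                       backend_number, NULL);

    } else {
        /* healthcheck_failcount reached: take the backend out */
        (void) ngx_supervisord_execute(uscf, NGX_SUPERVISORD_CMD_STOP,
                                       backend_number, NULL);
    }
}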

Full API spec is available at:

At the moment one would need to specify "supervisord none;" in order to
enable the supervisord-less configuration, because there is no such call
in the API, but I could add one in the next release if you would like to
use it.

Best regards,
Piotr S. < [email protected] >

@Piotr: if the health check plugin adapts to ngx_supervisord API, I
guess the API spec could use s/_supervisord//ig, no? A bunch of #defines
should suffice.

Sure, why not ;)

Best regards,
Piotr S. < [email protected] >

Hi,

(piggybacking on Arvind’s mail because I don’t seem to have the original
post).

On Mon, Mar 01, 2010 at 04:18:32PM +0530, Arvind Jayaprakash wrote:

This plugin is super beta, so please be careful. Feedback/patches
welcome.

Did you look at ngx_supervisord[1]? It serves a roughly similar purpose
(more direct interaction of load balancers with the outside world) and
it also requires patches to the load balancer. So maybe we could kill
two birds with one stone and use the same API.

[1] FRiCKLE Labs / nginx / ngx_supervisord

Best regards,
Grzegorz N.

On Tue, Mar 2, 2010 at 10:17 AM, Arvind Jayaprakash
[email protected] wrote:

(2) In addition, the health-check module provides an out-of-band health
check mechanism wherein it periodically polls a specific URL and uses
the HTTP status/body to determine if an upstream needs to be marked as
up or down.

+1

This has been something on my mind that will help advance nginx
further as a load balancing option.

On Mar 02, Piotr S. wrote:

enable the supervisord-less configuration, because there is no such call in the API,
but I could add one in the next release if you would like to use it.

This sounds like what I needed. So let me rephrase the problem in its
entirety:

(1) Under normal circumstances, nginx would use proxy_next_upstream in
conjunction with max_fails and fail_timeout (for the rr module) to
declare an upstream as up or down. This is an inline check since it is
monitoring real traffic.

(2) In addition, the health-check module provides an out-of-band health
check mechanism wherein it periodically polls a specific URL and uses
the HTTP status/body to determine if an upstream needs to be marked as
up or down.

Both these styles have their own benefits.

The first style keeps track of the health by looking at the responses of
actual requests. This is important, since a health check URL does not
automatically indicate the health of your real application.

The second style is needed in cases where we plan to do some maintenance
activity on an upstream server and want to proactively not send traffic
to it. A typical example is when you want to push new software, check
with a couple of requests to see if your app is behaving well, and if all
looks fine, direct traffic to it.

In the absence of a priority between the two styles of checking, we could
end up with a flapping upstream status. The logical priority seems to be
that #2 wins over #1. So, if the health check URL says an upstream
server is down, no traffic should be sent its way and the health status
evaluation based on style #1 should be ignored. If the health check
deems an upstream to be up, then the outcome of #1 is the final status.

So where do we get these features from:

  • #1 is provided by the stock upstream modules
  • #2 is provided by the health check module
  • the ngx_supervisord module seems to have the hooks that will let us
    achieve the prioritization once the health-check module uses this
    feature

To get all of this running, we would need two patches to the upstream
module: one for supervisord and the other for health-check; the
health-check module itself will have to invoke ngx_supervisord_execute
to mark an upstream as up or down.

I will not have time before this weekend to get started on merging
these; so if someone gets down to doing it earlier, thanks :)

The second style is needed in cases where we plan to do some maintenance
activity on an upstream server and want to proactively not send traffic
to it. A typical example is when you want to push new software, check
with a couple of requests to see if your app is behaving well, and if all
looks fine, direct traffic to it.

Actually, you don’t need health-checks for that, ngx_supervisord provides
that functionality out-of-the-box. If you plan on taking servers down for
maintenance, you can use the supervisord_stop and supervisord_start
handlers. Please take a look at example configuration #2, because it does
exactly that:
http://labs.frickle.com/nginx_ngx_supervisord/README

In the absence of a priority between the two styles of checking, we could
end up with a flapping upstream status. The logical priority seems to be
that #2 wins over #1. So, if the health check URL says an upstream
server is down, no traffic should be sent its way and the health status
evaluation based on style #1 should be ignored. If the health check
deems an upstream to be up, then the outcome of #1 is the final status.

In-line checks are considered failures. When you take a server down with
ngx_supervisord, it’s marked down (the same way as if you had added
"down" in your configuration and reloaded nginx), so it takes priority.

I will not have time before this weekend to get started on merging
these; so if someone gets down to doing it earlier, thanks :)

I’ll try to find some time today, can’t promise anything though.

Best regards,
Piotr S. < [email protected] >

On Mar 02, Piotr S. wrote:

that:
http://labs.frickle.com/nginx_ngx_supervisord/README

The only philosophical objection I have to this style is having
something to do on the LB servers to change an upstream’s status. I
wanted a facility wherein something can be done on the upstream server
(rm the health check file) to take it out of service.

I come from a world wherein any sort of access on the LB servers is
tightly controlled, and the people managing the upstreams (application
servers) can plan maintenance without ever involving the LB folks.

Am also not a fan of the supervisord_start/supervisord_stop directives
since managing the security aspects of it becomes a hassle.

I am however a fan of supervisord_inherit_backend_status which is why I
wanted to integrate it with the health-check plugin :)

I’ll try to find some time today, can’t promise anything though.

It took a little more time than expected, but I just pushed a modified
version of the health-check module into my temporary repository:

It communicates with ngx_supervisord via its API, which means that
servers are taken out of the rotation when a healthcheck fails.

DISCLAIMER: the health-check module doesn’t work for me at all (checks
always time out, and because of that servers are taken out of the
rotation by ngx_supervisord), but if it works for you then this modified
version should work as well.

Best regards,
Piotr S. < [email protected] >

DISCLAIMER: Health-check module doesn’t work for me at all

To redeem the module, I checked over your config and it was missing the
“Host:” header in your healthcheck_send. Try using something similar to
the sample config:

healthcheck_send "GET /health HTTP/1.1" 'Host: www.ahost.com' 

‘Connection: close’;


since managing the security aspects of it becomes a hassle.

I am however a fan of supervisord_inherit_backend_status which is why I
wanted to integrate it with the health-check plugin :)

I can see your point.

The question which arises now is: should health-check or any other
ngx_supervisord-aware load balancer be able to "enable" a backend server
which was administratively taken out of the rotation with
"server A.B.C.D down;" in nginx.conf?

Best regards,
Piotr S. < [email protected] >

Hello Jack,

To redeem the module, I checked over your config and it was missing the
“Host:” header in your healthcheck_send. Try using something similar to
the sample config:

healthcheck_send "GET /health HTTP/1.1" 'Host: www.ahost.com'
    'Connection: close';

No, it wasn’t. A missing header wouldn’t yield a timeout. But just for
the sake of it, I tested it with the above line and it didn’t help.

Like I said yesterday, I don’t really have time to fully investigate this
right now, but I’ll try to narrow down / fix the problem over the
weekend.

Best regards,
Piotr S. < [email protected] >