Is it possible to monitor the fair proxy balancer?

Periodically one or more of my mongrel instances will stop getting
requests from nginx (via upstream fair). The mongrel process is still
running, but not getting any requests.

How can I verify if nginx has taken it out of service? Is it possible
to get details on the current status of the fair proxy?
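One quick check from the shell while the problem is happening (the port is an assumption; substitute the idle mongrel's actual port) is whether any connections are reaching that instance at all:

```shell
# Count TCP connections currently open to the suspect mongrel.
# PORT is an assumption -- replace with the port of the idle instance.
PORT=8000
COUNT=$(netstat -tn 2>/dev/null | grep -c ":$PORT " || true)
echo "connections to :$PORT -> $COUNT"
```

If the count stays at zero while the other instances are busy, nginx has effectively stopped sending it traffic.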

I also see the following error in syslog, but I’m unsure if it is
related…

nginx[17280]: segfault at 00007fffa0869fd0 rip 00002ac509ea61e3 rsp
00007fffa0869ed0 error 6

Robbie

On Sat, Jun 28, 2008 at 2:23 AM, Robbie A. [email protected]
wrote:

Periodically one or more of my mongrel instances will stop getting
requests from nginx (via upstream fair). The mongrel process is still
running, but not getting any requests.

Our experience is that the “fair” load balancer is unstable under
heavy load, and tends to gradually pull upstreams out of the pool.
This may be what you are experiencing.

How can I verify if nginx has taken it out of service? Is it possible
to get details on the current status of the fair proxy?

Not at the moment. If you want this kind of information, I recommend
HAProxy. In addition to providing machine-readable per-backend stats,
it also renders the same information as an HTML page:

http://www.igvita.com/posts/05-08/haproxy-large.png
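For reference, enabling that page takes only a small fragment in haproxy.cfg; this is a sketch (the listen address and URI are arbitrary choices):

```
# haproxy.cfg sketch -- address and uri are arbitrary
listen stats 0.0.0.0:8080
    mode http
    stats enable
    stats uri /haproxy-status
    # append ";csv" to the URI for the machine-readable version
```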

I also see the following error in syslog, but I’m unsure if it is
related…

nginx[17280]: segfault at 00007fffa0869fd0 rip 00002ac509ea61e3 rsp
00007fffa0869ed0 error 6

That’s a segmentation fault, i.e., Nginx is crashing on you.

Alexander.

Guess I’ll have to start looking at HAProxy, but
I’d rather not.

nginx is a very stable, flexible and powerful server for sure. Hopefully
you can stick with it.

The issue of not having any direct stats/reporting about the upstream
status keeps coming up over and over. I think it is the only significant
drawback and would love to see it addressed. Unfortunately folks tend
to post that such monitoring is not the job of nginx. However, I disagree:

  • nginx is the one that pulls an upstream out of rotation, and at a
    minimum it should have a way to signal another process. It would be
    great to see this addressed once and for all - hopefully one of
    these days it will get the priority and attention it deserves, IMO.

Alexander S. wrote:

On Sat, Jun 28, 2008 at 2:23 AM, Robbie A. [email protected]
wrote:
Our experience is that the “fair” load balancer is unstable under
heavy load, and tends to gradually pull upstreams out of the pool.
This may be what you are experiencing.

Ok.

Not at the moment. If you want this kind of information, I recommend
HAProxy. In addition to providing machine-readable per-backend stats,
it also renders the same information as an HTML page:

http://www.igvita.com/posts/05-08/haproxy-large.png

Kind of a big limitation with fair if it provides ZERO instrumentation.

I also see the following error in syslog, but I’m unsure if it is
related…

nginx[17280]: segfault at 00007fffa0869fd0 rip 00002ac509ea61e3 rsp
00007fffa0869ed0 error 6

That’s a segmentation fault, i.e., Nginx is crashing on you.

Alexander.

Yeah, I just don’t know if it is related to fair. Perhaps the debug log
will help.

Guess I’ll have to start looking at HAProxy, but I’d rather not.

Robbie

To me, nginx just needs a simple API that could be worked with:

a) for reporting (I’d like to see bytes and requests per each Host:
header) and for the below stuff…

b) for dynamic addition/removal of upstreams (external healthchecking
scripts can tell nginx when to add/remove upstreams) - this seems to
be coming up a lot too

Hi,

On Sat, Jun 28, 2008 at 02:23:02 +0200, Robbie A. wrote:

Periodically one or more of my mongrel instances will stop getting
requests from nginx (via upstream fair). The mongrel process is still
running, but not getting any requests.

The standard question: are you running the latest snapshot? (there was
an update about a week ago). If not, please give it a try.

I guess I should start versioning the module ;-)

How can I verify if nginx has taken it out of service? Is it possible
to get details on the current status of the fair proxy?

No, not really. This is something I’d like to do but currently there’s
no support for pluggable status reports and I think that writing what
would effectively become another module for monitoring upstream_fair is
slightly overkill.

--with-debug and debug_http may help, but you’d probably drown in the
massive amount of logs (not only from upstream_fair; nginx itself is
very chatty too).
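If you do go that route, the debug_http mask at least narrows the output to HTTP-level messages; a sketch (the log path is an assumption):

```nginx
# requires an nginx binary built with --with-debug
error_log /var/log/nginx/error.log debug_http;
```

Grepping that log for “upstream” afterwards is still the practical way to pick out the balancer’s decisions.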

I also see the following error in syslog, but I’m unsure if it is
related…

nginx[17280]: segfault at 00007fffa0869fd0 rip 00002ac509ea61e3 rsp
00007fffa0869ed0 error 6

That one looks strange. It’s a segfault while accessing the stack
above the stack pointer, which should be legal, unless something has
just allocated at least 304 bytes of stack space and overflowed it.

I can’t see any large stack allocations in upstream_fair (though I may
have overlooked something), so it may come from nginx itself as well.

Please try increasing the stack size (ulimit -s in your
nginx startup script).
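Something like the following near the top of the init script, before the daemon is started; 16384 KB is an arbitrary value, just comfortably above the common 8192 KB default:

```shell
# Show the current soft stack limit (in KB), then raise it for the
# processes started from this script. 16384 is an arbitrary choice.
CUR=$(ulimit -s)
echo "current stack limit: ${CUR}KB"
ulimit -s 16384 2>/dev/null || echo "could not raise limit past hard cap"
```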

If such a segfault repeats (or if you haven’t restarted nginx since
then; reloads are fine), please collect:

  • result of pmap <pid-of-any-nginx-worker> (they should all have the
    same memory map), or alternatively cat /proc/<pid>/maps
  • the log line with the faulting address, rip and rsp (like above)
  • your nginx binary

The above information should prove helpful while tracing the cause of
the crash.
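The collection steps above can be scripted; a sketch, assuming the usual worker process title and binary path (adjust both to your install):

```shell
# Gather crash-triage data for every nginx worker into one directory.
OUT=nginx-crash-report
mkdir -p "$OUT"
for pid in $(pgrep -f 'nginx: worker' 2>/dev/null || true); do
    # memory map of the worker (pmap and /proc give the same information)
    pmap "$pid" > "$OUT/pmap.$pid" 2>/dev/null \
        || cat "/proc/$pid/maps" > "$OUT/maps.$pid" 2>/dev/null
done
# keep the exact binary that crashed; the path is an assumption
cp /usr/sbin/nginx "$OUT/" 2>/dev/null || true
```

Together with the syslog line carrying the faulting address, rip and rsp, that covers everything listed above.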

Best regards,
Grzegorz N.

On 6/28/08, Grzegorz N. [email protected] wrote:

However, there still remains the issue of communication between the load
balancer and the outside world, i.e. how would you like to be told
that a backend has been deemed up/down and how would you like to tell
nginx that backend 1.2.3.4 is currently down?

Simple healthchecking is fine with me. Like ldirectord - I have it
request a PHP file every few seconds (I forget what it’s set at) - if
it fails (i.e. the PHP file does not return the expected result) then
$EVENT occurs - which could be as simple as saying “hey, nginx - this
one is down. just stop trying to use it”

As for “how” I don’t know. Perhaps like the nginx-status thing,
there’s an nginx-api URI defined that accepts pre-defined http auth
and a few basic REST-style commands to control it. Or communication
over a socket, or a special TCP port… that’s up to the developers
who understand the architecture better than I do as to what makes the
most sense with nginx :) I don’t really care as long as it is easy to
interact with and doesn’t require C knowledge and linked libraries to
use it :P

As for dynamically adding/removing backends, mentioned elsethread, it
isn’t trivial as it would basically require restarting nginx workers anyway
(at least for upstream_fair, which keeps its state in shared memory).
Disabling/enabling predefined backends would be fine though.

I suppose having a list of -all- available servers set up in the config
and using the API I mentioned would be fine too.

I mean right now this can be done with a hack, by pulling out the list
of upstreams into an include file, and writing to that include file
and sending nginx the appropriate HUP/etc signal to re-read the
config, but that seems a bit messy.
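For completeness, the hack looks roughly like this (file paths, ports, and the pid-file location are assumptions):

```shell
# Regenerate the include file that holds the upstream list, then
# ask nginx to re-read its configuration. Paths/ports are assumptions.
UPSTREAMS=upstreams.conf
cat > "$UPSTREAMS" <<'EOF'
# included from the upstream{} block in nginx.conf
server 127.0.0.1:8000;
server 127.0.0.1:8001;
EOF
# kill -HUP "$(cat /var/run/nginx.pid)"   # uncomment on a real host
echo "wrote $(grep -c '^server' "$UPSTREAMS") servers to $UPSTREAMS"
```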

On Sat, Jun 28, 2008 at 03:08:46 +0200, Robbie A. wrote:

Kind of a big limitation with fair if it provides ZERO instrumentation.

I’ll try to hack something up to provide upstream_fair statistics but
it’ll require a patch to nginx (i.e. --add-module won’t be enough).
Hopefully Igor agrees to incorporate it.

As for dynamically adding/removing backends, mentioned elsethread, it
isn’t trivial as it would basically require restarting nginx workers
anyway (at least for upstream_fair, which keeps its state in shared
memory). Disabling/enabling predefined backends would be fine though.

However, there still remains the issue of communication between the load
balancer and the outside world, i.e. how would you like to be told
that a backend has been deemed up/down and how would you like to tell
nginx that backend 1.2.3.4 is currently down?

Also, while I have your attention ;-)

Alex complained in his blog post that upstream_fair does not provide a
way to limit the maximum number of requests per backend. As parsing of
‘server’ directives in upstream{} blocks is done by nginx, I cannot
easily add options there, so I see two possibilities:

  • a new option, e.g. max_requests 10 10 20 20 (specifying the number
    for each backend in the order of server directives)

  • overloading (with old/new/both behaviours possibly selectable by a
    per-upstream flag) the meaning of weight=X parameter

So, what say you, is such a feature (amounting to returning 502 errors
after a certain number of concurrent requests is reached) generally
desired? If so, how would you like to configure it?
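To make the two options concrete, here is how each might look in an upstream{} block; both syntaxes are proposals, not anything nginx or upstream_fair accepts today:

```nginx
upstream mongrels {
    fair;
    # option 1 (hypothetical): positional list, one value per server
    # max_requests 10 10 20 20;
    # option 2 (hypothetical): weight=X reinterpreted as a request cap
    server 127.0.0.1:8000 weight=10;
    server 127.0.0.1:8001 weight=20;
}
```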

BTW, I seem to receive more flak about upstream_fair recently, are you
people starting to use it? :)

Best regards,
Grzegorz N.

On Sat, Jun 28, 2008 at 2:31 PM, Grzegorz N.
[email protected] wrote:

However, there still remains the issue of communication between the load
balancer and the outside world, i.e. how would you like to be told
that a backend has been deemed up/down

HAProxy – apologies for having to mention it again, but it’s a useful
template – has a simple status page similar to Nginx’s stub status.
It comes in HTML and CSV formats, and lists all backends (and
frontends) and their status (up, down, going down, going up) and a ton
of metrics (current number of connections, number of bytes transferred,
error count, retry count, and so on). It can also export the same
information on a secure domain socket if you don’t want to go through
HTTP.
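The socket variant is a one-line directive in haproxy.cfg plus any UNIX-socket client on the other end; a sketch (the socket path is an assumption, and “show stat” requires a HAProxy version with stats-socket support):

```
# in haproxy.cfg (global section); path is an assumption
#   stats socket /var/run/haproxy.sock
# then query it with any UNIX-socket client, e.g.:
#   echo "show stat" | socat stdio unix-connect:/var/run/haproxy.sock
```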

and how would you like to tell
nginx that backend 1.2.3.4 is currently down?

Pardon me for asking a naive question, but to change the list of
backends, would you not simply edit the config file and do a SIGHUP? It
would reset whatever internal structures are kept by the workers,
but I can’t think of anything that’s not okay to lose.

  • a new option, e.g. max_requests 10 10 20 20 (specifying the number
    for each backend in the order of server directives)

That’s a horrible syntax and one that is going to cause problems as
you add or remove backends from the config. A max_requests setting
belongs on each backend declaration.

So, what say you, is such a feature (amounting to returning 502 errors
after a certain number of concurrent requests is reached) generally
desired? If so, how would you like to configure it?

You should only return an error if a request cannot be served within a
given timeout, not when all backends are full.

Alexander.

On Sat, Jun 28, 2008 at 09:54:06PM +0200, Grzegorz N. wrote:

The question wasn’t really caused by the perceived impossibility of the
task :) A status page is certainly simple enough (i.e. fits in the nginx
model somewhat), though it has the disadvantage that you have to poll it
periodically. I don’t think that a dedicated socket for querying
backends is a good design for nginx, so I’d like to gather ideas about
how to notify the outside world. A log message? Sending a signal
somewhere? An SNMP trap? Every way has its advantages and disadvantages,
so I’d like to pick the one that sucks the least.

I don’t know anything about nginx internals, but I’d imagine one
possible way is a control socket through which the controlling program
connects to nginx (it can be used for two-way communication).

On Sat, Jun 28, 2008 at 09:28:45 +0200, Alexander S. wrote:

of metrics (current number of connections, number of bytes transferred,
error count, retry count, and so on). It can also export the same
information on a secure domain socket if you don’t want to go through
HTTP.

The question wasn’t really caused by the perceived impossibility of the
task :) A status page is certainly simple enough (i.e. fits in the nginx
model somewhat), though it has the disadvantage that you have to poll it
periodically. I don’t think that a dedicated socket for querying
backends is a good design for nginx, so I’d like to gather ideas about
how to notify the outside world. A log message? Sending a signal
somewhere? An SNMP trap? Every way has its advantages and disadvantages,
so I’d like to pick the one that sucks the least.

and how would you like to tell
nginx that backend 1.2.3.4 is currently down?

Pardon me for asking a naive question, but to change the list of
backends, would you not simply edit the config file and do a SIGHUP? It
would reset whatever internal structures are kept by the workers,
but I can’t think of anything that’s not okay to lose.

Yes. That’s the obvious solution but apparently not always acceptable,
especially when you’d want to use an external monitoring system to do
this automatically.

  • a new option, e.g. max_requests 10 10 20 20 (specifying the number
    for each backend in the order of server directives)

That’s a horrible syntax and one that is going to cause problems as
you add or remove backends from the config. A max_requests setting
belongs on each backend declaration.

Like I wrote in the snipped part, I cannot easily add options to the
server directives (at least without patching nginx or reinventing the
square wheel). I don’t like the max_requests idea either, for precisely
the same reason. I presume that means the overloading of weight=X is at
least acceptable.

So, what say you, is such a feature (amounting to returning 502 errors
after a certain number of concurrent requests is reached) generally
desired? If so, how would you like to configure it?

You should only return an error if a request cannot be served within a
given timeout, not when all backends are full.

Will have to think about it. This has the potential of busy-looping when
all the backends are indeed full (or down, but then one can just send a
hard error and be done with it). I don’t think nginx has a way to be
told “everything is unavailable now, come back to me in a second or
two” or even better “I’ll tell you when to ask me again”.

Best regards,
Grzegorz N.

On Sat, Jun 28, 2008 at 9:54 PM, Grzegorz N.
[email protected] wrote:

I’d like to gather ideas about
how to notify the outside world. A log message? Sending a signal
somewhere? An SNMP trap? Every way has its advantages and disadvantages,
so I’d like to pick the one that sucks the least.

Why just one? A status page supplemented by machine-readable log
output is a good solution that I think would satisfy most sysadmins.

Pardon me for asking a naive question, but to change the list of
backends, would you not simply edit the config file and do a SIGHUP? It
would reset whatever internal structures are kept by the workers,
but I can’t think of anything that’s not okay to lose.

Yes. That’s the obvious solution but apparently not always acceptable,
especially when you’d want to use an external monitoring system to do
this automatically.

What’s simpler for an external monitoring system than sending a signal
to a process?

Of course, you could go all the way and do a Varnish-style admin
interface. I have mentioned Varnish before on this list. Varnish has a
pretty clever admin/monitoring infrastructure. For example, you can
load multiple configs and selectively enable them:

$ varnishadm vcl.load test /etc/varnish/test.vcl
$ varnishadm vcl.use test

… something goes horribly wrong …

$ varnishadm vcl.use boot

The use of named configs means the input can be anything (even your
default set of config files). You can load it, try it out, and unload
it.

You could do worse than looking at Varnish’s logging system for ideas.
Varnish uses circular buffers in shared memory for logging, and its
logs are explicitly machine-readable, each line being a tag followed
by a value. So log output looks like this:

14 Debug c “Hash Match:
/-/cache/border/w=6;h=6;sw=true;sx=0;sy=3;sbr=10;sbs=5;sm=10;sp=0;c=fff;t=r_24.png#origo.no#”
14 Hit c 1402130806
14 VCL_call c hit
14 VCL_return c deliver
14 Length c 217
14 VCL_call c deliver
14 VCL_return c deliver
14 TxProtocol c HTTP/1.1
14 TxStatus c 200
14 TxResponse c OK
14 TxHeader c Status: 200 OK

and so on.

In addition to making it superbly easy for scripts to graph, analyze
and monitor activity in real time, this lets you tail the log for
specific events or strings, and since it’s all RAM-based, you can get
real-time, low-overhead debug log output immediately without changing
any configuration settings or reloading the daemon. As far as I know,
Varnish only logs when you listen to log output and filtered by what
you’re listening for, but I could be wrong.

Using shared memory with Nginx’s worker process model should not pose
any problems as each worker could maintain its own shared memory and
thus avoid the need for locking.

same reason. I presume that means the overloading of weight=X is at
least acceptable.

I think you have to push Igor for a more flexible internal
infrastructure. :)

Even something string-based would work, even if it would be hackier
than a true syntax:

server 127.0.0.1:10000 option = [option …];

E.g.,

server 127.0.0.1:10000 option fair.max_conns=5;

You should only return an error if a request cannot be served within a
given timeout, not when all backends are full.

Will have to think about it. This has the potential of busy-looping when
all the backends are indeed full (or down, but then one can just send a
hard error and be done with it). I don’t think nginx has a way to be
told “everything is unavailable now, come back to me in a second or
two” or even better “I’ll tell you when to ask me again”.

I think Nginx needs something like this.

Alexander.