Feature request: monitoring Nginx from the outside

I’ve almost finished the module (read: compilation reports no error) and want to
take some time to discuss some potential issues.

  1. Interaction with Nginx

I try not to affect Nginx, but some files need to be modified since this module
offers a very simple API to other modules. So there are two paths: a mandatory
module that is enabled or not by the config file, or conditional compilation.
I believe conditional compilation is the norm, but please let me know.

To access the shared memory I want to create a file, nginx.monitoring, at the
same location as nginx.pid, then do an ftok() on this file to get the System V
ID so that another process can attach to the shared memory. But as I’m a Linux
guy I don’t know whether *BSD, Mac OS X and Solaris would be OK with this. I
also don’t know whether all scripting languages can use System V calls.

So far I think it is clean, but if you have a better suggestion you’re welcome.
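For the record, here is a minimal C sketch of what I have in mind on the Nginx side (the segment size and project id are placeholders):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <fcntl.h>
#include <unistd.h>

#define MONITORING_SHM_SIZE  4096   /* placeholder size */
#define MONITORING_PROJ_ID   'M'    /* placeholder project id for ftok() */

/* Create the key file next to nginx.pid, then create and attach the segment. */
static void *
create_monitoring_shm(const char *key_path)
{
    int     fd, shm_id;
    key_t   key;
    void   *mem;

    fd = open(key_path, O_CREAT | O_RDONLY, 0644);
    if (fd == -1) return NULL;
    close(fd);

    key = ftok(key_path, MONITORING_PROJ_ID);
    if (key == (key_t) -1) return NULL;

    shm_id = shmget(key, MONITORING_SHM_SIZE, IPC_CREAT | 0644);
    if (shm_id == -1) return NULL;

    mem = shmat(shm_id, NULL, 0);
    return (mem == (void *) -1) ? NULL : mem;
}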

  2. API

The API consists of one call:

void ngx_register_monitoring_value(ngx_str_t *name, ngx_atomic_t *value);

This function should be called by an Nginx module during its initialization to
tell the monitoring module that this variable should be added to the pool of
watched values.
That’s all. The module just continues to do what it wants with its ngx_atomic_t
value; I will do the rest.
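For example, a module that wants one of its counters watched might do something like this during initialization (the module and variable names are invented here, just for illustration):

#include <ngx_config.h>
#include <ngx_core.h>

/* Hypothetical module exposing one counter to the monitoring pool. */
static ngx_atomic_t  my_requests;

static ngx_int_t
ngx_http_mymodule_init_process(ngx_cycle_t *cycle)
{
    ngx_str_t  name = ngx_string("mymodule_requests");

    ngx_register_monitoring_value(&name, &my_requests);

    return NGX_OK;
}

/* ... and wherever the counter changes: */
/* ngx_atomic_fetch_add(&my_requests, 1); */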

I may have some issues because my module has three states (initializing,
registering, running), so the order of modules may be problematic.

  3. Configuration

At the top level just put ‘monitoring ;’. But some modules offering variables
to be monitored may propose configuration options like:

monitor_timeout_counter on ;

and call ngx_register_monitoring_value() or not according to the setting.
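Put together, the relevant part of nginx.conf would then simply look like:

# enable the monitoring shared memory
monitoring ;

# let the module register this particular counter
monitor_timeout_counter on ;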

  4. Scripts

A script should have read access to where nginx.pid is and will do the
following
pseudo code:

shm_id = ftok (“path.to.nginx.monitoring”) ;
mem = attach_to_shared_mem (shm_id) ;
values = parse (mem)
do_what_you_want_with_values (values)

for example: update RRD databases, look for an abnormal error count, or search
for an unexpected event…

It’s no longer my problem :slight_smile:
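For reference, a minimal C sketch of such a reader could look like this (the key file path and project id are placeholders and would have to match whatever the module uses):

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MONITORING_KEY_FILE  "/usr/local/nginx/logs/nginx.monitoring"  /* placeholder */
#define MONITORING_PROJ_ID   'M'                                       /* placeholder */

int main(void)
{
    key_t            key;
    int              shm_id;
    char            *mem;
    struct shmid_ds  ds;

    key = ftok(MONITORING_KEY_FILE, MONITORING_PROJ_ID);
    if (key == (key_t) -1) { perror("ftok"); return 1; }

    shm_id = shmget(key, 0, 0);                 /* look up the existing segment */
    if (shm_id == -1) { perror("shmget"); return 1; }

    mem = shmat(shm_id, NULL, SHM_RDONLY);      /* read-only attach */
    if (mem == (void *) -1) { perror("shmat"); return 1; }

    if (shmctl(shm_id, IPC_STAT, &ds) == -1) { perror("shmctl"); return 1; }

    /* "parse" and "do_what_you_want_with_values" would go here;
       for now just dump the raw snapshot. */
    fwrite(mem, 1, ds.shm_segsz, stdout);

    shmdt(mem);
    return 0;
}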

  5. Conclusion

Dear list, tell me what you think and I will soon release this module (under a
BSD-like licence, of course) for review and comments.

Best regards.

François Battail wrote:

I’ve almost finished the module (read: compilation reports no error) and want to
take some time to discuss some potential issues.

Good.

  1. Interaction with Nginx

[…]

value, I will do the rest.

I may have some issues because my module has three states (initializing,
registering, running), so the order of modules may be problematic.

When does the running phase actually happen?
At the end of the HTTP request? At regular intervals?

Note that since you don’t know when a variable will be updated, you may
miss important changes in the variable.

As an example, in the upstream module if a peer is down, the variable
peer.down will be set to 1, but if all the peers are down, then all
the variables are set to 0, for a fast restart on the next request.

  3. Configuration

[…]

  4. Scripts

A script should have read access to where nginx.pid is and will do the following
pseudo code:

shm_id = ftok (“path.to.nginx.monitoring”) ;
mem = attach_to_shared_mem (shm_id) ;
values = parse (mem)
do_what_you_want_with_values (values)

I can’t see any form of synchronization here.

BSD-like licence of course) for review and comments.

Best regards.

Regards Manlio P.

Manlio P. <manlio_perillo@…> writes:

Good.

Thanks.

When does the running phase actually happen?
At the end of the HTTP request? At regular intervals?

I’ve set a hook in the logging phase.

Note that since you don’t know when a variable will be updated, you may
miss important changes in the variable.

Yes, but it’s monitoring, not real time.

As an example, in the upstream module if a peer is down, the variable
peer.down will be set to 1, but if all the peers are down, then all
the variables are set to 0, for a fast restart on the next request.

It’s a side effect; sorry, I’m thinking in terms of monitoring.

  4. Scripts
    I can’t see any form of synchronization here.

Correct, but it’s monitoring, not real time; a value fetch is atomic and so
it’s coherent at a given moment. Nginx is write-only and uses ngx_atomic_t
variables; it’s a snapshot at a given moment. Well, can I say it’s just
monitoring :wink:

Thank you for your input,

Best regards.

François Battail wrote:

coherent at a given moment. Nginx is write-only and uses ngx_atomic_t variables;
it’s a snapshot at a given moment. Well, can I say it’s just monitoring :wink:

Ok, I know that the implementation does atomic fetches of the monitored
variables.

But the purpose of this module is to make these variables available from
an external client, right?

However, if the external client and Nginx do not synchronize access
to the shared memory, then the client will potentially read wrong values.

Thank you for your input,

Best regards.

Regards Manlio P.

Manlio P. <manlio_perillo@…> writes:

But the purpose of this module is to make these variables available from
an external client, right?

Not really available, but sampled to make statistics.

However, if the external client and Nginx do not synchronize access
to the shared memory, then the client will potentially read wrong values.

I don’t think so. Each variable will be correct and is not supposed to be
related to another one (this assertion is false if your point of view is
debugging). The script will run, for example, with a cycle of 10 s (as collectd
does) to give trends, no more, no less.
Then it’s up to you to read the log and search for correlations if the trends
look curious; it’s just monitoring! But (I hope) it may be a good help for
people running Nginx on many servers.

Best regards.

François Battail wrote:

I don’t think so. Each variable will be correct and is not supposed to be
related to another one (this assertion is false if your point of view is
debugging).

Each variable will be correct, but since you are writing the value to a
shared memory location not atomically, the variable as read by the
client will not be correct.

The script will run, for example, with a cycle of 10 s (as collectd does)
to give trends, no more, no less.

This is, IMHO, wrong in principle.
First of all, it does not matter whether the script reads the shared memory
every 10 s.

It can happen that when the client is reading the shared memory, Nginx
concurrently writes this memory.

Moreover it really does not make sense to write in a shared memory if
you want to support monitoring tools!

Just write a module like the stub_status module!

Finally, there is one last problem.
Nginx uses only a few shared variables (the ones handled by the
stub_status module).

All other variables are private to each worker process.
This means that your shared memory area will probably contain values
that are not very useful.

Then it’s up to you to read the log and search for correlations if the trends
look curious; it’s just monitoring! But (I hope) it may be a good help for
people running Nginx on many servers.

Best regards.

Regards Manlio P.

On Thu, May 01, 2008 at 11:39:50AM +0200, Manlio P. wrote:

Each variable will be correct, but since you are writing the value to a
shared memory location not atomically, the variable as read by the
client will not be correct.

Why? Reading several variables at once will not be atomic (unless
protected by a lock), but individual reads will be, as long as each variable
in shm fits into a sig_atomic_t (e.g. 4 bytes on x86).

It can happen that when the client is reading the shared memory, Nginx
concurrently writes this memory.

The worst you can expect is that e.g. “requests ok” + “requests failed”
won’t be equal to “total requests”. A sig_atomic_t is effectively atomic
on most architectures (don’t know whether POSIX guarantees it
everywhere).

Moreover it really does not make sense to write in a shared memory if
you want to support monitoring tools!

Just write a module like the stub_status module!

… which has to get its data from somewhere so it’s going to face
essentially the same problem.

Finally, there is one last problem.
Nginx uses only a few shared variables (the ones handled by the
stub_status module).

All other variables are private to each worker process.
This means that your shared memory area will probably contain values
that are not very useful.

You could always add more information to a shared segment (like
upstream_fair does).

Best regards,
Grzegorz N.

On Thursday, 01 May 2008 at 11:39 +0200, Manlio P. wrote:

Each variable will be correct, but since you are writing the value to a
shared memory location not atomically, the variable as read by the
client will not be correct.

Yes, I’m aware of this, but at this time I don’t know which system calls
to use to be compatible with other OSes, so assume there will be a
critical section (for example a mutex/spinlock in shared memory) and
that the script must use it prior to reading the data.

Moreover it really does not make sense to write in a shared memory if
you want to support monitoring tools!

Just write a module like the stub_status module!

Sorry, but I do not understand what you mean. I just want to provide
something like /proc (on Linux) which is used by many monitoring tools.

Finally, there is one last problem.
Nginx uses only a few shared variables (the ones handled by the
stub_status module).

All other variables are private to each worker process.
This means that your shared memory area will probably contain values
that are not very useful.

I agree. That’s why I provide a very simple API to help migrate useful
data to the monitoring area; of course it’s not transparent, and
variables need to be ngx_atomic_t, not private. But that way it will be
easy to add monitoring support when needed.

For example, the gzip compression ratio can be compared to CPU use over the
mid/long term: should I use gzip level 9 or 1 for my application?
It could be done if modules provide the possibility to monitor useful
variables when needed. It’s a framework with a very low hit on
performance; I can change the API in such a way that, if monitoring is not
enabled, the module knows it should use a private variable instead of an
ngx_atomic_t, but I agree it’s not completely transparent.

Thank you for your input, I’m listening and it’s open :wink:

Best regards.

Grzegorz N. wrote:

concurrently writes this memory.

The problem is not with Nginx, but with the client that will read
the variables formatted in the shared memory.

The monitoring module for Nginx will write data like this to the file, for
example (the numbers are in string format):
accept:00000002
read:000000001
write:000000001
wait:000000001
mybackendserver1status:00000000
mybackendserver2status:00000001

The worst you can expect is that e.g. “requests ok” + “requests failed”
won’t be equal to “total requests”. A sig_atomic_t is effectively atomic
on most architectures (don’t know whether POSIX guarantees it
everywhere).

Moreover it really does not make sense to write in a shared memory if
you want to support monitoring tools!

Just write a module like the stub_status module!

… which has to get its data from somewhere so it’s going to face
essentially the same problem.

No.
With an HTTP interface the client will just make a request to obtain the
data.

With shared memory it is a mess, I can’t see any good reason to prefer
this solution.

Yes, but this means changing the existing code in Nginx in a non-trivial way.

Best regards,
Grzegorz N.

Regards Manlio P.

François Battail wrote:

The problem with this is that the script can arbitrarily block Nginx if
it holds the lock for too long.

Moreover it really does not make sense to write in a shared memory if
you want to support monitoring tools!

Just write a module like the stub_status module!

Sorry, but I do not understand what you mean. I just want to provide
something like /proc (on Linux) which is used by many monitoring tools.

Ok, but I think that providing a file system interface is not the best
solution.

If you want to monitor global variables, then you can use the
stub_status module (maybe adding new global shared variables).

If you want to monitor things like gzip compression ratio, then just
implement a custom variable $gzip_ratio that the user can use in the log
file.

[…]

Regards Manlio P.

On Friday, 02 May 2008 at 10:52 +0200, Manlio P. wrote:

The problem with this is that the script can arbitrarily block Nginx if
it holds the lock for too long.

I will not call a sem_wait() but a sem_trywait() of course! If Nginx
cannot write because a script holds the semaphore, then the script will
read the old values; I don’t see an issue.
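Roughly what I have in mind on the Nginx side (just a sketch; it assumes a process-shared POSIX semaphore placed in the shared segment and initialized with sem_init(&lock, 1, 1)):

#include <semaphore.h>

/* Sketch: publish a new snapshot only if no reader currently holds the lock. */
static void
monitoring_publish(sem_t *lock, void (*copy_values_out)(void))
{
    if (sem_trywait(lock) == -1) {
        /* a script holds the semaphore: skip this update,
           readers will simply see the previous snapshot */
        return;
    }

    copy_values_out();   /* copy the registered ngx_atomic_t values into the segment */

    sem_post(lock);
}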

Ok, but I think that providing a file system interface is not the best
solution.

If you want to monitor global variables, then you can use the
stub_status module (maybe adding new global shared variables).

Stub_status works, but it’s not the cleanest code in Nginx and there’s no
simple way to extend the set of watched variables, since you need to
specifically modify other modules with blocks of conditional compilation. If
you modify stub_status you potentially break the collectd and Nagios
plugins.
A file system interface is universal and means it will be easy to use
whatever tool you want. A monitoring agent written in C will be happy to
read a file, and a little bit less happy if a www library or executing
wget is needed to fetch the data.

That’s why I propose two things:

  1. A generic interface for monitoring agents

The easiest one: something file-like containing a list of key:value pairs. Even
if the monitoring agent doesn’t know the semantics of a key, it can report back
the value and a graph can be made. Of course it’s possible to modify
stub_status (and break compatibility) to do the same thing, but it would be of
no help for point 2. I don’t know yet whether it will be shared memory or a
regular file mmap()ed (file locking on Unices is a complete mess :frowning: ).
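For completeness, the mmap()-on-a-regular-file variant would be roughly this (a sketch; the path and size are placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MONITORING_FILE  "/usr/local/nginx/logs/nginx.monitoring"  /* placeholder */
#define MONITORING_SIZE  4096                                      /* placeholder */

/* Writer side: create the file, size it, and map it shared.
   A reader would do the same with O_RDONLY and PROT_READ. */
static void *
map_monitoring_file(void)
{
    int    fd;
    void  *mem;

    fd = open(MONITORING_FILE, O_RDWR | O_CREAT, 0644);
    if (fd == -1) return NULL;

    if (ftruncate(fd, MONITORING_SIZE) == -1) { close(fd); return NULL; }

    mem = mmap(NULL, MONITORING_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping remains valid after closing the descriptor */

    return (mem == MAP_FAILED) ? NULL : mem;
}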

  2. An API

An API for other modules, to help them provide variables for monitoring. At
the cost of an indirection it may be possible, at runtime, to choose whether a
variable is monitored and then to do an atomic operation or not. If a module
offers some variables for monitoring, the user can choose whether or not to
monitor them. That’s value for the software and for the user.

The API could be as simple as:

ngx_monitoring_value_t *
ngx_register_monitoring_value(ngx_str_t *name, ngx_str_t *command_name,
    ngx_int_t option);

void
ngx_monitoring_value_add(ngx_monitoring_value_t *value, ngx_int_t nbr);

void
ngx_monitoring_value_set(ngx_monitoring_value_t *value, ngx_int_t new_value);

For example, in the case of the upstream server round robin module, code
would be like this (pseudo code):

init:
    servers = array of ngx_monitoring_value_t * [nbr_servers]
    for each upstream server i:
        servers[i] = ngx_register_monitoring_value(
            "upstream-status-" + server_name[i], "upstream_server_status", 0);

run:
    if (event == down) {
        ngx_monitoring_value_set(servers[i], 0);
    } else if (event == up) {
        ngx_monitoring_value_set(servers[i], 1);
    }

Just put “monitor upstream_server_status ;” in nginx.conf and my module
will do all the atomic_t stuff; otherwise it will use normal operations.
Cost at runtime: one function call and a conditional per variable…
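Internally I imagine something like this (the field names are purely hypothetical, just to show that the per-call cost is a single test):

/* Hypothetical layout: if the value is not monitored, fall back to a plain
   per-worker integer instead of an atomic operation on shared memory. */
typedef struct {
    ngx_uint_t     monitored;   /* set when "monitor <command_name> ;" is present */
    ngx_atomic_t  *shared;      /* slot in the monitoring shared memory */
    ngx_int_t      local;       /* private fallback */
} ngx_monitoring_value_t;

void
ngx_monitoring_value_add(ngx_monitoring_value_t *value, ngx_int_t nbr)
{
    if (value->monitored) {
        ngx_atomic_fetch_add(value->shared, nbr);   /* shared, atomic */
    } else {
        value->local += nbr;                        /* private, plain add */
    }
}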

If you want to monitor things like gzip compression ratio, then just
implement a custom variable $gzip_ratio that the user can use in the log
file.

OK, gzip ratio was not the best real-life example :wink: but imagine you
have MRTG graphs and important values in the log: you ran stress tests
for 24 h and, 1.4 × 10^9 requests later, the error log is 100 MB long; it looks
like a nightmare to dig through the log to correlate with load, for example,
doesn’t it?

Just a different example where I want both logging and monitoring. I have a
special Nginx module with a circular buffer used to communicate with
threads. If there’s a buffer overflow, I log it, but it would be nice
(for me) to have a circular-buffer-overflow error counter included in
the monitoring watch set. Of course I can “hack” stub_status and the
collectd plugin, but it’s better to propose a more general solution
without breaking anything and with no significant performance hit.

Thank you very much for your time and your input, Manlio; even if we
don’t agree on some points, it is very stimulating for me to have a
challenger such as you.

Best regards.