
Re: [Nagios-devel] RFC/RFP Nagios command workers

Posted: Mon May 23, 2011 9:38 am
by Guest
On 05/23/2011 11:37 AM, Matthieu Kermagoret wrote:
>
>> The idea to solve all of that is to fork() off a set of worker
>> threads at startup that free() all possible memory and re-connect
>> to the master process via a unix domain socket (or network socket
>> that by default only listens to the localhost address) to receive
>> requests to run commands and return the results of those commands.
>>
>
> While I agree that distributing check execution among multiple
> processes can be a really good idea, I don't know if this should be
> implemented in the Core. This can add significant complexity to the
> code while not being useful to all Nagios users. The Core already has
> a proper API that allows modules to execute checks themselves, so why
> not rely on it for distribution and improve the existing command
> execution mechanism?
>

Because shipping an official module that does it would mean not only
supporting the old complexity, but also the new one. Having a single
default system for running checks would definitely be preferable to
supporting multiple ones.

> As you say, one of the root problems of the current implementation is
> the use of temporary files, as this consumes much I/O when writing,
> scanning and reading them. Also the Nagios Core process is fork()ed
> multiple times and this might consume unnecessary CPU time. So I
> propose the following:
>
> 1) Remove the multiple fork system to execute a command. The Nagios
> Core process directly forks the process that will exec the command
> (more or less sh's parsing of command line, don't really know if this
> could/should be integrated in the Core).
>

This really can't be done without using multiple threads since the
core can't wait() for input and children while at the same time
issuing select() calls to multiplex the new output of currently
running checks.


> 2) The root process and the subprocess are connected with a pipe() so
> that the command output can be fetched by reading the pipe. Nagios
> will maintain a list of currently running commands.
>

Pipes are limited in that they only guarantee atomic writes of up to
PIPE_BUF bytes, which POSIX allows to be as small as 512. TCP sockets
don't have this problem. There's also
the fact that a lot of modules already use sockets, so we can get
rid of a lot of code in those modules and let them re-use Nagios'
main select() loop and get inbound events on "their" sockets as a
broker callback event. Much neater that way.
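
As a rough illustration of modules getting inbound events on "their" sockets from one shared select() loop, here is a minimal Python sketch. It is not Nagios code: `run_loop`, `demo_module_callback`, and `on_module_input` are hypothetical names, and Python's `selectors` module stands in for the core's select() loop.

```python
import selectors
import socket


def run_loop(sel, max_events):
    """Dispatch readable events to their callbacks until max_events are handled."""
    handled = 0
    while handled < max_events:
        ready = sel.select(timeout=1.0)
        if not ready:
            break  # nothing arrived within the timeout
        for key, _mask in ready:
            callback = key.data      # the callback stored at registration time
            callback(key.fileobj)    # hand the ready socket to its owner
            handled += 1
    return handled


def demo_module_callback():
    sel = selectors.DefaultSelector()
    received = []

    # Hypothetical module socket: `peer` plays the remote end talking to the module.
    module_side, peer = socket.socketpair()

    def on_module_input(sock):
        received.append(sock.recv(4096))

    # The module registers its socket once; the shared loop does the polling.
    sel.register(module_side, selectors.EVENT_READ, on_module_input)
    peer.sendall(b"event for module")
    run_loop(sel, max_events=1)

    sel.unregister(module_side)
    module_side.close()
    peer.close()
    sel.close()
    return received
```

The module keeps no polling loop of its own here; it only supplies a callback, which is the code saving described above.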

> 3) The event loop will multiplex processes' I/O and process them as necessary.
>

That's what the worker processes will do and then feed the results
back to the Nagios core through the sequential socket, which will
guarantee read and write operations large enough to never truncate
any of the data necessary for the master process to do proper
bookkeeping.
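
A minimal sketch of the worker idea in Python, under stated assumptions: the master forks one worker, sends it a command over a socket pair, and reads the result back. The JSON payload, the 4-byte length prefix, and all function names are illustrative choices, not the proposed wire format. The length prefix is there because this sketch uses a stream socket, so it has to mark message boundaries itself.

```python
import json
import os
import socket
import struct
import subprocess


def send_msg(sock, obj):
    # Length-prefixed JSON (an assumption of this sketch, not a Nagios format).
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)


def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the socket")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length))


def worker_loop(sock):
    # Run each requested command and report its exit code and output.
    while True:
        try:
            request = recv_msg(sock)
        except EOFError:
            return  # master hung up
        proc = subprocess.run(request["argv"], capture_output=True, text=True)
        send_msg(sock, {"id": request["id"],
                        "returncode": proc.returncode,
                        "output": proc.stdout})


def spawn_worker():
    master_end, worker_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:  # child: becomes the worker
        master_end.close()
        try:
            worker_loop(worker_end)
        finally:
            os._exit(0)  # never fall back into the master's code
    worker_end.close()
    return pid, master_end


def run_check(sock, check_id, argv):
    send_msg(sock, {"id": check_id, "argv": argv})
    return recv_msg(sock)
```

Usage would look like `pid, sock = spawn_worker()` followed by `run_check(sock, 1, ["echo", "OK"])`. Closing `sock` makes the worker exit, after which the master reaps it with `os.waitpid(pid, 0)`.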

>> This has several benefits, although they're not immediately user
>> visible.
>> * I/O load will decrease significantly, leaving more disk throughput
>> capacity for performance data graphing or status data database
>> solutions.
>
> Still holds but to a smaller extent, as the "problem of Nagios using a
> lot more copied memory per fork than it's supposed to" is not solved.
> This could be solved with a module however, see below.
>

Not without the module also running external programs, which just means
more complexity inside the Nagios core instead of less.

>> * Scripting languages can be embedded regardless of memory leaks and
>> whatnot, since worker daemons can be killed off and respawned every
>> 50000 checks (or something), thus causing the kernel to clean up
>> any and all leaked memory.
>
> There could be modules that override checks and forward them to
> interpreter daemons on a per-language basis for example.
>

Yup. I'd expect this to be a natural progression of how things work,
with Python being the first in queue to be embedded.
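
The kill-and-respawn scheme quoted above can be sketched as follows. This is a toy model, not the planned implementation: each "check" just asks the worker for its PID, so the output shows which process handled it, and `run_checks_with_recycling` and its helpers are hypothetical names. The limit is a parameter, so 50000 or any other value fits.

```python
import os
import socket


def _recycling_worker(sock):
    # Answer every one-byte request with this worker's PID until EOF.
    while sock.recv(1):
        sock.sendall(str(os.getpid()).encode() + b"\n")


def _spawn_recycling_worker():
    master_end, worker_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        master_end.close()
        try:
            _recycling_worker(worker_end)
        finally:
            os._exit(0)
    worker_end.close()
    return pid, master_end


def _retire_worker(pid, sock):
    sock.close()        # worker sees EOF and exits
    os.waitpid(pid, 0)  # reap it; the kernel reclaims whatever it leaked


def _read_line(sock):
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(64)
        if not chunk:
            raise EOFError("worker went away")
        buf += chunk
    return buf


def run_checks_with_recycling(n_checks, max_checks_per_worker):
    """Return the PID of the worker that handled each check."""
    handled_by = []
    pid, sock, done = None, None, 0
    for _ in range(n_checks):
        if pid is None or done >= max_checks_per_worker:
            if pid is not None:
                _retire_worker(pid, sock)  # retire before spawning a fresh one
            pid, sock = _spawn_recycling_worker()
            done = 0
        sock.sendall(b"x")
        handled_by.append(int(_read_line(sock)))
        done += 1
    if pid is not None:
        _retire_worker(pid, sock)
    return handled_by
```

The old worker is retired before its replacement is forked so that the new child does not inherit the old socket, which would otherwise keep the old worker from ever seeing EOF.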

>> * Nagios core can be single-threaded, which means higher portability,
>> less memory usage and more robust code.
>
> Still holds.
>

Nope. It fails for all modules that require constantly poll()'ed
sockets.

>> * Eventbrok

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]