Re: [Nagios-devel] Core 4 Remote Workers

Post by Guest »

On 02/02/2013 03:12 PM, Eric Stanley wrote:
> All,
>
> I've been giving some thought to remote workers for core 4 and wanted to
> run those thoughts by this list. I see remote workers as a very useful
> extension to the worker concept in core 4.
>
> To implement remote workers, I think there are about 4 basic things that
> would need to be done.
> 1. Implement the ability to listen to multiple query handler interfaces
> (precursor to #2)

This is trivial. Simply create an additional socket and register it with
the iobroker exactly the same way everything else is handled.
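For readers without the source handy, here is a minimal sketch of the idea, using plain poll(2) in place of Nagios' iobroker; the socket paths and helper names are made up for illustration:

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Create a listening AF_UNIX socket at `path` (roughly what Nagios'
 * nsock_unix() does). */
static int make_unix_listener(const char *path)
{
    struct sockaddr_un sun;
    int sd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (sd < 0)
        return -1;
    memset(&sun, 0, sizeof(sun));
    sun.sun_family = AF_UNIX;
    strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);
    unlink(path); /* remove stale socket from a previous run */
    if (bind(sd, (struct sockaddr *)&sun, sizeof(sun)) < 0 ||
        listen(sd, 5) < 0) {
        close(sd);
        return -1;
    }
    return sd;
}

/* Watching several query handler sockets is just a matter of handing
 * each listener fd to the same event loop; Nagios would use its
 * iobroker for this, plain poll(2) stands in for it here. */
static int wait_for_connections(const int *fds, nfds_t nfds, int timeout_ms)
{
    struct pollfd pfds[16];
    nfds_t i;

    for (i = 0; i < nfds && i < 16; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }
    return poll(pfds, i, timeout_ms);
}
```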

> 2. Implement the ability to create and listen on TCP socket query
> handler interfaces.

This is also trivial, and the name "nsock_unix()" sort of suggests that
there will be an "nsock_inet()" coming along to keep it company (which
has been the thought all along).

However, I've always intended for that to be a separate daemon, which
can live in a chroot jail and only forward requests to the main Nagios
daemon that it knows is kosher. That would keep us from having to do
all the input validation and such in the core.
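A hedged sketch of what such an nsock_inet() might look like, mirroring the shape of nsock_unix(); this is not the actual Nagios implementation, just the obvious POSIX version:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical counterpart to Nagios' nsock_unix(): create a TCP
 * socket bound to the given address and port, listening. */
static int nsock_inet(const char *addr, unsigned short port, int backlog)
{
    struct sockaddr_in sin;
    int sd, reuse = 1;

    sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0)
        return -1;
    setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    if (inet_pton(AF_INET, addr, &sin.sin_addr) != 1)
        goto err;
    if (bind(sd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
        goto err;
    if (listen(sd, backlog) < 0)
        goto err;
    return sd;
err:
    close(sd);
    return -1;
}
```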

> 3. Add a host key to the worker registration to allow workers to specify
> the host(s) for which it will handle checks.

Not really difficult, although I suspect one will want to use groups
instead of specific hosts, and also use the address which the other
node is connecting from as the host to monitor (so one can have self-
monitoring servers that phone in to Nagios with their results).
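To illustrate the "use the connecting address" idea: the core can look up the peer address of a connected worker socket with getpeername(). How that address would then be mapped to a host object is left open; peer_address is an invented helper name:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Derive the "phoning-in" node's address from its connection to us.
 * Only the lookup is shown; matching it against host objects is
 * speculation. */
static int peer_address(int sd, char *buf, size_t buflen)
{
    struct sockaddr_storage ss;
    socklen_t slen = sizeof(ss);

    if (getpeername(sd, (struct sockaddr *)&ss, &slen) < 0)
        return -1;
    if (ss.ss_family == AF_INET) {
        struct sockaddr_in *sin = (struct sockaddr_in *)&ss;
        return inet_ntop(AF_INET, &sin->sin_addr, buf, buflen) ? 0 : -1;
    }
    if (ss.ss_family == AF_INET6) {
        struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)&ss;
        return inet_ntop(AF_INET6, &sin6->sin6_addr, buf, buflen) ? 0 : -1;
    }
    return -1;
}
```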

> 4. Write a stand-alone remote worker that can connect to the core
> instance via TCP.
>

Trivial, since lib/worker.c contains 99% of the code needed to write a
worker.
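The missing 1% is essentially just the TCP connect. A sketch under that assumption (worker_connect is an invented name; the actual job protocol would come from lib/worker.c):

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Dial the core's (hypothetical) TCP query handler. Everything after
 * the connect would be the same code a local worker runs over its
 * unix socket. */
static int worker_connect(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int sd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;
    for (ai = res; ai; ai = ai->ai_next) {
        sd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (sd < 0)
            continue;
        if (connect(sd, ai->ai_addr, ai->ai_addrlen) == 0)
            break; /* connected */
        close(sd);
        sd = -1;
    }
    freeaddrinfo(res);
    return sd;
}
```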

> The reason I have steps 1 and 2, instead of combining them is first,
> because a generalized solution is more extensible and second, I think
> having multiple TCP listeners is a reasonable use case where you have a
> multi-homed system, but you may not want to listen on all interfaces.
>

That can be firewalled away quite trivially, so no need for us to handle
that with code that might break (as I suspect it will see little testing).

> The host key should be allowed to specify one or more IP addresses, IP
> subnets, contiguous IP address ranges, host names and host name
> patterns/wildcards (i.e. *.example.com). If multiple workers register
> for the same host, some sort of distribution mechanism should be used to
> load balance the workers.
>

Umm... Is this what the remote worker should request? If so, we're doing
a pretty major change in Nagios, where a host's address has always been
just a string that we pass to the plugins, and it won't be long until
people start requesting regex matching, subdomain matching and whatnot
for it, and we'll have to start resolving hostnames.

I'd say just go with hostgroups instead. It's easier, and people will
have to do some minor configuring of remote workers anyway, so adding
"hostgroups=core-routers" to that config alongside the Nagios ip and
port isn't such a big chore.
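Something along these lines, with purely hypothetical option names (no such file format was ever specified in this thread):

```
# hypothetical remote worker config
address=nagios.example.com    ; where the core's TCP query handler lives
port=5668                     ; made-up port number
hostgroups=core-routers       ; which hosts this worker will check
```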

> Using the second criterion of host to determine which worker gets the
> check raises the question of the order of precedence for the criteria.
> Initially, I think the host should have precedence over plugin, but I
> can see implementing an order of precedence option in the core
> configuration file. This would be more important if additional worker
> selection criteria were added.
>

Object over check type, any day. We may have to add a "check_type" thing
to command objects though, so workers can register for only local checks
and still have their http checks and whatnot done from remote, where
they make more sense. This requires some thinking.
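For example, a command object might grow a hypothetical check_type attribute (not real Nagios syntax), so the scheduler knows which checks must stay on the monitored box:

```
define command {
    command_name    check_disk_local
    command_line    $USER1$/check_disk -w 20% -c 10% -p /
    check_type      local    ; hypothetical attribute: run on the target host itself
}
```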

> The communication between the remote worker and the core process should
> be able to be protected by SSL. The remote worker will need a mechanism
> to retry the connection in the event the network drops the connection.
>

Retrying the connection is the easy part. What should it do with the
jobs it's running while the upstream connection is dead? More importantly,
how should core Nagios react to the checks it's supposed to run when the
...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: ae@op5.se