Re: [Nagios-devel] A different way?


Apologies for replying to this thread rather late, but I figured I should speak up, as someone who has implemented a distributed design. More apologies for hellish Outlook quoting, which I have attempted to make legible :-(

-----Original Message-----
From: Andreas Ericsson [mailto:[email protected]]

>On 09/25/2009 01:05 AM, Steven D. Morrey wrote:
>> The checks are already executing on the local machine, so how about a
>> daemon on each machine, the daemon would keep the schedule and
>> execute service checks locally, processing the result and returning
>> the results and the required actions (based on a local policy) to
>> nagios which would then do the actual work of handling notifications
>> etc and so forth. This way nagios could be an auditor, if it doesn't
>> receive a result on time as expected, then it could query the daemon
>> to see what's gone wrong, if that fails then it could initiate a host
>> check, etc.

I see 2 or 3 major differences between your proposal and the current passive schemes:
- Nagios can more easily poke "lost" systems (you can do this now with UNKNOWN and some clever notification & escalation configs, or possibly with obsessing, but it's far more obscure and convoluted)
- If I understand you, you're also proposing pushing the flap detection logic out to the hosts (and possibly more, but determining what else has no off-host dependencies is difficult - dependency checks would need to be central, for example)
- It would be possible for Nagios to act as a configuration management system for the monitoring config of the remote nodes, instead of requiring some outboard system
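
To make the first point concrete, here is a rough sketch of the "auditor" idea: the central side only tracks when each result last arrived and flags anything overdue. This is an illustration under assumptions, not how Nagios implements it; the function name, the "host/service" keys and the threshold value are all made up for the example.

```python
import time


def find_stale(last_result_at, freshness_threshold, now=None):
    """Return the services whose last passive result is older than the threshold.

    last_result_at maps a hypothetical "host/service" key to the Unix time
    of the most recent result received for it.
    """
    now = time.time() if now is None else now
    return sorted(
        name
        for name, received in last_result_at.items()
        if now - received > freshness_threshold
    )


# Two services, one of which has not reported for 700 seconds.
last_seen = {"web01/load": 1000.0, "web02/load": 400.0}
stale = find_stale(last_seen, freshness_threshold=300, now=1100.0)
print(stale)
```

Anything returned here is a candidate for the follow-up described in the proposal: query the daemon, and if that fails, run a host check.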

> Nagios still needs to retain the ability to execute checks on its
> own, or it won't be able to monitor things like routers and switches.

No, it doesn't. You can monitor those things via plug-ins that run on worker nodes. This is _especially_ important for things like latency monitoring, where you may want your probe point to be a different place on the network than your Nagios server.
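
As a sketch of what such a worker-side probe could look like: time a TCP connect from the worker node and map the result onto the standard plug-in exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the idea of using a TCP connect as the latency measure are assumptions for illustration only.

```python
import socket
import time

# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_latency(latency_ms, warn_ms, crit_ms):
    """Map a measured latency onto a plug-in exit code and status line."""
    if latency_ms >= crit_ms:
        return CRITICAL, f"CRITICAL - latency {latency_ms:.1f} ms"
    if latency_ms >= warn_ms:
        return WARNING, f"WARNING - latency {latency_ms:.1f} ms"
    return OK, f"OK - latency {latency_ms:.1f} ms"


def tcp_connect_ms(host, port, timeout=5.0):
    """Time a TCP connect from this node, in milliseconds (hypothetical probe)."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0


# Classifying a sample measurement; no network access is needed for this part.
code, text = classify_latency(120.0, warn_ms=100, crit_ms=250)
print(code, text)
```

The point of splitting measurement from classification is that the measurement runs wherever the probe point needs to be, while the result is just a code and a line of text to ship back.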

> The two important savings can be had anyway by simply adding
> more systems, and that doesn't involve modifying the monitored
> systems at all (unless one wants to install a local agent to
> get more detailed monitoring data, of course). Networks that are
> large enough to require multiple Nagios servers are almost
> invariably owned by large corporations which have no qualms at
> all about paying an additional $5.000 for a new server, but
> often have policies and laws regulating what kind of software
> they're allowed to run on their systems.

> I think we'll gain very, very little by moving down this road.
> Should we decide, at some point in the future, that it's a good
> thing to do, I'm sure the Merlin protocol can be (ab)used to
> make such a daemon workable though.

Speaking as someone that actually works at one of those "large corporations" (and has worked at several others), you're smoking crack. We care deeply about bad scaling, and are not willing to buy 100 servers (not an exaggeration for 2.x, probably more like 20-40 servers for 3.x) to fix bad code design. If I hadn't written a passive check framework, we would never have been able to deploy Nagios.
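
For anyone unfamiliar with passive checks, the result a remote node hands to Nagios boils down to one line in the PROCESS_SERVICE_CHECK_RESULT external command format. The sketch below only builds that line; it is not the framework mentioned above, and the helper name is invented. How the line reaches Nagios (the command file, NSCA, or something else) is site-specific and left out.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    for field in (host, service):
        # Semicolons separate fields and a newline ends the command.
        if ";" in field or "\n" in field:
            raise ValueError("host and service must not contain ';' or newlines")
    ts = int(time.time() if timestamp is None else timestamp)
    # Keep the plug-in output on a single line.
    output = output.replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}\n"


line = passive_service_result("web01", "load", 0, "OK - load 0.1", timestamp=1254000000)
print(line, end="")
```

Since the remote side only ever emits lines like this, the central server never has to open a connection to the node to collect a result.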

> Communication has overhead. DNX doesn't scale up linearly with the
> number of poller hosts you add, and neither does Merlin. With the
> amount of communication, and the number of servers involved in the
> networks we're talking about here, I'm highly skeptical that this
> approach will work very well at all. Basically, anything that
> involves more than 500 poller nodes will be tricky to maintain
> due to the sheer amount of connections the master system is
> required to maintain one way or another.

We currently have well over 3,000 "poller nodes" per Nagios instance, with multiple instances running on a single server. The communication overhead is trivial compared to the savings. Note that Nagios communicates _zero_ status to my pollers, all communication is in the other direction. A more entangled design (which this appears to be) would, indeed, have

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Andreas Ericsson [mailto:[email protected]]