Re: [Nagios-devel] A different way?
Posted: Mon Oct 12, 2009 6:28 pm
Apologies for replying to this thread rather late, but I figured I should
speak up, as someone who has implemented a distributed design. More
apologies for hellish Outlook quoting, which I have attempted to make legible.
-----Original Message-----
From: Andreas Ericsson [mailto:[email protected]]
>On 09/25/2009 01:05 AM, Steven D. Morrey wrote:
>> The checks are already executing on the local machine, so how about a
>> daemon on each machine, the daemon would keep the schedule and
>> execute service checks locally, processing the result and returning
>> the results and the required actions (based on a local policy) to
>> nagios which would then do the actual work of handling notifications
>> etc and so forth. This way nagios could be an auditor, if it doesn't
>> receive a result on time as expected, then it could query the daemon
>> to see whats gone wrong, if that fails then it could initiate a host
>> check, etc.
I see 2 or 3 major differences between your proposal and the current passive schemes:
- Nagios can more easily poke "lost" systems (you can do this now with UNKNOWN and some clever notification & escalation configs, or possibly with obsessing, but it's far more obscure and convoluted)
- If I understand you, you're also proposing pushing the flap detection logic out to the local daemons (and possibly more, but determining what else has no off-host dependencies is difficult - dependency checks would need to be central, for example)
- It would be possible for Nagios to act as a configuration management system for the monitoring config of the remote nodes, instead of requiring some outboard system
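To make the "auditor" role concrete, here is a minimal Python sketch of the central side: it only tracks when each remote daemon last reported and flags the ones that are overdue, which are the candidates for a follow-up query or host check. The function name, the `last_seen` mapping, and the `interval`/`grace` parameters are all hypothetical; nothing here is taken from Nagios itself.

```python
import time
from typing import Dict, List, Optional


def find_stale_hosts(
    last_seen: Dict[str, float],
    interval: float,
    grace: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the hosts whose last result is older than interval + grace.

    last_seen maps a host name to the Unix time of its latest result.
    All names and parameters are illustrative assumptions.
    """
    if now is None:
        now = time.time()
    deadline = interval + grace
    # A host is overdue when more than `deadline` seconds have passed
    # since it last reported anything.
    return sorted(host for host, ts in last_seen.items() if now - ts > deadline)


if __name__ == "__main__":
    seen = {"web1": 1000.0, "db1": 1290.0}
    # With a 300 s check interval and 60 s grace, only web1 is overdue at t=1400.
    print(find_stale_hosts(seen, interval=300, grace=60, now=1400.0))
```

The central server does no scheduling in this sketch; it only notices silence, which is the part of the proposal that the current passive schemes make awkward.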
> Nagios still needs to retain the ability to execute checks on its
> own, or it won't be able to monitor things like routers and switches.
No, it doesn't. You can monitor those things via plug-ins that run on worker nodes. This is _especially_ important for things like latency monitoring, where you may want your probe point to be a different place on the network than your Nagios server.
> The two important savings can be had anyway by simply adding
> more systems, and that doesn't involve modifying the monitored
> systems at all (unless one wants to install a local agent to
> get more detailed monitoring data, ofcourse). Networks that are
> large enough to require multiple Nagios servers are almost
> invariably owned by large corporations which have no qualms at
> all about paying an additional $5.000 for a new server, but
> often have policies and laws regulating what kind of software
> they're allowed to run on their systems.
> I think we'll gain very, very little by moving down this road.
> Should we decide, at some point in the future, that it's a good
> thing to do, I'm sure the Merlin protocol can be (ab)used to
> make such a daemon workable though.
Speaking as someone who actually works at one of those "large corporations" (and has worked at several others), you're smoking crack. We care deeply about bad scaling, and are not willing to buy 100 servers (not an exaggeration for 2.x, probably more like 20-40 servers for 3.x) to fix bad code design. If I hadn't written a passive check framework, we would never have been able to deploy Nagios.
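For readers who have not used passive checks: a remote node runs the plug-in itself and hands Nagios the result as a `PROCESS_SERVICE_CHECK_RESULT` external command. The Python sketch below only formats such a command line; the helper name and the example host and service are hypothetical, and it is not the framework described above. How the line reaches the Nagios command file (NSCA, ssh, something home-grown) is left open.

```python
import time
from typing import Optional

# Standard plug-in return codes.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_passive_result(
    host: str,
    service: str,
    state: str,
    output: str,
    timestamp: Optional[int] = None,
) -> str:
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line.

    The helper itself is an illustrative assumption; only the command
    format comes from Nagios.
    """
    if ";" in host or ";" in service:
        raise ValueError("host and service names must not contain ';'")
    if timestamp is None:
        timestamp = int(time.time())
    code = STATES[state]
    # One command per line, so flatten any newlines in the plug-in output.
    clean = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{clean}\n"


if __name__ == "__main__":
    print(format_passive_result("web1", "HTTP", "CRITICAL",
                                "connection refused", timestamp=1255372080), end="")
```

All traffic in this model flows from the node to the server, which is why the server never has to hold a connection open to each poller.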
> Communication has overhead. DNX doesn't scale up linearly with the
> number of poller hosts you add, and neither does Merlin. With the
> amount of communication, and the number of servers involved in the
> networks we're talking about here, I'm highly skeptical that this
> approach will work very well at all. Basically, anything that
> involves more than 500 poller nodes will be tricky to maintain
> due to the sheer amount of connections the master system is
> required to maintain one way or another.
We currently have well over 3,000 "poller nodes" per Nagios instance, with multiple instances running on a single server. The communication overhead is trivial compared to the savings. Note that Nagios communicates _zero_ status to my pollers; all communication is in the other direction. A more entangled design (which this appears to be) would, indeed, have
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: Andreas Ericsson [mailto:[email protected]]