Re: [Nagios-devel] Removing host checks for non-OK passive results

On Tue, 23 May 2006, Ton Voon wrote:

> On 19 May 2006, at 19:06, Bruce Campbell wrote:
>> Or more precisely, the host may well be 'down' from one monitoring node's
>> point of view, and 'up' from another monitoring node's point of view. Imho,
>> each monitoring node should maintain its own idea of host's up/down state,
>> and not send/accept host check results between themselves. Service check
>> results are a different issue.
>
> We set up distributed monitoring across internationally spread datacenters.
> With firewall policies, only the local monitoring server can ping their local
> hosts. Thus the central monitoring server really has no idea about whether a
> node is up or down - it has to rely on the slave monitoring server.

My own distributed setup doesn't have a central monitoring server, and all
nodes are assumed to be able to ping all monitored hosts. There's some
other magic happening to ensure that only one notification is sent per
event and that nsca doesn't loop on the same check result.

>> Ideally, Nagios just runs one host check after the first non-OK service
>> result comes in, and uses the cached value as long as it is within the
>> host's freshness_threshold. Otherwise, your check_latency for everything
>> goes way up, and you eventually write your own scheduler out of irritation
>> at seeing service checks being executed at 5 hour intervals.
>
> Hmm, not sure about writing your own scheduler :)

Attached, together with the patches required for Nagios::Config
(Nagios::Object 0.08). Even follows most dependencies, although you could
probably craft a configuration that would break this without too much
effort.

I have this running on my hosts for just the host checks at the present
time, pending some tuits to track down some weird service check
interactions caused by leaving check_freshness enabled.

One obvious gotcha at the moment: when it runs as the Nagios
(host|service)_perfdata_processing_command, its first execution happens at
Nagios_starttime + (host|service)_perfdata_processing_interval, not at
Nagios_starttime. If the interval is set to a long time span to avoid the
CPU load of perl parsing the config file, Nagios won't receive any results
for that period of time.
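
To put numbers on that gotcha, here is a small Python sketch. It is only
arithmetic on the behaviour described above; the function names are
hypothetical and nothing here is read from a Nagios configuration.

```python
def first_run_time(nagios_start: float, interval: float) -> float:
    """Time of the first run, assuming it is delayed by one full interval
    after startup (the behaviour described in this post)."""
    return nagios_start + interval


def blind_window(nagios_start: float, interval: float) -> float:
    """Seconds after startup during which no results are submitted."""
    return first_run_time(nagios_start, interval) - nagios_start


# Example: Nagios starts at t=0 with a one-hour processing interval,
# so the first batch of results only arrives at t=3600.
start, interval = 0.0, 3600.0
first = first_run_time(start, interval)
gap = blind_window(start, interval)
```

Shortening the interval shrinks the gap, at the price of parsing the
config file more often.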

> We considered using a "cache" value for a host status - I think the idea has
> merit and would reduce a large number of host checks, especially if something
> suddenly happened to a large set of services on one host. However, we baulked
> at going ahead because there's bound to be some subtle situation where this
> would be undesirable.
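
For illustration only, a minimal Python sketch of the cached host status
idea quoted above. HostStateCache, run_check and clock are hypothetical
names and none of this is Nagios code; it only shows how a burst of non-OK
service results would trigger a single real host check while the cached
value is still within the freshness threshold.

```python
import time
from typing import Callable, Dict, Tuple


class HostStateCache:
    """Reuse the last host check result while it is still fresh."""

    def __init__(self, freshness_threshold: float,
                 run_check: Callable[[str], str],
                 clock: Callable[[], float] = time.time) -> None:
        self._threshold = freshness_threshold   # seconds a result stays valid
        self._run_check = run_check             # performs the real host check
        self._clock = clock                     # injectable for testing
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, host: str) -> str:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None:
            state, checked_at = cached
            if now - checked_at <= self._threshold:
                return state                    # fresh enough, skip the check
        state = self._run_check(host)           # stale or missing: check again
        self._cache[host] = (state, now)
        return state
```

With a 60 second threshold, ten failing services on one host inside a
minute cost one host check rather than ten. The subtle situations
mentioned above (for example a host that recovers inside the window) are
not handled here.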

See the "Workaround for 'Host DOWN' false-positives" thread for another
way of doing it (slurp in the entire status.dat file if you've got a small
installation, submit passive host check results from a service check if
you've got a large installation). Both have the advantage of being driven
by Nagios.
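
As a rough sketch of the second variant (submitting a passive host check
result from a service check), the Python below builds a
PROCESS_HOST_CHECK_RESULT external command line and appends it to the
Nagios command file. The helper names are hypothetical, and the command
file path is an assumption that has to match command_file in your own
nagios.cfg; this is not the code from that thread.

```python
import time
from typing import Optional

# Host status codes used by passive host check results.
HOST_UP, HOST_DOWN, HOST_UNREACHABLE = 0, 1, 2


def format_passive_host_result(host: str, status_code: int, output: str,
                               timestamp: Optional[float] = None) -> str:
    """Build one external command line for a passive host check result."""
    if status_code not in (HOST_UP, HOST_DOWN, HOST_UNREACHABLE):
        raise ValueError("status_code must be 0, 1 or 2")
    if ";" in host or "\n" in host:
        raise ValueError("host name must not contain ';' or newlines")
    ts = int(time.time() if timestamp is None else timestamp)
    text = output.replace("\n", " ")            # one command per line
    return f"[{ts}] PROCESS_HOST_CHECK_RESULT;{host};{status_code};{text}\n"


def submit_passive_host_result(command_file: str, host: str,
                               status_code: int, output: str) -> None:
    """Append the command to the Nagios command file (path is site-specific)."""
    line = format_passive_host_result(host, status_code, output)
    with open(command_file, "a") as handle:
        handle.write(line)
```

A service check wrapper would call submit_passive_host_result() with the
host state it has inferred, so the host result is still driven by Nagios'
own scheduling of that service check.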

On further consideration, there is another subtle niggle in Nagios which
would stop this from reliably working for the initial
max_service_check_spread time. You can see this niggle in action when you
start up Nagios, and watch how long it takes for a service with a
relatively low check interval to be executed. If you're unlucky, the
first execution of it will be several multiples of its check interval
after Nagios has been started, and you will have seen its 'Next Check'
time change several times.

> If the idea is validated through this thread (seems like the best way to test
> a design!), then we may be able to subsidise the development of it at
> Altinity.

--
Bruce Campbell

Freelance admin, coder and cynic. Pessimistic commentary with
sprinklings of sardonic humour a speciality.

[Attachment: run_background_checks.pl (TEXT/PLAIN)]

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]