[Nagios-devel] passive on-demand host checks being converted from


I've been seeing a problem with on-demand host checking since we moved
to a distributed setup. We're running Nagios 3.0.2 with a central
server that does virtually no checks. All checks are performed by 2
other distributed servers.

I have an example situation here where the distributed node detects a
service failure then host failure. On the distributed node, I see:

Host Down[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )
Host Down[07-07-2008 15:29:42] HOST ALERT: mfrost_win;DOWN;SOFT;9;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:28:40] HOST ALERT: mfrost_win;DOWN;SOFT;8;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:27:38] HOST ALERT: mfrost_win;DOWN;SOFT;7;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:26:36] HOST ALERT: mfrost_win;DOWN;SOFT;6;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:25:34] HOST ALERT: mfrost_win;DOWN;SOFT;5;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:24:32] HOST ALERT: mfrost_win;DOWN;SOFT;4;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:23:30] HOST ALERT: mfrost_win;DOWN;SOFT;3;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:22:28] HOST ALERT: mfrost_win;DOWN;SOFT;2;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical[07-07-2008 15:22:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical[07-07-2008 15:21:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

But for the corresponding set of activities I see the following on the
central/reporting server:

Service Critical[07-07-2008 15:22:29] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical[07-07-2008 15:21:33] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
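
To put a number on the difference between the two logs, a small script
along these lines could pull out the first HARD host alert from each one.
This is only a sketch: the function name first_hard_alert and the regex
are my own, the sample strings are abbreviated copies of the log lines
above, and I'm assuming the timestamps are MM-DD-YYYY.

```python
import re
from datetime import datetime

# Matches host alert entries such as:
#   [07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL ...
# \s+ after the colon also tolerates entries that were wrapped onto two lines.
HOST_ALERT = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\]\s+HOST ALERT:\s+"
    r"(?P<host>[^;]+);(?P<state>[^;]+);(?P<type>SOFT|HARD);(?P<attempt>\d+);"
)


def first_hard_alert(log_text, host):
    """Return (timestamp, attempt) of the earliest HARD host alert, or None."""
    hits = []
    for m in HOST_ALERT.finditer(log_text):
        if m.group("host") == host and m.group("type") == "HARD":
            # Assumption: the log uses MM-DD-YYYY timestamps.
            ts = datetime.strptime(m.group("ts"), "%m-%d-%Y %H:%M:%S")
            hits.append((ts, int(m.group("attempt"))))
    return min(hits) if hits else None


# Abbreviated copies of the entries shown above.
distributed_log = (
    "[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL\n"
    "[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL\n"
)
central_log = (
    "[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL\n"
)

node_hard = first_hard_alert(distributed_log, "mfrost_win")
central_hard = first_hard_alert(central_log, "mfrost_win")

# How many seconds earlier the central server went HARD than the node did.
gap_seconds = (node_hard[0] - central_hard[0]).total_seconds()
print(node_hard[1], central_hard[1], gap_seconds)
```

With the entries above, the central server reaches HARD on attempt 1,
a little over nine minutes before the distributed node does on attempt 10.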

The distributed node seems to do what it's supposed to do and continues
to retry up to max_check_attempts (10). When that first (soft) ping
failure gets passed to the central/reporting server, however, the central
server marks it as hard (DOWN;HARD;1) and sends an alert out immediately.
Meanwhile, the distributed node keeps checking for roughly nine more
minutes until it determines that the state of the host is hard
(DOWN;HARD;10).
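
As a rough model of what the two logs suggest, the difference can be
sketched in Python. This is not Nagios's actual code: the function names
are made up, and the central-server rule (a passive host result goes
straight to HARD) is an assumption inferred from the log entries above.

```python
def distributed_state_types(consecutive_failures, max_check_attempts=10):
    """State type the distributed node logs for each failed active check.

    The state stays SOFT until the attempt count reaches
    max_check_attempts, which is what the node's log shows.
    """
    types = []
    for attempt in range(1, consecutive_failures + 1):
        types.append("HARD" if attempt >= max_check_attempts else "SOFT")
    return types


def central_state_type(result_is_passive):
    """State type the central server appears to assign to a host result.

    Assumption (inferred from the central log, not from the Nagios
    source): a passive host result is treated as HARD on attempt 1.
    """
    return "HARD" if result_is_passive else "SOFT"


node_types = distributed_state_types(10)
central_type = central_state_type(True)

print(node_types)
print(central_type)
```

Under that assumption, the node's first nine results are SOFT and only
the tenth is HARD, while the central server is already HARD after the
first passive result it receives.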

The settings for this host are as follows:

central server:
max_check_attempts 10
check_interval 0
retry_interval 1
obsess_over_host 0
active_checks_enabled 0
passive_checks_enabled 1
check_freshness 1
freshness_threshold 1200

distributed node:
max_check_attempts 10
check_interval 0
retry_interval 1
obsess_over_host 1
active_checks_enabled 1
passive_checks_enabled 0
check_freshness 0
freshness_threshold 1200


Everything else works fine monitoring-wise, but this problem has been
bugging me for months now. I'm at that crossroads where I'm trying to
determine if this is a bug or if I'm doing something wrong that I can't
figure out. As far as I can glean from the documentation, this isn't
how this is supposed to work given the way I've configured things.

Thanks

Mark






This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]