[Nagios-devel] passive on-demand host checks being converted from

Post by **Guest** » Mon Jul 07, 2008 12:38 pm

I've been seeing a problem with on-demand host checking since we moved
to a distributed setup. We're running Nagios 3.0.2 with a central
server that does virtually no checks. All checks are performed by 2
other distributed servers.

I have an example situation here where the distributed node detects a
service failure, then a host failure. On the distributed node, I see
(newest entries first):

[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )
[07-07-2008 15:29:42] HOST ALERT: mfrost_win;DOWN;SOFT;9;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:28:40] HOST ALERT: mfrost_win;DOWN;SOFT;8;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:27:38] HOST ALERT: mfrost_win;DOWN;SOFT;7;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:26:36] HOST ALERT: mfrost_win;DOWN;SOFT;6;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:25:34] HOST ALERT: mfrost_win;DOWN;SOFT;5;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:24:32] HOST ALERT: mfrost_win;DOWN;SOFT;4;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:23:30] HOST ALERT: mfrost_win;DOWN;SOFT;3;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:22:28] HOST ALERT: mfrost_win;DOWN;SOFT;2;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:22:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:21:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

But for the corresponding set of activities I see the following on the
central/reporting server:

[07-07-2008 15:22:29] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )
[07-07-2008 15:21:33] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
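
To make the difference between the two logs easier to see, here is a small Python sketch that splits a HOST ALERT line into its fields. It is only an illustration: the `HostAlert` type and `parse_host_alert` function are names I made up, and the field order (host;state;state type;attempt;output) is simply what the log lines above show.

```python
from typing import NamedTuple


class HostAlert(NamedTuple):
    """Fields of a HOST ALERT log line, in the order they appear above."""
    host: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_host_alert(line: str) -> HostAlert:
    """Parse '[timestamp] HOST ALERT: host;state;type;attempt;output'."""
    # Everything after the 'HOST ALERT: ' marker is the semicolon-separated payload.
    _, _, payload = line.partition("HOST ALERT: ")
    # Split into at most 5 parts so semicolons in the plugin output survive.
    host, state, state_type, attempt, output = payload.split(";", 4)
    return HostAlert(host, state, state_type, int(attempt), output)


# First host alert as logged on the distributed node.
distributed = parse_host_alert(
    "[07-07-2008 15:21:26] HOST ALERT: "
    "mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )"
)
# The same failure as logged on the central/reporting server.
central = parse_host_alert(
    "[07-07-2008 15:21:33] HOST ALERT: "
    "mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )"
)

print(distributed.state_type, distributed.attempt)
print(central.state_type, central.attempt)
```

Both entries are attempt 1 for the same host, but the state type differs between the two servers.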

The distributed node seems to do what it's supposed to do and keeps
retrying up to max_check_attempts (10). But when that first (SOFT) ping
failure gets passed to the central/reporting server, the central server
marks the host as HARD DOWN and sends an alert out immediately.
Meanwhile the distributed node continues checking for about nine more
minutes until it determines for itself that the host is in a HARD DOWN
state.
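
Put another way, the two servers appear to follow different rules. The Python sketch below is a toy model of what the logs show, not Nagios code: `active_state_type` and `observed_passive_state_type` are hypothetical names, and the passive branch only mirrors the behaviour I observed on the central server rather than anything I have confirmed in the source.

```python
MAX_CHECK_ATTEMPTS = 10  # max_check_attempts on both servers


def active_state_type(attempt: int, max_check_attempts: int) -> str:
    """State type for the Nth consecutive failed active check.

    The state stays SOFT until the attempt counter reaches
    max_check_attempts, which is what the distributed node logs.
    """
    return "HARD" if attempt >= max_check_attempts else "SOFT"


def observed_passive_state_type() -> str:
    """State type the central server logged for the first passive DOWN result.

    Assumption: this just encodes the observed log entry (HARD at attempt 1).
    """
    return "HARD"


# Distributed node: ten failed active checks in a row.
distributed_view = [
    (n, active_state_type(n, MAX_CHECK_ATTEMPTS))
    for n in range(1, MAX_CHECK_ATTEMPTS + 1)
]

# Central server: the first passive result it receives.
central_view = [(1, observed_passive_state_type())]

print(distributed_view[0], distributed_view[-1])
print(central_view[0])
```

In this model the distributed node spends nine attempts in a SOFT state before going HARD, while the central server is already HARD at attempt 1, which is the mismatch I'm asking about.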

The settings for this host are as follows:

central server:
max_check_attempts 10
check_interval 0
retry_interval 1
obsess_over_host 0
active_checks_enabled 0
passive_checks_enabled 1
check_freshness 1
freshness_threshold 1200

distributed node:
max_check_attempts 10
check_interval 0
retry_interval 1
obsess_over_host 1
active_checks_enabled 1
passive_checks_enabled 0
check_freshness 0
freshness_threshold 1200
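
For quick reference, this short Python sketch lists only the directives that differ between the two definitions. The dictionaries are just the values copied from above; nothing here reads a real Nagios config file.

```python
# Host settings as listed above, copied by hand.
central = {
    "max_check_attempts": 10,
    "check_interval": 0,
    "retry_interval": 1,
    "obsess_over_host": 0,
    "active_checks_enabled": 0,
    "passive_checks_enabled": 1,
    "check_freshness": 1,
    "freshness_threshold": 1200,
}
distributed = {
    "max_check_attempts": 10,
    "check_interval": 0,
    "retry_interval": 1,
    "obsess_over_host": 1,
    "active_checks_enabled": 1,
    "passive_checks_enabled": 0,
    "check_freshness": 0,
    "freshness_threshold": 1200,
}

# Keep only the directives whose values differ, as (central, distributed) pairs.
differing = {
    key: (central[key], distributed[key])
    for key in central
    if central[key] != distributed[key]
}

for key, (central_value, distributed_value) in differing.items():
    print(f"{key}: central={central_value} distributed={distributed_value}")
```

The retry-related settings (max_check_attempts, check_interval, retry_interval) are identical on both sides; only the active/passive, obsess and freshness directives differ.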

Everything else works fine monitoring-wise, but this problem has been
bugging me for months now. I'm at that crossroads where I'm trying to
determine if this is a bug or if I'm doing something wrong that I can't
figure out. As far as I can glean from the documentation, this isn't
how this is supposed to work given the way I've configured things.

Thanks

Mark

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]