State of hosts with passive checks

Post by **DanielB** » Thu Sep 18, 2014 2:39 pm

Hi all,

I recently implemented a passive check setup on a remote site. In this scenario I have a host with a dynamic IP address (reachable through a DynDNS domain) which performs the checks on the other hosts of the local network and sends the results to the Nagios server via NSCA.
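
For readers less familiar with this kind of setup, here is a minimal sketch of how the remote host might format results before piping them to send_nsca. It assumes send_nsca's default tab-delimited input (three fields for a host result, four for a service result). The function names, the host name CP1, the service name PING, the server name nagios.example.org and the config path are all illustrative assumptions, not the actual configuration used here.

```shell
#!/bin/bash
# Sketch only: builds the tab-delimited lines that send_nsca reads on stdin.

# Host check result: <host_name>\t<return_code>\t<plugin_output>
format_host_result() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

# Service check result:
# <host_name>\t<service_description>\t<return_code>\t<plugin_output>
format_service_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Example submission, left commented out because the server name and
# config path below are assumptions:
# {
#     format_host_result    "CP1" 0 "PING OK"
#     format_service_result "CP1" "PING" 0 "PING OK"
# } | send_nsca -H nagios.example.org -c /usr/local/nagios/etc/send_nsca.cfg
```

Note that a host result and a service result are separate lines. If the sender only submits service lines, nothing updates the host state on the server unless host freshness checking is in place.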

I have observed the following: if this host is turned off, the Nagios server reports this message for the services (both those on the host itself and those on all the hosts it monitors): "WARNING: Did not receive service status report for a long time!" This comes from using check_freshness, freshness_threshold and the no_report_warn command in the service definition:

Code:

define command{
        command_name    no_report_warn
        command_line    $USER1$/no_report_warn.sh
        }

Code:

#!/bin/bash
# file: /usr/local/nagios/libexec/no_report_warn.sh
 
echo "WARNING: Did not receive service status report for a long time!";
exit 1;

Well, this is what I expected to happen with the services if the host is down. But I have also noticed that both the host that was turned off and the hosts it monitors passively are all displayed in Nagios as "Up". The only explanation I can think of for the turned-off host is that the IP address it had is now used by another DynDNS client, so check_ping does not fail for it.

But I'm not sure what causes the passively monitored hosts to appear as "Up". I think I tried using check_host_freshness and host_freshness_check_interval in the host definition, with no_report_warn as the check_command, but it did not produce any change.

Best regards,
Daniel

Post by **DanielB** » Thu Sep 18, 2014 2:56 pm

DanielB wrote: I think I had tried to use check_host_freshness and host_freshness_check_interval in the host definition with no_report_warn on the check_command, but it did not produce any change.

Sorry, I meant check_freshness and freshness_threshold.

Best regards,
Daniel

Post by **DanielB** » Thu Sep 18, 2014 6:23 pm

I have noticed that nagios.cfg has:

Code:

check_host_freshness=0

so I've set it to 1. With this host definition:

Code:

define host {
  use generic-switch-external
  host_name CP1
  alias CP1
  address 192.168.10.2
  parents NSCA-site1
  icon_image cook/switch.gif
  statusmap_image cook/switch.gd2
  notification_interval 0          ; Only send notifications on status change by default.
  passive_checks_enabled 1
  active_checks_enabled 0
  ; DGB - 20140918
  check_freshness       1
  freshness_threshold   1800
  check_command         no_report_warn
}

I noticed in the Nagios log that it detects the change in the "freshness":

Code:

[1411080871] Warning: The results of host 'CP1' are stale by 0d 0h 1m 0s (threshold=0d 0h 30m 0s).  I'm forcing an immediate check of the host.

But in "Host State Information" I see:

Code:

Host Status:       	UP  (for 13d 4h 18m 48s)
Status Information:	WARNING: Did not receive service status report for a long time!

That is, judging by the warning message, the script is running, but the state remains "UP". I expected the state to change when the script returns a value of "1".

Best regards,
Daniel

Post by **DanielB** » Fri Sep 19, 2014 8:43 am

DanielB wrote: I noticed in the Nagios log that it detects the change in the "freshness":
Code:
[1411080871] Warning: The results of host 'CP1' are stale by 0d 0h 1m 0s (threshold=0d 0h 30m 0s).  I'm forcing an immediate check of the host.
But in "Host State Information" I see:
Code:
Host Status:       	UP  (for 13d 4h 18m 48s)
Status Information:	WARNING: Did not receive service status report for a long time!

I was reading the Nagios Plugin API document and found the answer to my question there:

If the use_aggressive_host_checking option is enabled, return codes of 1 will result in a host state of DOWN or UNREACHABLE. Otherwise return codes of 1 will result in a host state of UP. The process by which Nagios determines whether or not a host is DOWN or UNREACHABLE is discussed here.
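
The rule quoted above can be sketched as a small lookup. This is only an illustration: the function name host_state_for is made up, and the handling of return codes 2 and 3 follows the return-code table in the same Plugin API document as I understand it. Whether the final state is DOWN or UNREACHABLE depends on the parent hosts, which this sketch does not model.

```shell
#!/bin/bash
# Sketch: map a plugin return code to the resulting host state.
# $1 = plugin return code
# $2 = value of use_aggressive_host_checking (0 or 1, default 0)
host_state_for() {
    local rc="$1" aggressive="${2:-0}"
    case "$rc" in
        0) echo "UP" ;;
        1) if [ "$aggressive" -eq 1 ]; then
               echo "DOWN/UNREACHABLE"
           else
               echo "UP"
           fi ;;
        *) echo "DOWN/UNREACHABLE" ;;
    esac
}

host_state_for 1 0   # prints UP
host_state_for 1 1   # prints DOWN/UNREACHABLE
host_state_for 3 0   # prints DOWN/UNREACHABLE
```

This matches what was observed: with the option disabled, a freshness script that exits with 1 leaves the host "UP" even though the status text says WARNING.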

So I changed the script to return a value of "3", which means UNKNOWN for services and, according to the same document, DOWN or UNREACHABLE for hosts. That seemed the most appropriate for the situation where the host sending the passive checks is turned off.

Code:

define command{
        command_name    no_report_unknown
        command_line    $USER1$/no_report_unknown.sh
        }

Code:

#!/bin/bash
# file: /usr/local/nagios/libexec/no_report_unknown.sh
 
echo "UNKNOWN: Did not receive service status report for a long time!";
exit 3;
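
As a possible variation, the two scripts could be merged into one that takes the return code as an argument. This is a hypothetical sketch, not something tested on the setup above; the name no_report and the choice to fall back to UNKNOWN for any unexpected argument are assumptions.

```shell
#!/bin/bash
# Hypothetical generalization of no_report_warn.sh / no_report_unknown.sh.
# $1 = desired return code (1, 2 or 3); anything else falls back to 3.
no_report() {
    local rc="${1:-3}" label
    case "$rc" in
        1) label="WARNING" ;;
        2) label="CRITICAL" ;;
        *) label="UNKNOWN"; rc=3 ;;
    esac
    echo "${label}: Did not receive service status report for a long time!"
    return "$rc"
}

# In a standalone script the last two lines would be:
#   no_report "$1"
#   exit $?
```

The command definition could then pass the code with the standard $ARG1$ macro, for example command_line $USER1$/no_report.sh $ARG1$ together with check_command no_report!3, where no_report.sh is again an assumed file name.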

Best regards,
Daniel

Post by **sreinhardt** » Fri Sep 19, 2014 1:44 pm

Glad you got it all figured out; it looks like you made all the correct changes! One thing I would note, unrelated to your actual issue, is that we check things on the forum from oldest to newest. Multi-posting like you did here will therefore keep moving you further down the list. Just wanted to make sure you were aware: we generally suggest editing previous posts instead, so that we can keep you in the correct spot on our list and get you a timely response!

Nagios Support Forum
