Running Nagios XI 5.9.2
Is it expected that an active service check will be automatically disabled by Nagios when a command (in this case, NRPE) times out?
Over time, active checks have been automatically disabled for a not-insignificant percentage of my monitored hosts. The service check status for those sites still shows Ok/Green in the dashboard, but "Service Status Detail" shows Next Check set to "Not Scheduled" and the Active Check flag set to Disabled. If I enable active checks, the next check gets scheduled and things resume as expected (for a while).
nagios.log shows this:
[1683717541] SERVICE ALERT: host2;sit_devices;UNKNOWN;HARD;1;(No output on stdout) stderr: connect to address 10.192.26.1 port 5666: Connection timed out
...
[1683717767] Error: External command failed -> DISABLE_SVC_CHECK;host2;sit_devices
I interpret this to mean that because the active check command timed out once, Nagios has decided to place this host-service check combination in disabled status.
It's not uncommon for our monitored sites to have unexpected network outages or server downtime.
Is there a config setting to prevent the disable of the active check in this command timeout situation? If this is desired behavior to prevent thread overload, is there a way to get Nagios to automatically re-enable checks that get into this state?
This behavior lowers the integrity of our Nagios dashboards, as users stop trusting the statuses being reported.
Thanks in advance,
John
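For anyone wanting to see how widespread this is, here is a rough sketch that lists every host;service pair named in a DISABLE_SVC_CHECK log entry like the one quoted above. The function name `disabled_pairs` is made up for illustration, and the log path in the usage note is an assumption based on a default install, so adjust as needed.

```shell
#!/bin/sh
# Hypothetical helper: list the unique host;service pairs that appear in
# DISABLE_SVC_CHECK entries of a Nagios log file.
disabled_pairs() {
    # $1 = path to nagios.log
    # Keep only lines naming the external command, strip everything up to
    # and including the command name, then de-duplicate.
    grep 'DISABLE_SVC_CHECK;' "$1" | sed 's/.*DISABLE_SVC_CHECK;//' | sort -u
}
```

Usage would look something like `disabled_pairs /usr/local/nagios/var/nagios.log` (path assumed). It only reports what was logged; it does not tell you what issued the command.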
command check timeout disables active checks
Re: command check timeout disables active checks
What is host_down_disable_service_checks set to in nagios.cfg?
You might look at the nagios.cfg doc and search for orphan to see if it might apply, as well as looking for other settings that might be the cause.
You might look at the service check timeout setting in nagios.cfg and make sure that the check_nrpe commands defined have a timeout less than the global timeout.
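A quick way to check those settings from the shell might look like the sketch below. The helper names `get_directive` and `timeout_ok` are hypothetical, and it assumes nagios.cfg uses plain `key=value` lines with `#` for comments.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a directive from a "key=value"
# config file. Prints nothing if the directive is absent or commented out.
get_directive() {
    # $1 = config file, $2 = directive name
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1"
}

# Hypothetical helper: succeed only if the plugin timeout is strictly
# below the global service check timeout (both in seconds).
timeout_ok() {
    # $1 = plugin timeout, $2 = service_check_timeout
    [ "$1" -lt "$2" ]
}
```

For example, `get_directive /usr/local/nagios/etc/nagios.cfg service_check_timeout` prints the global timeout, and `timeout_ok 300 "$(get_directive /usr/local/nagios/etc/nagios.cfg service_check_timeout)"` succeeds when a 300 second check_nrpe timeout fits under it.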
Re: command check timeout disables active checks
Thanks for the response.
/usr/local/nagios/etc/nagios.cfg has no entry for host_down_disable_service_checks
Will do on searching the nagios.cfg doc for orphan and other related settings.
/usr/local/nagios/etc/nagios.cfg has service_check_timeout=500
Core Config Manager shows the following for check_nrpe command:
$USER1$/check_nrpe -2 -H $HOSTADDRESS$ -u -t 300 -c $ARG1$ $ARG2$ $ARG3$
All monitored hosts have the same nrpe.cfg settings:
command_timeout=300
connection_timeout=300
Thanks for the direction. I'll dig some.
John
Re: command check timeout disables active checks
FYI, I determined the cause of the issue. Figured I'd provide a resolution for others in a similar boat.
I was converting the service checks in question from PASSIVE checks to ACTIVE checks.
/usr/local/nagios/libexec/eventhandlers/disable-service-checks.sh
- this script fires when a host status changes
- pre-existing custom code in there to pass in external command DISABLE_SVC_CHECK when the host status changes to UP
- I'm not quite sure what DISABLE_SVC_CHECK on a passive check does--this could have been a mistake or intentional
- changed those lines to be ENABLE_SVC_CHECK when host status changes to UP
- may be redundant as there is an ENABLE_HOST_SVC_CHECKS above it
Service checks are staying current now.
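For reference, a minimal sketch of what the changed handler logic could look like. This is not the actual contents of disable-service-checks.sh; the function names are made up, and the command file path in the usage note is an assumption based on a default install. It writes an ENABLE_SVC_CHECK external command in the `[timestamp] COMMAND;host;service` format when the host state is UP.

```shell
#!/bin/sh
# Hypothetical sketch of a host event handler that re-enables active checks
# for one service when the host returns to UP.

# Format an external command line: "[epoch] COMMAND;host;service"
format_svc_command() {
    # $1 = epoch seconds, $2 = command name, $3 = host, $4 = service
    printf '[%s] %s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Append ENABLE_SVC_CHECK to the command file only when the host is UP;
# any other host state is ignored.
handle_host_state() {
    # $1 = host state, $2 = host, $3 = service, $4 = command file
    case "$1" in
        UP)
            format_svc_command "$(date +%s)" ENABLE_SVC_CHECK "$2" "$3" >> "$4"
            ;;
    esac
}
```

In a real handler the call might be `handle_host_state "$1" host2 sit_devices /usr/local/nagios/var/rw/nagios.cmd`, with the host state passed in as the first argument by the event handler command definition (path and argument order assumed here).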
Re: command check timeout disables active checks
I've just installed and configured the checkmk agent on some AIX hosts. My issue is that the checks are taking too long.