command check timeout disables active checks
Posted: Wed May 10, 2023 10:15 am
Running Nagios XI 5.9.2
Is it expected that an active service check will be automatically disabled by Nagios when a command (in this case, NRPE) times out?
Over time, active checks have been automatically disabled for a noticeable percentage of my monitored hosts. The service check status for those sites still shows OK/green in the dashboard, but "Service Status Detail" shows Next Check set to "Not Scheduled" and the Active Check flag set to Disabled. If I re-enable active checks, the next check gets scheduled and things resume as expected (for a while).
nagios.log shows this:
[1683717541] SERVICE ALERT: host2;sit_devices;UNKNOWN;HARD;1;(No output on stdout) stderr: connect to address 10.192.26.1 port 5666: Connection timed out
...
[1683717767] Error: External command failed -> DISABLE_SVC_CHECK;host2;sit_devices
I interpret this to mean that because the active check command timed out once, Nagios has decided to place this host-service check combination in disabled status.
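To see how often this happens and for which services, the log can be scanned for these entries. Below is a minimal Python sketch, assuming the log lines follow the `[epoch] message` layout shown above; the function name `find_disable_events` and the regular expressions are my own illustration, not part of Nagios.

```python
import re
from datetime import datetime, timezone

# Assumed layout, taken from the excerpt above: "[<epoch seconds>] <message>"
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")
# Matches the external command name followed by host and service description
DISABLE = re.compile(r"DISABLE_SVC_CHECK;([^;]+);(.+)$")

def find_disable_events(lines):
    """Return (utc_time, host, service) for every DISABLE_SVC_CHECK log entry."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not start with a bracketed timestamp
        epoch, message = int(m.group(1)), m.group(2)
        d = DISABLE.search(message)
        if d:
            when = datetime.fromtimestamp(epoch, tz=timezone.utc)
            events.append((when, d.group(1), d.group(2).strip()))
    return events

if __name__ == "__main__":
    sample = [
        "[1683717541] SERVICE ALERT: host2;sit_devices;UNKNOWN;HARD;1;(No output on stdout) stderr: connect to address 10.192.26.1 port 5666: Connection timed out",
        "[1683717767] Error: External command failed -> DISABLE_SVC_CHECK;host2;sit_devices",
    ]
    for when, host, service in find_disable_events(sample):
        print(when.isoformat(), host, service)
```

In the excerpt above, the two timestamps are 226 seconds apart (1683717767 - 1683717541).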
It's not uncommon for our monitored sites to have unexpected network outages or server downtime.
Is there a config setting that prevents the active check from being disabled when the command times out? If this is intended behavior to prevent thread overload, is there a way to get Nagios to automatically re-enable checks that end up in this state?
This behavior lowers the integrity of our Nagios dashboards, as users stop trusting the statuses being reported.
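As a stopgap for the re-enable question, something like the following could run from cron. It is only a sketch, under these assumptions: `status.dat` stores each service in a `servicestatus { ... }` block with `host_name`, `service_description`, and `active_checks_enabled` keys, and the external command `ENABLE_SVC_CHECK;<host_name>;<service_description>` is accepted through the Nagios command file. The helper names are hypothetical, and the file paths (which vary by install) are left to the caller.

```python
import time

def disabled_services(status_text):
    """Return (host, service) pairs whose servicestatus block has active_checks_enabled=0."""
    found = []
    block = None  # dict for the servicestatus block currently being read, else None
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            block = {}
        elif line == "}":
            if block is not None and block.get("active_checks_enabled") == "0":
                found.append((block.get("host_name"), block.get("service_description")))
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return found

def enable_commands(services, now=None):
    """Build one ENABLE_SVC_CHECK external command line per (host, service) pair."""
    ts = int(time.time()) if now is None else int(now)
    return [f"[{ts}] ENABLE_SVC_CHECK;{host};{svc}" for host, svc in services]

if __name__ == "__main__":
    # Illustrative input only; a real run would read the contents of status.dat.
    example = "servicestatus {\n\thost_name=host2\n\tservice_description=sit_devices\n\tactive_checks_enabled=0\n\t}\n"
    for cmd in enable_commands(disabled_services(example)):
        print(cmd)  # each line would be appended to the Nagios command file
```

This would also re-enable services whose active checks were turned off on purpose (passive-only services, for example), so an allow-list of hosts or services would be needed in practice. It treats the symptom only and does not explain what is submitting DISABLE_SVC_CHECK in the first place.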
Thanks in advance,
John