NCPA timeout on check results in CRITICAL alert
Posted: Fri Aug 11, 2017 7:06 am
Hi,
Why are CRITICAL service notifications sent out for a service check that times out? Is there a way to suppress them?
Running Nagios XI 5.4.4 on RHEL 6 with NCPA 2.0.3 (Windows, Linux).
Some of our service checks call an NCPA plugin (a PowerShell script) that sometimes times out. Strangely enough, this results in a CRITICAL alert for the service and an accompanying notification. I would expect an UNKNOWN alert, based on previous experience and on the code of check_ncpa.py in Git.
I have done the following debugging:
1. Run the script from the command line and, when it times out, record the return values from check_ncpa.py. This looks good: the exit value is 3 and the stdout text contains UNKNOWN:
Code: Select all
bash-4.1$ time /usr/local/nagios/libexec/check_ncpa.py -H <host> -t '<token>' -P 5693 -M 'plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error'
UNKNOWN: Execution exceeded timeout threshold of 60s
real 1m0.050s
user 0m0.046s
sys 0m0.022s
bash-4.1$ echo $?
3
2. Run the same from the NCPA GUI. The exit value and stdout text differ:
Code: Select all
https://<host>:5693/api/plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error
{ "returncode": 1, "stdout": "Error: Plugin command timed out. (60 sec)" }
3. Check the Nagios Event Log: for these timeouts a Warning is thrown for the service check, followed by a critical service alert. See attached Hc_3151.jpg.
4. Check Nagios Notifications: a CRITICAL notification is sent out. See attached Hc_3150.jpg.
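To make the discrepancy between steps 1 and 2 concrete, here is a minimal Python sketch that maps both results to Nagios state names. It assumes the standard plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); the helper names `state_from_returncode` and `state_from_api_payload` are hypothetical and are not part of check_ncpa.py or NCPA.

```python
import json

# Standard Nagios plugin exit codes (assumed convention, 0 through 3).
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_from_returncode(returncode):
    """Map a plugin return code to a Nagios state name (hypothetical helper)."""
    return NAGIOS_STATES[returncode]


def state_from_api_payload(payload_text):
    """Return the state implied by the JSON body of an NCPA API response."""
    payload = json.loads(payload_text)
    return state_from_returncode(payload["returncode"])


# Step 1: check_ncpa.py exited with 3 when its own timeout was hit.
cli_state = state_from_returncode(3)

# Step 2: JSON body returned by the NCPA API for the same plugin call.
api_body = '{ "returncode": 1, "stdout": "Error: Plugin command timed out. (60 sec)" }'
api_state = state_from_api_payload(api_body)

print(cli_state, api_state)  # UNKNOWN WARNING
```

So the same timeout shows up as UNKNOWN on the command line and as WARNING through the agent, and neither of those is the CRITICAL that ends up in the notification.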
Here is one of the service definitions that exhibits this behaviour:
Code: Select all
define service {
    service_description    MSSQL Windows application log event ID 17063
    use                    xiwizard_ncpa_service
    hostgroup_name         ACC SQL Server hosts,PRD SQL Server hosts
    display_name           MSSQL event 17063
    servicegroups          MS SQL Server services
    check_command          check_xi_ncpa_agent!-t '<token>' -P 5693 -M 'plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error'!!!!!!!
    max_check_attempts     1
    check_interval         4
    retry_interval         1
    check_period           xi_timeperiod_24x7
    notification_options   c,
    contact_groups         ISD SQL Server team
    register               1
}
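The relevant settings in this definition are `notification_options c,` and `max_check_attempts 1`: only CRITICAL results should notify, and a single bad result is enough to reach a hard state. Below is a minimal Python sketch of that filter, assuming the standard Nagios option letters (w = warning, u = unknown, c = critical). The helper names are hypothetical, and this deliberately ignores everything else Nagios takes into account (state type, time periods, downtime and so on).

```python
# Option letters used by the Nagios service directive notification_options
# (problem states only; recovery, flapping and downtime are left out here).
OPTION_FOR_STATE = {"WARNING": "w", "UNKNOWN": "u", "CRITICAL": "c"}


def parse_notification_options(value):
    """Split a notification_options value such as 'c,' into a set of letters."""
    return {opt.strip() for opt in value.split(",") if opt.strip()}


def state_notifies(state, notification_options):
    """Return True if the given problem state is enabled for notifications."""
    enabled = parse_notification_options(notification_options)
    return OPTION_FOR_STATE[state] in enabled


for state in ("WARNING", "UNKNOWN", "CRITICAL"):
    print(state, state_notifies(state, "c,"))
# WARNING False
# UNKNOWN False
# CRITICAL True
```

Under this reading, neither the UNKNOWN from the command line nor the WARNING from the agent would trigger a notification, which is why the CRITICAL alert is the puzzling part.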
Edit: I know how to increase the timeout value, and I know the check should be made quicker, but that's not the issue here.