NCPA timeout on check results in CRITICAL alert

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mvndnburg
Posts: 64
Joined: Wed Sep 21, 2016 2:53 am

NCPA timeout on check results in CRITICAL alert

Post by mvndnburg »

Hi,

Running Nagios XI 5.4.4 on RHEL 6 with NCPA 2.0.3 (Windows, Linux).


Some of our service checks call an NCPA plugin (a PowerShell script) that sometimes times out. Strangely enough, this results in a CRITICAL alert for the service and an accompanying notification. I would expect an UNKNOWN alert, based on previous experience and on reading the code of check_ncpa.py in Git.

I have done the following debugging:

1. Run the check from the command line and, when it times out, record what check_ncpa.py returns. This looks good: the exit value is 3 and the stdout text contains UNKNOWN:
bash-4.1$ time /usr/local/nagios/libexec/check_ncpa.py -H <host> -t '<token>' -P 5693 -M 'plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error'
UNKNOWN: Execution exceeded timeout threshold of 60s

real 1m0.050s
user 0m0.046s
sys 0m0.022s

bash-4.1$ echo $?
3
2. Run the same from the NCPA GUI. Exit value and stdout text differ:
https://<host>:5693/api/plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error


{ "returncode": 1, "stdout": "Error: Plugin command timed out. (60 sec)" }
3. Check the Nagios Event Log: for these timeouts a Warning is thrown for the service check, followed by a critical service alert. See attached Hc_3151.jpg.

4. Check Nagios Notifications, a CRITICAL notification is sent out. See attached Hc_3150.jpg.
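
To make the mismatch between steps 1 and 2 explicit, here is a minimal Python 3 sketch that maps the two observed return values onto the standard Nagios plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The names NAGIOS_STATES and state_name are made up for this illustration and are not part of check_ncpa.py or NCPA; treating out-of-range codes as UNKNOWN is also an assumption of the sketch.

```python
# Standard Nagios plugin return codes and the states they map to.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_name(returncode):
    """Return the Nagios state name for a plugin return code."""
    # Assumption of this sketch: anything outside 0-3 is reported as UNKNOWN.
    return NAGIOS_STATES.get(returncode, "UNKNOWN")


# Values observed in steps 1 and 2 above.
cli_exit = 3        # exit value of check_ncpa.py on timeout
api_returncode = 1  # "returncode" in the NCPA API JSON on timeout

print("check_ncpa.py:", state_name(cli_exit))
print("NCPA API:     ", state_name(api_returncode))
```

So on a timeout the command line reports UNKNOWN and the NCPA API reports WARNING; neither of them is CRITICAL by itself.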


Here is one of the service definitions that exhibits this behaviour:

Code:

define service {
        service_description             MSSQL Windows application log event ID 17063
        use                             xiwizard_ncpa_service
        hostgroup_name                  ACC SQL Server hosts,PRD SQL Server hosts
        display_name                    MSSQL event 17063
        servicegroups                   MS SQL Server services
        check_command                   check_xi_ncpa_agent!-t '<token>' -P 5693 -M 'plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error'!!!!!!!
        max_check_attempts              1
        check_interval                  4
        retry_interval                  1
        check_period                    xi_timeperiod_24x7
        notification_options            c,
        contact_groups                  ISD SQL Server team
        register                        1
        }
Why are the CRITICAL service notifications sent out for this service check that times out? Is there a way to suppress it?

Edit: I know how to increase the timeout value and I know the check should be made quicker, but that's not the issue ;)
--
Martijn
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: NCPA timeout on check results in CRITICAL alert

Post by lmiltchev »

You can add the following directive in the nagios.cfg file:

Code:

service_check_timeout_state=u
so that the state caused by timeouts changes from critical to unknown.

https://assets.nagios.com/downloads/nag ... gmain.html
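
As a rough illustration of what the directive selects, the Python 3 sketch below reads nagios.cfg-style text and reports which state a timed-out service check would get. The letter mapping (c, u, w, o, with c as the default) follows the Nagios Core main configuration documentation; the function name timeout_state and the sample config text are hypothetical, and this is not how Nagios itself parses its config.

```python
# Letters accepted by service_check_timeout_state, per the Nagios Core docs.
TIMEOUT_STATES = {"c": "CRITICAL", "u": "UNKNOWN", "w": "WARNING", "o": "OK"}


def timeout_state(cfg_text):
    """Return the state a timed-out service check gets for this config text."""
    value = "c"  # Nagios Core default when the directive is absent
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, val = line.partition("=")
        if sep and key.strip() == "service_check_timeout_state":
            value = val.strip()
    if value not in TIMEOUT_STATES:
        raise ValueError("unexpected service_check_timeout_state: %r" % value)
    return TIMEOUT_STATES[value]


# Hypothetical excerpt of a nagios.cfg after the change.
sample_cfg = """
service_check_timeout=60
service_check_timeout_state=u
"""
print(timeout_state(sample_cfg))
```

Without the directive the sketch falls back to CRITICAL, which matches the alerts described in the first post.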

Also, you could set a timeout on the plugin itself using the -T option, choosing a value lower than the default of 60 sec in the main config file (service_check_timeout=60):

Example:

Code:

check_command                   check_xi_ncpa_agent!-t '<token>' -P 5693 -T 50 -M 'plugins/Nagios_Plugin_eventfinder_Application_log.ps1/3221242535/Error'
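
The idea behind a lower plugin timeout is that the plugin gives up and reports its own state before the scheduler's service_check_timeout fires. A minimal Python 3 sketch of that pattern follows; it is not the check_ncpa.py implementation, and run_with_timeout plus the sleeping child process are stand-ins invented for this example.

```python
import subprocess
import sys

UNKNOWN = 3  # standard Nagios plugin return code for UNKNOWN


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return (returncode, output), or an UNKNOWN result on timeout."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        msg = "UNKNOWN: Execution exceeded timeout threshold of %ss" % timeout_s
        return UNKNOWN, msg
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a slow plugin: a child process that sleeps for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    code, text = run_with_timeout(slow, 1)
    print(text)
    sys.exit(code)
```

With -T 50 against service_check_timeout=60, the same ordering applies: the check returns its own result at 50 seconds, so the 60 second scheduler timeout is never reached.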
mvndnburg
Posts: 64
Joined: Wed Sep 21, 2016 2:53 am

Re: NCPA timeout on check results in CRITICAL alert

Post by mvndnburg »

Hi,

Code:

service_check_timeout_state=u
... does the trick.

Thanks. You may close the thread.
--
Martijn