Service Check Timed Out Alert returning as Critical Alert

Posted: Fri Oct 20, 2017 11:24 am
by Sampath.Basireddy
Hello There,

How can I change the "Service Check Timed Out On Worker" alerts from CRITICAL to UNKNOWN?

They are generating a huge number of email alerts and incidents in our ITSM tool.

Thank You.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 11:42 am
by npolovenko
Hello, @Sampath.Basireddy.

What service check are you using? Can you upload the plugin here?
Sometimes plugins will allow you to define a critical state. Try running your_plugin with --help.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 11:48 am
by Sampath.Basireddy
This issue is with check_ncpa.py.

Usage: check_ncpa.py [options]

Options:
  -h, --help            show this help message and exit
  -H HOSTNAME, --hostname=HOSTNAME
                        The hostname to be connected to.
  -M METRIC, --metric=METRIC
                        The metric to check, this is defined on client system.
                        This would also be the plugin name in the plugins
                        directory. Do not attach arguments to it, use the -a
                        directive for that. DO NOT INCLUDE the api/
                        instruction.
  -P PORT, --port=PORT  Port to use to connect to the client.
  -w WARNING, --warning=WARNING
                        Warning value to be passed for the check.
  -c CRITICAL, --critical=CRITICAL
                        Critical value to be passed for the check.
  -u UNITS, --units=UNITS
                        The unit prefix (k, Ki, M, Mi, G, Gi, T, Ti) for b and
                        B unit types which calculates the value returned.
  -n UNIT, --unit=UNIT  Overrides the unit with whatever unit you define. Does
                        not perform calculations. This changes the unit of
                        measurement only.
  -a ARGUMENTS, --arguments=ARGUMENTS
                        Arguments for the plugin to be run. Not necessary
                        unless you're running a custom plugin. Given in the
                        same as you would call from the command line. Example:
                        -a '-w 10 -c 20 -f /usr/local'
  -t TOKEN, --token=TOKEN
                        The token for connecting.
  -T TIMEOUT, --timeout=TIMEOUT
                        Enforced timeout, will terminate plugins after this
                        amount of seconds. [60]
  -d, --delta           Signals that this check is a delta check and a local
                        state will kept.
  -l, --list            List all values under a given node. Do not perform a
                        check.
  -v, --verbose         Print more verbose error messages.
  -D, --debug           Print LOTS of error messages. Used mostly for
                        debugging.
  -V, --version         Print version number of plugin.
  -q QUERYARGS, --queryargs=QUERYARGS
                        Extra query arguments to pass in the NCPA URL.
  -s, --secure          Require successful certificate verification. Does not
                        work on Python < 2.7.9.
  -p, --performance     Print performance data even when there is none. Will
                        print data matching the return code of this script

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 12:53 pm
by npolovenko
@Sampath.Basireddy, is there any particular reason that your checks are timing out? Is it just one check or all of them?
Have you considered increasing the timeout with the --timeout flag? From the help output:

-T TIMEOUT, --timeout=TIMEOUT
                        Enforced timeout, will terminate plugins after this
                        amount of seconds. [60]
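
As a rough illustration of the suggestion above, the following Python sketch builds a check_ncpa.py command line with a longer timeout. Only the -H, -t, -M and -T flags come from the help output; the host address, token, metric name, plugin path and the 120-second value are placeholder assumptions, not values from this thread.

```python
import shlex


def build_check_ncpa_command(hostname, token, metric, timeout=120,
                             plugin="./check_ncpa.py"):
    """Return the argv list for a check_ncpa.py call with an explicit -T timeout.

    The flags are the ones listed in the plugin's --help output.
    The plugin path and default timeout are assumptions for illustration.
    """
    return [plugin, "-H", hostname, "-t", token, "-M", metric,
            "-T", str(timeout)]


# Placeholder values: replace with your own host, token and metric.
command = build_check_ncpa_command("192.0.2.10", "mytoken", "cpu/percent",
                                   timeout=120)
print(shlex.join(command))
```

Printing the command rather than running it makes it easy to paste into a shell and test by hand before changing the check definition.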
Also, can you send us the plugin itself? In your case NCPA works as an agent that calls plugins on your Windows machine, so we would like to take a look at those.

Can you also send in your Nagios XI System Profile so I can review it? To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and upload it here. You could also send it in a PM. If you send it in a PM, please let us know here as well.
Thanks.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 2:40 pm
by Sampath.Basireddy
One cause was NCPA looking for NFS on Linux servers. We do have NCPA agent 2.0.5 installed on all of the servers, which I think resolved the NFS error that kept the agent from starting at server reboot.

I don't think increasing the timeout would help here: in most cases the problem takes more than a few minutes to resolve on its own, or it never resolves without manual intervention.
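
When a longer timeout would not help, one generic option is a small wrapper that enforces its own timeout and reports UNKNOWN instead of CRITICAL. The sketch below is only an illustration of that idea, not something proposed in this thread. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function name `run_check` and the timeout values are hypothetical, and it assumes the wrapper's timeout is shorter than the one enforced by the monitoring server or worker, otherwise the wrapper itself would be killed first.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(argv, timeout_seconds, timeout_state=UNKNOWN):
    """Run a plugin command and return (exit_code, output).

    If the command exceeds timeout_seconds it is killed and
    timeout_state is returned instead of the plugin's own exit code.
    """
    try:
        result = subprocess.run(argv, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, text=True,
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return timeout_state, "Check timed out after %s seconds" % timeout_seconds
    except OSError as exc:
        # The command could not be started at all (missing file, permissions).
        return UNKNOWN, "Could not run check: %s" % exc
    # Any exit code outside the four standard ones is reported as UNKNOWN.
    valid = (OK, WARNING, CRITICAL, UNKNOWN)
    code = result.returncode if result.returncode in valid else UNKNOWN
    return code, result.stdout.strip()


# Demonstration with a stand-in command that sleeps longer than the timeout.
code, output = run_check([sys.executable, "-c", "import time; time.sleep(5)"],
                         timeout_seconds=1)
print(code, output)
```

In real use the stand-in command would be replaced by the actual plugin call, and the calling script would print the output and exit with the returned code.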

I uploaded the check_ncpa.py plugin in my last post.

I will PM you the system profile.

Thank You,
Sampath.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 2:47 pm
by Sampath.Basireddy
@npolovenko

I sent you the System Profile via Personal Message.

Thank You,

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 2:51 pm
by npolovenko
@Sampath.Basireddy, I received your PM, but unfortunately without the attachment. After you select the attachment file, you need to click the upload button. Otherwise, you could upload it to a service like Google Drive and share a link with me in the PM.
Thank you.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Fri Oct 20, 2017 3:07 pm
by Sampath.Basireddy
I did not notice that. The file size is 1.2 MB and the limit for sending via PM is 1 MB.

I attached the zip as 2 files in the personal message.

Uploading to Google Drive or another shared location is restricted in our environment.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Mon Oct 23, 2017 9:52 am
by npolovenko
@Sampath.Basireddy, thank you. I received your profile and will get back to you shortly.

Re: Service Check Timed Out Alert returning as Critical Aler

Posted: Mon Oct 23, 2017 10:22 am
by npolovenko
Hello, @Sampath.Basireddy. Just to clarify, you have the timeout issue only on one server that has an older version of NCPA, correct? And is the timeout happening because NCPA won't start automatically after a reboot, or because it takes a long time to look through NFS?
Also, do all services for a given host time out, or is it just a particular one?