About plugin time out state
Posted: Thu Nov 10, 2016 5:39 am
by Pitone_Maledetto
Hi all,
I have a check_raid check via hpacucli set up on several PostgreSQL database servers with Nagios Core 4.2.1.
All is fine, but on a couple of servers I get a CRITICAL notification when the plugin takes more than 120 seconds to reply.
I wanted to avoid the alert altogether, since the check then returns an OK exit status on the second or third try, but I can't remove c from the notification_options list.
So I went and modified the nagios.cfg file:
# SERVICE CHECK TIMEOUT STATE
# This setting determines the state Nagios will report when a
# service check times out - that is does not respond within
# service_check_timeout seconds. This can be useful if a
# machine is running at too high a load and you do not want
# to consider a failed service check to be critical (the default).
# Valid settings are:
# c - Critical (default)
# u - Unknown
# w - Warning
# o - OK
service_check_timeout_state=u
to no avail, since I still get:
servername/Arrays is CRITICAL:
CRITICAL - Plugin timed out
I have resorted to increasing the timeout from 120 to 180 seconds, but I really would not like to go much higher than that just for the sake of silencing one check.
The plugin itself has no switch to change the timeout state from CRITICAL to UNKNOWN, so I am just wondering why the main configuration setting does not change the timeout exit status globally.
Thank you all for the support.
Ciao
Re: About plugin time out state
Posted: Thu Nov 10, 2016 3:56 pm
by avandemore
That seemed extremely long for any type of raid check so I looked at this:
http://h20564.www2.hpe.com/hpsc/doc/pub ... -c03696601
https://github.com/glensc/nagios-plugin ... -138866801
Do either of those apply to you?
Re: About plugin time out state
Posted: Thu Nov 10, 2016 4:38 pm
by tgriep
Can you post your full nagios.cfg file so we can check its configuration for the UNKNOWN state issue you are having when the timeout occurs?
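As a quick first look before going through the whole file, the two timeout directives discussed in this thread can be pulled out of nagios.cfg with a few lines of Python. This is only a minimal sketch: the function name read_timeout_settings is made up for illustration, and it assumes the plain key=value layout with # comments shown in the first post. It lists every occurrence in file order, so duplicate definitions stand out, and it makes no claim about which occurrence Nagios itself would honour.

```python
def read_timeout_settings(cfg_text):
    """Collect every value given for the two timeout directives.

    Hypothetical helper: it only understands 'key=value' lines and
    skips blank lines and '#' comments.
    """
    wanted = ("service_check_timeout", "service_check_timeout_state")
    found = {name: [] for name in wanted}
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in found:
            # Keep values in file order so duplicates are visible.
            found[key].append(value.strip())
    return found
```

Usage would be something like read_timeout_settings(open("/path/to/nagios.cfg").read()), where the path is whatever your installation uses.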
Re: About plugin time out state
Posted: Fri Nov 11, 2016 4:44 am
by Pitone_Maledetto
Hi both,
Thank you for the reply.
I have read the thread on the check_raid GitHub page that mentions using cciss_vol_status, and I have indeed implemented the cciss check, bypassing the check_raid plugin.
I run cciss directly because of some bugs in the plugin, which I promptly discussed with the plugin developer; I am now waiting for a patch.
Anyhow, in my case the cciss check was causing intermittent loss of heartbeat to some of our custom components, which caused a major incident every 35 minutes, so I had to drop it in favour of hpacucli.
It may very well be an issue specific to our setup; in fact, the cciss checks caused problems on only a few servers and none on the others.
Anyhow, regardless of the plugin used, I would like Nagios to report UNKNOWN when a plugin times out.
Attached you will find my nagios.cfg file.
Thank you.
Re: About plugin time out state
Posted: Fri Nov 11, 2016 9:45 am
by tgriep
I tested the timeout state on Core version 4.2.2 and it worked as expected.
You may need to look at the nagios.log file when the plugin times out and see if there is some clue as to why it is not working for you.
Re: About plugin time out state
Posted: Fri Nov 11, 2016 10:33 am
by Pitone_Maledetto
Just now...
[1478877034] Auto-save of retention data completed successfully.
[1478878259] SERVICE ALERT: servername;Arrays;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
[1478878368] SERVICE ALERT: servername;Arrays;OK;SOFT;2;OK: hpacucli:[Smart Array P410i: Array A(OK)[LUN1:OK]]
Re: About plugin time out state
Posted: Fri Nov 11, 2016 11:09 am
by tgriep
That looks like the normal timeout message generated by the plugin itself, which returned a CRITICAL state.
When a plugin hits the service check timeout that is set in the nagios.cfg file, it will generate a log entry like the one below.
[1478813340] SERVICE ALERT: localhost;test;UNKNOWN;SOFT;1;(Service check timed out after 60.05 seconds)
Mine is set to 60 seconds and you can see the UNKNOWN state and that it was a Service Check Timeout.
Try decreasing the service_check_timeout setting and see if that makes it work for you.
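To tell the two kinds of timeout apart across a whole nagios.log, the two message texts quoted in this thread can be matched mechanically. The following is a rough sketch, not an official tool: the function name classify_timeout is invented here, and it assumes SERVICE ALERT lines always have the semicolon-separated layout seen in the log excerpts above.

```python
import re

# Layout assumed from the log lines quoted in this thread:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)


def classify_timeout(line):
    """Return a dict describing a SERVICE ALERT line, or None for other lines.

    'source' is 'core' when Nagios itself stopped the check, 'plugin' when
    the plugin reported its own timeout, and 'none' otherwise.
    """
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    output = match.group("output")
    if "Service check timed out after" in output:
        source = "core"
    elif "Plugin timed out" in output:
        source = "plugin"
    else:
        source = "none"
    return {
        "host": match.group("host"),
        "service": match.group("service"),
        "state": match.group("state"),
        "source": source,
    }
```

Only lines classified as 'core' are the ones the service_check_timeout_state setting applies to.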
Re: About plugin time out state
Posted: Mon Nov 14, 2016 3:14 am
by Pitone_Maledetto
Hi tgriep,
thank you for the reply.
Why would decreasing the time out setting help?
Would this not make the time out alert trigger more often?
I am testing a check_ilo plugin that should also check the array/LUN statuses, and I have altered that plugin to return an UNKNOWN exit status on timeout instead of CRITICAL.
If service_check_timeout_state won't override the plugin's own exit status (I thought it would), I don't see any solution other than modifying the plugin's exit status itself.
Is there any other configuration that might coincidentally override the global timeout exit status?
Regards
Re: About plugin time out state
Posted: Mon Nov 14, 2016 6:40 am
by Pitone_Maledetto
Just now with check_load...
[1479123101] HOST ALERT: servername;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
[1479123105] SERVICE ALERT: servername;Load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
[1479123133] HOST ALERT: servername;UP;SOFT;2;PING WARNING - Packet loss = 82%, RTA = 108.35 ms
[1479123207] SERVICE ALERT: servername;Load;OK;SOFT;2;OK - load average: 0.00, 0.00, 0.00
Shouldn't the Load check default to the global timeout state of UNKNOWN?
Re: About plugin time out state
Posted: Mon Nov 14, 2016 10:51 am
by tgriep
Most plugins have a built-in timeout. If the plugin's own timeout is reached before the global service check timeout, the plugin returns whatever state it is programmed to return.
If a plugin doesn't have a built-in timeout, that is where the global timeout is useful. When the nagios process runs a plugin that will not time out on its own, the global timeout setting stops it from running, and that is when the service_check_timeout_state option is used.
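For a plugin whose own timeout state cannot be changed, one possible workaround is a small wrapper that runs the plugin and remaps its timeout result to UNKNOWN. The sketch below is an illustration under stated assumptions, not part of Nagios or of any plugin: the name run_plugin is made up, the "Plugin timed out" text is taken from the log lines in this thread, and the exit codes are the standard plugin ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_plugin(cmd, timeout_seconds):
    """Run a plugin command and return (exit_code, output).

    Timeouts are reported as UNKNOWN, whether they come from this
    wrapper or from the plugin's own 'Plugin timed out' message.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return UNKNOWN, "UNKNOWN - wrapper timed out after %s seconds" % timeout_seconds
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - could not run plugin: %s" % exc

    output = proc.stdout.strip()
    code = proc.returncode
    # Assumption: the plugin signals its own timeout with this exact text.
    if code == CRITICAL and "Plugin timed out" in output:
        return UNKNOWN, output.replace("CRITICAL", "UNKNOWN", 1)
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        code = UNKNOWN
    return code, output


def main(argv):
    """Hypothetical entry point: wrapper.py <timeout_seconds> <plugin> [args...]"""
    if len(argv) < 3:
        print("UNKNOWN - usage: wrapper.py <timeout_seconds> <plugin> [args...]")
        return UNKNOWN
    code, text = run_plugin(argv[2:], float(argv[1]))
    print(text)
    return code
```

To use it as a check command, finish the script with sys.exit(main(sys.argv)) and keep the wrapper timeout below service_check_timeout so that the wrapper, rather than Nagios, decides the state.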