check_openmanage return values to Nagios

Posted: Mon Jul 14, 2014 12:38 pm
by murdoch_222
Greetings,

Apologies if this question was previously raised; I searched through the discussion topics and did not find a similar thread.

The basic problem we are facing is that the check_openmanage plugin (found here: http://folk.uio.no/trondham/software/ch ... anage.html ), which we use to monitor Dell systems, reports a critical alert when the service check times out. I don't mind receiving an alert on a timeout; however, I am concerned that this alert is classified as "CRITICAL." Whenever we receive a timeout, we wait for 20 minutes and the service check recovers. We have several servers under heavy load, which as I understand it can increase the likelihood of a timeout with this plugin. Given enough time, the service check will eventually recover. Should this be a CRITICAL alert (i.e., one that requires immediate attention 24x7)? Obviously, we feel this should return a different value (UNKNOWN), but I'm not finding how to change the setting on check_openmanage to return UNKNOWN rather than CRITICAL. The website listed above provides usage details for this plugin, and apparently only the temperature alerting thresholds can be adjusted, so far as I can tell. I've also gone through the main script (check_openmanage) used to perform the service check and changed the return value for timeouts, to no avail:

# Setting timeout
$SIG{ALRM} = sub {
    print "PLUGIN TIMEOUT: $NAME timed out after $opt{timeout} seconds\n";
    exit $E_WARNING; ##### this was originally set to $E_CRITICAL, but we're still getting critical alerts on our timeouts
};

This might be originating from an SNMP timeout setting rather than the script itself, yet our other SNMP checks return "UNKNOWN" for timeouts, not CRITICAL. If anyone has faced this problem before and knows a solution, please respond. We have grown tired of receiving critical alerts at 3 am for a timeout on the check_openmanage plugin :)

Thank you in advance

Re: check_openmanage return values to Nagios

Posted: Mon Jul 14, 2014 12:47 pm
by abrist
It all depends on what is catching the timeout. If the plugin's timeout is longer than the global timeout set in the nagios.cfg file, then the timeout will be caught by Nagios, and not by the plugin. You will most likely need to add some debug output to the script in order to find where the script is actually timing out.
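
To make the "who catches it" logic concrete, here is a rough Python sketch. It is only a model of the reasoning, not anything Nagios or check_openmanage actually ships: the function name timeout_result and the timeout values are made up for illustration, and treating equal timeouts as Nagios winning is a simplification (in practice that case is a race).

```python
# Standard Nagios plugin return codes, keyed by the one-letter values
# that service_check_timeout_state accepts in nagios.cfg.
STATES = {"o": (0, "OK"), "w": (1, "WARNING"), "c": (2, "CRITICAL"), "u": (3, "UNKNOWN")}


def timeout_result(plugin_timeout, nagios_timeout, plugin_exit_code, timeout_state="c"):
    """Model a check that never finishes on its own (hypothetical helper).

    plugin_timeout   -- seconds before the plugin's own ALRM handler fires
    nagios_timeout   -- service_check_timeout from nagios.cfg
    plugin_exit_code -- code the plugin's handler exits with (0-3)
    timeout_state    -- service_check_timeout_state from nagios.cfg (c, u, w, o)
    """
    labels = {code: label for code, label in STATES.values()}
    if plugin_timeout < nagios_timeout:
        # The plugin's handler runs first, so its exit code decides the state.
        return ("plugin", plugin_exit_code, labels[plugin_exit_code])
    # Nagios kills the check first; the plugin's handler never gets to run.
    # Assumption: equal timeouts are treated as Nagios winning.
    code, label = STATES[timeout_state]
    return ("nagios", code, label)


print(timeout_result(30, 60, 1))        # ('plugin', 1, 'WARNING')
print(timeout_result(120, 60, 1))       # ('nagios', 2, 'CRITICAL')
print(timeout_result(120, 60, 1, "u"))  # ('nagios', 3, 'UNKNOWN')
```

The second call is the situation to look for: the handler in the script exits with a warning, but it never fires, so the state comes from the Nagios side instead.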

Re: check_openmanage return values to Nagios

Posted: Tue Jul 15, 2014 9:41 am
by murdoch_222
Thank you abrist, you have now pointed me in the right direction. I completely overlooked the nagios.cfg file because for some reason we removed the service_check_timeout_state option from our configuration (hence, as I scanned through our version of this file, I did not see this option). Silly oversight. I'm convinced the alert is not coming from the check_openmanage plugin, as 1) I've already changed the exit status for a timeout to a warning, and 2) the format of the notification from the plugin does not match the email alert I am receiving. However, the default behavior of Nagios is to issue a critical alert for timeouts, leading me to believe (for now at least) that this must be the source of the error. I'm going to redo our Nagios configuration and see if that fixes it. I'll be sure to return and label this issue as solved, if indeed it fixes the problem.
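
For anyone finding this thread later, the nagios.cfg lines in question would look roughly like the fragment below. The 60-second value is only an example, not our actual setting; the state letters are the ones the option accepts.

```
# nagios.cfg (fragment)

# Seconds Nagios waits for a service check before killing it
# (example value, adjust to your environment).
service_check_timeout=60

# State reported when Nagios itself times out a service check:
# c = CRITICAL (default), u = UNKNOWN, w = WARNING, o = OK
service_check_timeout_state=u
```

Nagios needs to be restarted (or reloaded) after the change for it to take effect.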

Cheers-

Re: check_openmanage return values to Nagios

Posted: Tue Jul 15, 2014 9:46 am
by abrist
Good deal. I wish you the best of luck in the hunt. We will leave this thread open for the time being.