check_openmanage return values to Nagios
Posted: Mon Jul 14, 2014 12:38 pm
Greetings,
Apologies if this question has been raised before; I searched through the discussion topics and did not find a similar thread.
The basic problem is that the check_openmanage plugin (found here: http://folk.uio.no/trondham/software/ch ... anage.html ), which we use to monitor Dell systems, reports a CRITICAL alert when the service check times out. I don't mind receiving an alert on a timeout, but I am concerned that it is classified as "CRITICAL." Whenever we get a timeout, we wait about 20 minutes and the service check recovers. Several of our servers are under heavy load, which as I understand it increases the likelihood of a timeout with this plugin.

Should this be a CRITICAL alert (i.e. one that requires immediate attention, 24x7)? We feel a timeout should return a different value (UNKNOWN), but I can't find a setting in check_openmanage to return UNKNOWN rather than CRITICAL. The website listed above provides usage details for the plugin, but as far as I can tell the only alerting thresholds it exposes are for temperature. I've also gone through the main script (check_openmanage) and changed the return value for timeouts, to no avail:
# Setting timeout
$SIG{ALRM} = sub {
    print "PLUGIN TIMEOUT: $NAME timed out after $opt{timeout} seconds\n";
    exit $E_WARNING; ##### this was originally set to $E_CRITICAL, but we're still getting critical alerts on our timeouts
};
The CRITICAL result might be originating from an SNMP timeout setting rather than from the script itself, yet our other SNMP checks return "UNKNOWN" for timeouts, not CRITICAL. If anyone has faced this problem before and knows a solution, please respond. We have grown tired of receiving critical alerts at 3 am for a timeout on the check_openmanage plugin.
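For reference, here is a minimal standalone sketch of the behaviour we are after: a check that hits its own timeout should report UNKNOWN (exit code 3 in the standard Nagios plugin convention of 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It uses the generic eval/alarm idiom rather than check_openmanage's own handler, and the names run_check_with_timeout and %STATUS, the 1-second timeout, and the simulated slow check are all made up for illustration. It only covers the plugin's own alarm; it says nothing about a timeout enforced elsewhere (by Nagios or by SNMP).

```perl
use strict;
use warnings;

# Standard Nagios plugin exit codes.
my %STATUS = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# Hypothetical helper: run $check (a code ref returning (code, message))
# and map a timeout to UNKNOWN instead of CRITICAL.
sub run_check_with_timeout {
    my ($timeout, $check) = @_;
    my @result = eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;          # arm the timer
        my @r = $check->();
        alarm 0;                 # check finished in time, disarm
        @r;
    };
    if ($@) {
        alarm 0;                                  # make sure the timer is off
        die $@ unless $@ eq "timeout\n";          # re-raise unrelated errors
        return ($STATUS{UNKNOWN},
                "PLUGIN TIMEOUT: check timed out after $timeout seconds");
    }
    return @result;
}

# Simulated slow check: sleeps 3 seconds, longer than the 1-second timeout.
my ($code, $message) = run_check_with_timeout(1, sub {
    select(undef, undef, undef, 3);
    return ($STATUS{OK}, "OK: check completed");
});

print "$message\n";
# A real plugin would finish with: exit $code;
```

If the plugin's own alarm were what produced our alerts, a change like this (or the edit to the handler shown earlier) should be enough, which is why we suspect the CRITICAL status is being set somewhere else.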
Thank you in advance