
check_esx3-0.5.pl behavior

Posted: Tue Mar 15, 2016 5:11 pm
by gormank
I have a number of ESXi hosts and run various checks against them using this script. My issue is that when a check times out, the script raises a CRITICAL alert, when to me it should be UNKNOWN. It was a timeout, not an actual problem.

The other problem is that when the networking check times out, it returns CRITICAL, with "CRITICAL" in the message, but the text says "all 8 NICs are connected," which is misleading, since what actually happened is that the check timed out.

/usr/local/nagios/libexec/check_esx3-0.5.pl -H "cocsm2mvesx002" -t 45 -f "/usr/local/nagiosxi/etc/components/vmware/vmware_auth.txt" -l NET

CHECK_ESX3-0.5.PL CRITICAL - all 8 NICs are connected

I can't force all criticals to warnings because that will hide real issues.
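One way to turn only timeouts into UNKNOWN, while leaving real CRITICALs alone, is a small wrapper around the plugin. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed; `run_with_unknown_timeout` is a hypothetical name, not part of check_esx3 or Nagios. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and on `timeout` exiting with 124 when it has to stop the command.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of check_esx3 or Nagios): run a plugin under
# GNU coreutils `timeout` and report a timeout as UNKNOWN instead of CRITICAL.
# Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# Usage: run_with_unknown_timeout SECONDS PLUGIN [PLUGIN_ARGS...]
run_with_unknown_timeout() {
    limit=$1
    shift
    rc=0
    # `timeout` exits with 124 when it had to stop the command;
    # the `|| rc=$?` form keeps this safe under `set -e`.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "UNKNOWN - check timed out after ${limit}s"
        return 3
    fi
    # Any other status (including a genuine CRITICAL) is passed through unchanged.
    return "$rc"
}
```

For the NET check above, the wrapper's limit would presumably need to be shorter than the plugin's own `-t 45` so that the wrapper fires first, e.g. `run_with_unknown_timeout 40 /usr/local/nagios/libexec/check_esx3-0.5.pl -H "cocsm2mvesx002" -t 45 -f "/usr/local/nagiosxi/etc/components/vmware/vmware_auth.txt" -l NET`. To call it from a Nagios command definition, the function would have to live in its own executable script. Because only exit status 124 is rewritten, real CRITICAL results are not hidden.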

Re: check_esx3-0.5.pl behavior

Posted: Tue Mar 15, 2016 6:24 pm
by Box293
I know that the check_esx3.pl script has a few bugs. The developers have since renamed it check_vmware_api.

It can be downloaded from here:

http://git.op5.org/p/system-addons/plug ... re_api.git

I think this is the actual URL of the plugin:

I believe you can just rename it to check_esx3.pl and it should slot right in.
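If you try the rename, it may be worth keeping the old script around so the change is easy to roll back. The following is a minimal sketch; `swap_in_plugin` is a hypothetical helper, and whether check_vmware_api.pl really accepts the same arguments as check_esx3.pl is an assumption to verify before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: install a replacement plugin under the name the existing
# Nagios command definitions already call, keeping the old script as a backup.
# Usage: swap_in_plugin NEW_SCRIPT TARGET_PATH
swap_in_plugin() {
    new=$1
    target=$2
    # Keep a copy of the current plugin (as TARGET_PATH.orig) for rollback.
    if [ -e "$target" ]; then
        cp -p "$target" "$target.orig" || return 1
    fi
    # Copy the new plugin into place and make sure Nagios can execute it.
    cp "$new" "$target" || return 1
    chmod 755 "$target"
}
```

For example: `swap_in_plugin check_vmware_api.pl /usr/local/nagios/libexec/check_esx3.pl`.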

Once you've done that, does it fix your NIC issue?

Re: check_esx3-0.5.pl behavior

Posted: Wed Mar 16, 2016 10:06 am
by gormank
My Nagios system had a birthday, so my subscriptions had expired.
I eventually found check_vmware_api, though not at that link, and it is much newer. However, the copy I found was in a fork of check_vmware_api, so I think I'll keep looking.

How would fixing the NIC issue fix the general misbehavior of the script when it times out?

Re: check_esx3-0.5.pl behavior

Posted: Wed Mar 16, 2016 3:00 pm
by gormank
Following this http://git.op5.org/p/system-addons/plug ... re_api.git leads to an empty list.
Using the download link https://exchange.nagios.org/directory/P ... pi/details leads to a not found message.
Google leads me to https://github.com/BaldMansMojo/check_vmware_esx, which is a fork of check_vmware_api.

I finally found this, which seems to be the original, non-forked version.
http://git.op5.org/gitweb?p=system-addo ... ;a=summary

OK, it's running with the newer version, and I'll wait to see what it reports.

Thanks!

Re: check_esx3-0.5.pl behavior

Posted: Wed Mar 16, 2016 3:43 pm
by lmiltchev
Are you having better luck with the "check_vmware_api.pl" plugin? What's the status on timeouts?

Re: check_esx3-0.5.pl behavior

Posted: Wed Mar 16, 2016 4:01 pm
by gormank
I checked the log, and for some reason today is a slow day for timeouts: ~30 so far today (UTC), vs. >100 yesterday. I see the timeouts themselves as a network problem, but the plugin is (or was) returning the wrong status for them.

On my twin site w/ the same Nagios config, the counts are zero.

I haven't seen anything since starting to use the new script. Have to wait a bit more, but it looks promising...

# grep "SERVICE ALERT" /usr/local/nagios/var/archives/nagios-03-16-2016-00.log | grep -v OK | grep -i esx | cut -c 1-132 | perl -pe 's/(\d+)/localtime($1)/e' | grep -v 127 | wc -l
111

# grep "SERVICE ALERT" /usr/local/nagios/var/nagios.log | grep -v OK | grep -i esx | cut -c 1-132 | perl -pe 's/(\d+)/localtime($1)/e' | grep -v 127 | wc -l
32
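To see whether the new plugin actually moves timeouts from CRITICAL to UNKNOWN, it can help to break the non-OK ESX alerts down by state rather than just counting them. The following is a sketch assuming the standard Nagios log line format `[epoch] SERVICE ALERT: host;service;STATE;STATETYPE;attempt;output`; the function name is hypothetical, and unlike the commands above it does not apply the `grep -v 127` filter.

```shell
#!/bin/sh
# Hypothetical helper: count non-OK SERVICE ALERT lines mentioning "esx"
# (case-insensitive) in a Nagios log file, grouped by state.
# Assumed line format: [epoch] SERVICE ALERT: host;service;STATE;STATETYPE;attempt;output
count_esx_alerts_by_state() {
    grep 'SERVICE ALERT' "$1" | grep -i esx |
        # Field 3 (split on ';') is the service state, e.g. CRITICAL or UNKNOWN.
        awk -F';' '$3 != "OK" { n[$3]++ } END { for (s in n) print s, n[s] }' |
        sort
}
```

For example: `count_esx_alerts_by_state /usr/local/nagios/var/nagios.log`.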

Re: check_esx3-0.5.pl behavior

Posted: Thu Mar 17, 2016 8:39 am
by lmiltchev
Both plugins (check_esx3.pl & check_vmware_api.pl) work fine on my test box but I don't have any timeouts. :)
Let us know how it goes. We will keep this thread open.

Re: check_esx3-0.5.pl behavior

Posted: Thu Mar 17, 2016 10:46 am
by gormank
The timeouts are due to a larger problem. Try to focus on the issue described.

Re: check_esx3-0.5.pl behavior

Posted: Thu Mar 17, 2016 11:49 am
by tmcdonald
The issue described is that the plugin, when timing out, is exiting with Critical instead of Unknown, which you suggest would be a better exit status for a timeout. Please correct me if I am wrong.

If this is correct, then from a Nagios standpoint you can take a look at this option in your nagios.cfg:

https://assets.nagios.com/downloads/nag ... eout_state

However, this will only apply to a service that hits the global timeout (defaults to 60 seconds, defined by the service_check_timeout parameter). If the plugin itself has timeout logic built in, then you will need to contact the original author of that plugin for a fix.
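Nagios Core 4.x provides a service_check_timeout_state option in nagios.cfg, which appears to be what the linked page describes. The following is a sketch of the relevant lines, assuming Nagios Core 4.x (the option is not available in older 3.x releases); Nagios has to be restarted after editing.

```
# nagios.cfg excerpt (Nagios Core 4.x)
# Seconds a service check may run before Nagios kills it (default 60).
service_check_timeout=60
# State reported when a check exceeds service_check_timeout:
# c = CRITICAL (default), u = UNKNOWN, w = WARNING, o = OK
service_check_timeout_state=u
```

Note that the NET check in the first post runs the plugin with -t 45, below the 60-second global limit, so there the plugin's own timeout would presumably fire first and this setting would not come into play.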