Hi Team,
In our environment, we use Nagios Core 3.5.0 to monitor all our assets. Our ESXi servers are configured in Nagios, and their services are monitored with the check_vmware_api.pl plugin.
We found that every service reports status changes correctly except Check_runtime_health. With this service, when there is a failure, the status does not change from OK to Warning/Critical. Instead, the error message appears in the status information while the status stays OK.
Because no Warning/Critical alert is raised, we do not learn of the failure unless we look at the status information.
The error looks like this:
CHECK_VMWARE_API.PL OK - 2 health issue(s) found in 371 checks
Kindly help us resolve this issue. We need an alert to be generated whenever there is any health issue on the server.
Check_VMWARE_runtime_health status change issue
Re: Check_VMWARE_runtime_health status change issue
Can you show us the full command definition associated with the service? You may just need to define a warning / critical threshold.
Former Nagios Employee
Re: Check_VMWARE_runtime_health status change issue
Thank you for your response.
The command definition is as follows:
/usr/local/nagios/libexec/check_vmware_api.pl -H $Hostaddress$ -f /home/nagios/.nagios_user -l runtime -s health
Re: Check_VMWARE_runtime_health status change issue
Got it.
It looks like the plugin supports a warning / critical threshold. What happens if you alter your command to include a -w and -c?
[ -w <warn_range> ] [ -c <crit_range> ]
Former Nagios Employee
Re: Check_VMWARE_runtime_health status change issue
Thank you for your response!
Warning and critical thresholds can be applied to the command, but there are two issues.
1. Nagios needs to report any failure at all. With thresholds, we would have to set the warning to the number of health checks minus 1 and the critical to the number of health checks minus 2.
For example:
CHECK_VMWARE_API.PL OK - All 331 health checks are GREEN
For the above service, 331 health checks give OK, which means 330 should be warning and 329 should be critical (this might be a bad idea). Kindly suggest an alternative.
2. The number of health checks differs from server to server, so it is hard to create a separate service with different thresholds for each of the 100+ servers in our environment.
Re: Check_VMWARE_runtime_health status change issue
This is a single check that monitors many different things... You need to consider the number of "Alerts", not the number of "health checks" when setting up your warning and critical thresholds. For example, when I try a similar check, my output is:
CHECK_VMWARE_API.PL OK - All 212 health checks are GREEN: fan (5x); system (1x); CPU (2x); Cable/Interconnect (2x); Watchdog (4x); voltage (21x); Battery (3x); Processors (12x); Software Components (96x); Memory (1x); Storage (56x); power (7x); Chassis (1x); temperature (1x); | Alerts=0;;
This check should give you a "warning" if the number of alerts is greater than 1, and "critical" if the number of alerts is greater than 2. I believe you could set up these same thresholds for all of your checks. This way, you will be notified in case of a warning/critical issue.
In your case, you probably have Alerts=2, so you may use something like this:
/usr/local/nagios/libexec/check_vmware_api.pl -H $Hostaddress$ -f /home/nagios/.nagios_user -l runtime -s health -w 1 -c 2
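To make the threshold behaviour concrete, here is a minimal Python sketch of how the Alerts count would map to a state. It assumes the plugin applies the standard Nagios range convention, where a plain number N alerts when the value falls outside 0..N. The function names parse_alerts and classify are hypothetical and are not part of check_vmware_api.pl.

```python
import re

# Exit codes from the Nagios plugin development guidelines.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_alerts(plugin_output: str) -> int:
    """Extract the Alerts value from the perfdata after the '|' separator."""
    _, _, perfdata = plugin_output.partition("|")
    match = re.search(r"Alerts=(\d+)", perfdata)
    if match is None:
        raise ValueError("no Alerts perfdata found")
    return int(match.group(1))


def classify(alerts: int, warn: int, crit: int) -> int:
    """Assumed semantics: a plain threshold N alerts when the value is outside 0..N."""
    if alerts < 0 or alerts > crit:
        return CRITICAL
    if alerts < 0 or alerts > warn:
        return WARNING
    return OK


if __name__ == "__main__":
    sample = "CHECK_VMWARE_API.PL OK - All 212 health checks are GREEN | Alerts=0;;"
    print(classify(parse_alerts(sample), warn=1, crit=2))
```

Under this assumption, -w 1 -c 2 reports WARNING at two alerts and CRITICAL at three or more, while a single alert stays OK. The thresholds depend only on the Alerts count, not on the total number of health checks, so the same pair could be reused across all servers. If any health issue at all should raise an alert, lower values such as -w 0 -c 1 would be the equivalent under the same convention.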