Don't send email when error is "Unknown Error"
-
- Posts: 24
- Joined: Wed Mar 16, 2016 4:59 am
Don't send email when error is "Unknown Error"
Hey everyone,
I'm using the check_esx plugin to monitor ESX hosts. It works fine, but 10 times per day I get false alerts: CHECK_ESX CRITICAL - HOST IO Unknown error
I don't know how to fix the plugin (probably a bug), but I'd like a way to tell Nagios to ignore this error and, above all, not send emails for it. How can I do this?
Borrie
Re: Don't send email when error is "Unknown Error"
I would look at using the negate plugin as a wrapper around your check. It lets you redefine the exit code for an UNKNOWN state.
For example -
Code: Select all
[root@centos7 libexec]# ./negate -u OK ./check_dummy 3
UNKNOWN
[root@centos7 libexec]# echo $?
0
The 3 passed to check_dummy tells it to exit with a 3 (UNKNOWN) state. With negate's -u OK option, we tell it to flip UNKNOWN to OK.
Former Nagios Employee
Re: Don't send email when error is "Unknown Error"
RKennedy,
Thank you for your answer!
I've put it in like this:
command_line $USER1$/negate -u OK $USER1$/check_esx -H $HOSTADDRESS$ -u $USER11$ -p $USER12$ -l cpu -s usage -w $ARG1$ -c $ARG2$ -t 90 3
But I still get these: CHECK_ESX CRITICAL - HOST CPU Unknown error
Am I doing something wrong?
Re: Don't send email when error is "Unknown Error"
The error message will still say unknown, but the status in Nagios should change to OK.
Ah, taking a step back here: it looks like the plugin may be exiting with a CRITICAL state, and "Unknown error" is only part of the message text. In that case negate -u has nothing to flip. What state is currently reported in Nagios?
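One way to confirm which state the plugin really returns is to look at its exit code rather than its text, since Nagios maps exit codes 0/1/2/3 to OK/WARNING/CRITICAL/UNKNOWN. A minimal sketch follows; fake_check_esx is a hypothetical stand-in that only imitates the output quoted above, so the example runs without a real ESX host.

```shell
#!/bin/sh
# Stand-in for the real plugin (hypothetical): prints the message seen in this
# thread and exits 2, which Nagios treats as CRITICAL.
fake_check_esx() {
    echo "CHECK_ESX CRITICAL - HOST CPU Unknown error"
    return 2
}

# Translate a plugin exit code into the Nagios state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNKNOWN (out-of-range exit code $1)" ;;
    esac
}

# Capture both the text and the exit code of the check.
code=0
output=$(fake_check_esx) || code=$?

echo "output: $output"
echo "exit code: $code ($(state_name "$code"))"
```

If the real plugin behaves like the stand-in, the text mentions "Unknown error" while the exit code is 2, and negate -u OK only acts on exit code 3.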
Former Nagios Employee
Re: Don't send email when error is "Unknown Error"
It is indeed: State: CRITICAL
Re: Don't send email when error is "Unknown Error"
This will be difficult to mitigate without a wrapper script that is specifically identifying that text, and flipping the state.
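A rough sketch of what such a wrapper could look like is shown here. It is untested against the real check_esx, the function name remap_unknown_error is made up for illustration, and it assumes the false alerts always contain the literal text "Unknown error". Instead of forcing OK, it reports UNKNOWN (exit 3), so real CRITICAL results are left alone.

```shell
#!/bin/sh
# Sketch of a wrapper: run the wrapped check and, if its output contains
# "Unknown error", report UNKNOWN (exit 3) instead of the plugin's own state.
# Assumption: the false alerts always contain that exact text.

remap_unknown_error() {
    # Run the wrapped command, capturing its output and exit code.
    code=0
    output=$("$@" 2>&1) || code=$?

    # Only the state changes; the original message is passed through.
    case "$output" in
        *"Unknown error"*) code=3 ;;
    esac

    printf '%s\n' "$output"
    return "$code"
}

# When saved as a script and called with a plugin command line, wrap it.
if [ "$#" -gt 0 ]; then
    remap_unknown_error "$@"
    exit $?
fi
```

Saved under a hypothetical name such as ignore_unknown_error.sh, it would go in front of the existing command, for example $USER1$/ignore_unknown_error.sh $USER1$/check_esx -H $HOSTADDRESS$ and so on. Once the state is a real UNKNOWN, negate -u OK can flip it, or the u option can be left out of the service's notification_options so that UNKNOWN results do not send email.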
What is the full output of check_esx -h? I'm trying to figure out which version you're using, and which plugin.
The one I'm looking at, check_esx3, has a feature built in for retries -
Code: Select all
print " -R retries: # of retries ([0..20]) for individual SNMP queries\n";
I'm wondering if you could use this to mitigate the unknown errors. The other option is to increase max_check_attempts, which gives a real check result more time to come through instead of a false positive.
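To illustrate the max_check_attempts suggestion, here is a sketch of a service definition. The host name, service description, command name, and thresholds are placeholders, and it assumes a generic-service template like the one in the sample configuration is available.

```
# Hypothetical service definition; names and thresholds are placeholders.
define service {
    use                     generic-service   ; assumed template
    host_name               esx-host-01
    service_description     ESX CPU Usage
    check_command           check_esx_cpu!80!90
    check_interval          5
    retry_interval          1
    max_check_attempts      5
}
```

With these values and the default interval length of 60 seconds, a non-OK result is rechecked every minute, and a notification only goes out if all 5 attempts in a row are non-OK. A single false "Unknown error" followed by a good check never reaches a HARD state.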
Former Nagios Employee
Re: Don't send email when error is "Unknown Error"
I've already raised the max check attempts.
check_esx 0.5.0
Code: Select all
This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).
VMWare Infrastructure plugin
Usage: check_esx -D <data_center> | -H <host_name> [ -N <vm_name> ]
-u <user> -p <pass> | -f <authfile>
-l <command> [ -s <subcommand> ]
[ -x <black_list> ] [ -o <additional_options> ]
[ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
[ -V ] [ -h ]
-?, --usage
Print usage information
-h, --help
Print detailed help screen
-V, --version
Print version information
--extra-opts=[section][@file]
Read options from an ini file. See https://www.monitoring-plugins.org/doc/extra-opts.html
for usage and examples.
-H, --host=<hostname>
ESX or ESXi hostname.
-C, --cluster=<clustername>
ESX or ESXi clustername.
-D, --datacenter=<DCname>
Datacenter hostname.
-N, --name=<vmname>
Virtual machine name.
-u, --username=<username>
Username to connect with.
-p, --password=<password>
Password to use with the username.
-f, --authfile=<path>
Authentication file with login and password. File syntax :
username=<login>
password=<password>
-w, --warning=THRESHOLD
Warning threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-c, --critical=THRESHOLD
Critical threshold. See
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the threshold format.
-l, --command=COMMAND
Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
-s, --subcommand=SUBCOMMAND
Specify subcommand
-S, --sessionfile=SESSIONFILE
Specify a filename to store sessions for faster authentication
-x, --exclude=<black_list>
Specify black list
-o, --options=<additional_options>
Specify additional command options
-t, --timeout=INTEGER
Seconds before plugin times out (default: 30)
-v, --verbose
Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ means blank or not specified parameter) :
Common options for VM, Host and DC :
* cpu - shows cpu info
+ usage - CPU usage in percentage
+ usagemhz - CPU usage in MHz
^ all cpu info
* mem - shows mem info
+ usage - mem usage in percentage
+ usagemb - mem usage in MB
+ swap - swap mem usage in MB
+ overhead - additional mem used by VM Server in MB
+ overall - overall mem used by VM Server in MB
+ memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
^ all mem info
* net - shows net info
+ usage - overall network usage in KBps(Kilobytes per Second)
+ receive - receive in KBps(Kilobytes per Second)
+ send - send in KBps(Kilobytes per Second)
^ all net info
* io - shows disk io info
+ read - read latency in ms (totalReadLatency.average)
+ write - write latency in ms (totalWriteLatency.average)
^ all disk io info
* runtime - shows runtime info
+ status - overall host status (gray/green/red/yellow)
+ issues - all issues for the host
^ all runtime info
VM specific :
* cpu - shows cpu info
+ wait - CPU wait time in ms
+ ready - CPU ready time in ms
* mem - shows mem info
+ swapin - swapin mem usage in MB
+ swapout - swapout mem usage in MB
+ active - active mem usage in MB
* io - shows disk I/O info
+ usage - overall disk usage in MB/s
* runtime - shows runtime info
+ con - connection state
+ cpu - allocated CPU in MHz
+ mem - allocated mem in MB
+ state - virtual machine state (UP, DOWN, SUSPENDED)
+ consoleconnections - console connections to VM
+ guest - guest OS status, needs VMware Tools
+ tools - VMWare Tools status
Host specific :
* net - shows net info
+ nic - makes sure all active NICs are plugged in
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
^ all datastore info
* runtime - shows runtime info
+ con - connection state
+ health - checks cpu/storage/memory/sensor status
+ maintenance - shows whether host is in maintenance mode
+ list(vm) - list of VMWare machines and their statuses
* service - shows Host service info
+ (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
^ show all services
* storage - shows Host storage info
+ adapter - list bus adapters
+ lun - list SCSI logical units
+ path - list logical unit paths
DC specific :
* io - shows disk io info
+ aborted - aborted commands count
+ resets - bus resets count
+ kernel - kernel latency in ms
+ device - device latency in ms
+ queue - queue latency in ms
* vmfs - shows Datastore info
+ (name) - free space info for datastore with name (name)
^ all datastore info
* runtime - shows runtime info
+ list(vm) - list of VMWare machines and their statuses
+ listhost - list of VMWare esx host servers and their statuses
+ tools - VMWare Tools status
* recommendations - shows recommendations for cluster
+ (name) - recommendations for cluster with name (name)
^ all clusters recommendations
Copyright (c) 2008 op5
Last edited by tmcdonald on Mon Feb 06, 2017 11:33 am, edited 1 time in total.
Reason: Please use [code][/code] tags around long output
Re: Don't send email when error is "Unknown Error"
You'll need to modify the plugin to get what you're after, since the plugin's default result for an unknown error is CRITICAL. Take a look at this part for example -
Code: Select all
sub host_cpu_info
{
    my ($host, $np, $subcommand, $addopts) = @_;
    my $res = CRITICAL;
    my $output = 'HOST CPU Unknown error';
You would need to change $res and $result in certain places to be equal to OK, so that it can exit properly. (This is just at a glance. I did not dive deeply into the code or do further testing, and this is a bit beyond what we generally provide support for.)
Former Nagios Employee
Re: Don't send email when error is "Unknown Error"
Just checking in since we have not heard from you in a while. Did @rkennedy's post clear things up or has the issue otherwise been resolved?
Former Nagios employee