Don't send email when error is "Unknown Error"

Post by **borrierulez** » Tue Jan 31, 2017 9:18 am

Hey everyone,

I'm using the check_esx plugin to monitor ESX hosts. It works fine, but about 10 times per day I get false alerts: CHECK_ESX CRITICAL - HOST IO Unknown error

I don't know how to fix the plugin (probably a bug), but I'd like a way to tell Nagios to ignore this result and, above all, not send emails for it. How can I do this?

Borrie

Post by **rkennedy** » Tue Jan 31, 2017 2:39 pm

I would look at using the negate plugin as a wrapper around the check, which lets you define the exit code to use for an UNKNOWN state.

For example -

Code:

[root@centos7 libexec]# ./negate -u OK ./check_dummy 3
UNKNOWN
[root@centos7 libexec]# echo $?
0

The 3 passed to check_dummy tells it to exit with code 3 (UNKNOWN). With -u OK, negate remaps that UNKNOWN result to OK, which is why echo $? prints 0.

Post by **borrierulez** » Thu Feb 02, 2017 2:03 pm

RKennedy,

Thank you for your answer!

I've put it in like this:

command_line $USER1$/negate -u OK $USER1$/check_esx -H $HOSTADDRESS$ -u $USER11$ -p $USER12$ -l cpu -s usage -w $ARG1$ -c $ARG2$ -t 90 3

But still get these: CHECK_ESX CRITICAL - HOST CPU Unknown error

Am I doing something wrong?

Post by **rkennedy** » Thu Feb 02, 2017 2:13 pm

The error message will still say unknown, but the status in Nagios should change to be OK.

Ah - taking a step back here, it looks like the plugin may be exiting with a CRITICAL state, and "Unknown error" is only text in the message. What state is currently reported in Nagios?
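One way to check this outside of Nagios is to run the plugin by hand and print the exit code next to its output. Below is a minimal sketch; `report_state` is a made-up helper name, not something shipped with Nagios or check_esx, and it only relies on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/bash
# report_state is a hypothetical helper (an assumption for illustration).
# It runs any plugin command line and prints the exit code, the state
# name that corresponds to that code, and the plugin's own output.
report_state() {
    local output status=0 state
    # Capture stdout and stderr; keep the exit code without aborting.
    output=$("$@" 2>&1) || status=$?
    case "$status" in
        0) state="OK" ;;
        1) state="WARNING" ;;
        2) state="CRITICAL" ;;
        3) state="UNKNOWN" ;;
        *) state="INVALID" ;;   # outside the documented 0-3 range
    esac
    printf 'exit=%s state=%s output=%s\n' "$status" "$state" "$output"
}
```

Called as `report_state ./check_esx -H <host> -u <user> -p <pass> -l cpu -s usage`, a line starting with `exit=2 state=CRITICAL` would mean that `negate -u OK` has nothing to act on, because the plugin never returned 3.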

Post by **borrierulez** » Fri Feb 03, 2017 5:05 am

It is indeed: State: CRITICAL

Post by **rkennedy** » Fri Feb 03, 2017 10:58 am

This will be difficult to mitigate without a wrapper script that specifically looks for that text and flips the state.
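A rough sketch of what such a wrapper could look like is below. The function name, the argument order, and the choice of replacement state are all assumptions made for illustration; none of it comes from check_esx or Nagios, and it has not been run against a real ESX host.

```shell
#!/bin/bash
# Hypothetical wrapper sketch (names and behavior are assumptions).
# Usage: flip_unknown_error <new_exit_code> <plugin> [plugin args...]
flip_unknown_error() {
    local new_status="$1" output status=0
    shift
    # Run the real plugin, keeping both its output and its exit code.
    output=$("$@" 2>&1) || status=$?
    # Pass the plugin's text through unchanged so it still shows in Nagios.
    printf '%s\n' "$output"
    # Only a CRITICAL result whose text contains "Unknown error" is
    # remapped; every other result keeps its original exit code.
    if [ "$status" -eq 2 ]; then
        case "$output" in
            *"Unknown error"*) return "$new_status" ;;
        esac
    fi
    return "$status"
}
```

Saved as a script (say `flip_unknown_error.sh`, a made-up name) with `flip_unknown_error "$@"` as its last line, it would sit in front of the plugin in command_line the same way negate does, for example `$USER1$/flip_unknown_error.sh 0 $USER1$/check_esx ...`. Passing 3 instead of 0 would report UNKNOWN rather than OK; leaving `u` out of the service's notification_options should then keep the emails away while the glitch stays visible in the interface.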

What is the full output of check_esx -h? I'm trying to figure out which version you're using, and which plugin.

The one I'm looking at, check_esx3, has a feature built in for retries -

Code:

    print "    -R retries: # of retries ([0..20]) for individual SNMP queries\n";

I'm wondering if you could use this to mitigate the unknown errors. The other option is increasing max_check_attempts, which gives a real result more time to come through before a false positive triggers a notification.

Post by **borrierulez** » Mon Feb 06, 2017 3:48 am

I already raised max_check_attempts.

The plugin is check_esx 0.5.0; here is the help output:
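If the plugin in use turns out not to have a retry option of its own, the same idea could be approximated outside the plugin. The sketch below is only that: `retry_on_unknown_error` and its arguments are invented for illustration, and it simply re-runs the command while the output contains "Unknown error".

```shell
#!/bin/bash
# Hypothetical retry sketch (an assumption, not an option of check_esx).
# Usage: retry_on_unknown_error <max_tries> <delay_seconds> <plugin> [args...]
retry_on_unknown_error() {
    local max_tries="$1" delay="$2" try=1 output status
    shift 2
    while :; do
        status=0
        output=$("$@" 2>&1) || status=$?
        # Any output without "Unknown error" is treated as a real result.
        case "$output" in
            *"Unknown error"*) ;;
            *) break ;;
        esac
        # Give up once the allowed number of tries is used.
        if [ "$try" -ge "$max_tries" ]; then
            break
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    # Report the last attempt, whatever it was.
    printf '%s\n' "$output"
    return "$status"
}
```

Each retry adds the plugin's run time plus the delay, so the worst case (max_tries runs in a row) has to stay below the service check timeout configured in Nagios, otherwise the check is killed before it can report.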

Code:

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

VMWare Infrastructure plugin

Usage: check_esx -D <data_center> | -H <host_name> [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ]
    [ -x <black_list> ] [ -o <additional_options> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See https://www.monitoring-plugins.org/doc/extra-opts.html
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -C, --cluster=<clustername>
   ESX or ESXi clustername.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -x, --exclude=<black_list>
   Specify black list
 -o, --options=<additional_options>
   Specify additional command options
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ means blank or not specified parameter) :
    Common options for VM, Host and DC :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info
        * io - shows disk io info
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            ^ all disk io info
        * runtime - shows runtime info
            + status - overall host status (gray/green/red/yellow)
            + issues - all issues for the host
            ^ all runtime info
    VM specific :
        * cpu - shows cpu info
            + wait - CPU wait time in ms
            + ready - CPU ready time in ms
        * mem - shows mem info
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + active - active mem usage in MB
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMWare Tools status
    Host specific :
        * net - shows net info
            + nic - makes sure all active NICs are plugged in
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status
            + maintenance - shows whether host is in maintenance mode
            + list(vm) - list of VMWare machines and their statuses
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
        * storage - shows Host storage info
            + adapter - list bus adapters
            + lun - list SCSI logical units
            + path - list logical unit paths
    DC specific :
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + list(vm) - list of VMWare machines and their statuses
            + listhost - list of VMWare esx host servers and their statuses
            + tools - VMWare Tools status
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations


Copyright (c) 2008 op5

Post by **rkennedy** » Mon Feb 06, 2017 11:57 am

To get the behavior you're after, you'll need to modify the plugin itself, since it defaults to CRITICAL when it hits an unknown error. Take a look at this part for example -

Code:

sub host_cpu_info
{
	my ($host, $np, $subcommand, $addopts) = @_;

	my $res = CRITICAL;
	my $output = 'HOST CPU Unknown error';

You would need to change $res and $result in certain places to be equal to OK so that the plugin exits with the state you want. (This is just at a glance; I did not dive deeply into the code or do further testing. This is a bit beyond what we generally provide support for.)

Post by **tmcdonald** » Wed Mar 01, 2017 2:14 pm
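To find the candidate places, something like the following could list every line where a default of CRITICAL is assigned. It is a sketch only: `find_critical_defaults` is a made-up name, and the pattern assumes the other subroutines declare the variable the same way as the excerpt above, which I have not verified.

```shell
#!/bin/bash
# Hypothetical helper (an assumption for illustration): print, with line
# numbers, the lines of a plugin file that assign CRITICAL as the default
# value of $res or $result.
find_critical_defaults() {
    grep -nE 'my \$(res|result) = CRITICAL;' "$1"
}
```

Running `find_critical_defaults check_esx` against a copy of the plugin gives the line numbers to review; editing a copy rather than the installed file keeps the original available for comparison.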

Just checking in since we have not heard from you in a while. Did @rkennedy's post clear things up or has the issue otherwise been resolved?

Nagios Support Forum
