Don't send email when error is "Unknown Error"

Posted: Tue Jan 31, 2017 9:18 am
by borrierulez
Hey everyone,

I'm using the check_esx plugin to monitor ESX hosts. It works fine, but about 10 times per day I get false alerts: CHECK_ESX CRITICAL - HOST IO Unknown error

I don't know how to fix the plugin (probably a bug), but I'd like a way to tell Nagios to ignore this result, and above all not to send emails for it. How can I do this?

Borrie

Re: Don't send email when error is "Unknown Error"

Posted: Tue Jan 31, 2017 2:39 pm
by rkennedy
I would look at using the negate plugin as a wrapper around the check; it lets you define the exit code that is returned for an unknown state.

For example -

Code:

[root@centos7 libexec]# ./negate -u OK ./check_dummy 3
UNKNOWN
[root@centos7 libexec]# echo $?
0
The 3 passed to check_dummy tells it to exit with state 3 (UNKNOWN). With negate, -u OK remaps that UNKNOWN exit to OK, which is why echo $? prints 0.

Re: Don't send email when error is "Unknown Error"

Posted: Thu Feb 02, 2017 2:03 pm
by borrierulez
RKennedy,

Thank you for your answer!

I've put it in like this:

command_line $USER1$/negate -u OK $USER1$/check_esx -H $HOSTADDRESS$ -u $USER11$ -p $USER12$ -l cpu -s usage -w $ARG1$ -c $ARG2$ -t 90 3

But still get these: CHECK_ESX CRITICAL - HOST CPU Unknown error

Am I doing something wrong?

Re: Don't send email when error is "Unknown Error"

Posted: Thu Feb 02, 2017 2:13 pm
by rkennedy
The error message will still say unknown, but the status in Nagios should change to be OK.

Ah - taking a step back here, it looks like the plugin is probably exiting with a CRITICAL state. "Unknown error" is only text in the message, not an UNKNOWN exit code, so negate -u has nothing to remap. What state is currently reported in Nagios?
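
To illustrate why negate would have no effect in that case, here is a small sketch. It assumes the plugin behaves as suspected (message says "Unknown error", exit code is 2). Both functions are hypothetical: unknown_to_ok is a simplified stand-in for "negate -u OK", not the real negate, and fake_check_esx only imitates the reported output.

```shell
#!/bin/sh
# Simplified stand-in for "negate -u OK": only exit code 3 (UNKNOWN)
# is remapped to 0 (OK); every other exit code passes through unchanged.
# Illustration only, this is not the real negate plugin.
unknown_to_ok() {
    "$@"
    status=$?
    if [ "$status" -eq 3 ]; then
        return 0
    fi
    return "$status"
}

# Hypothetical plugin imitating the reported behaviour: the message
# mentions "Unknown error" but the exit code is 2 (CRITICAL).
fake_check_esx() {
    echo "CHECK_ESX CRITICAL - HOST CPU Unknown error"
    return 2
}

unknown_to_ok fake_check_esx
echo "exit code: $?"
```

Under these assumptions the exit code stays 2, so Nagios would still show CRITICAL even though the text contains the word "Unknown".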

Re: Don't send email when error is "Unknown Error"

Posted: Fri Feb 03, 2017 5:05 am
by borrierulez
It is indeed: State: CRITICAL

Re: Don't send email when error is "Unknown Error"

Posted: Fri Feb 03, 2017 10:58 am
by rkennedy
This will be difficult to mitigate without a wrapper script that specifically looks for that text and flips the state.
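
A minimal sketch of such a wrapper is below. It is untested against check_esx itself. The function name remap_unknown_error and the choice of remapping to 3 (UNKNOWN) rather than 0 (OK) are assumptions for illustration; the only thing taken from this thread is the "Unknown error" text it matches on.

```shell
#!/bin/sh
# Hypothetical wrapper: run a plugin and, if its output contains the
# placeholder text "Unknown error", return a different state instead.

# Usage: remap_unknown_error STATE COMMAND [ARGS...]
# STATE is the exit code to return on a match (0 = OK, 3 = UNKNOWN).
remap_unknown_error() {
    remap_state=$1
    shift

    # Capture the plugin's output and its exit status.
    output=$("$@" 2>&1)
    status=$?

    # Pass the plugin's text through so Nagios still displays it.
    printf '%s\n' "$output"

    # Only results carrying the placeholder message are remapped.
    case $output in
        *"Unknown error"*) return "$remap_state" ;;
    esac
    return "$status"
}

# When called with a command line (as Nagios would do), wrap it.
# Remapping to 3 (UNKNOWN) is an assumed choice; use 0 for OK.
if [ "$#" -gt 0 ]; then
    remap_unknown_error 3 "$@"
    exit $?
fi
```

Saved under a name of your choosing in the plugin directory, it would go in front of the existing command_line the same way negate does, followed by the full check_esx command and its arguments. If the state is remapped to UNKNOWN rather than OK, leaving u out of the service's notification_options should stop the emails while genuine CRITICAL results still notify.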

What is the full output of check_esx -h? I'm trying to figure out which version you're using, and which plugin.

The one I'm looking at, check_esx3, has a feature built in for retries -

Code:

    print "    -R retries: # of retries ([0..20]) for individual SNMP queries\n";
I'm wondering if you could use this to mitigate the unknown errors. The other option is to increase your max_check_attempts, so that a real result has more chances to come through before a false positive turns into a hard state.

Re: Don't send email when error is "Unknown Error"

Posted: Mon Feb 06, 2017 3:48 am
by borrierulez
I've already raised max_check_attempts.

check_esx 0.5.0

Code:

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

VMWare Infrastructure plugin

Usage: check_esx -D <data_center> | -H <host_name> [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ]
    [ -x <black_list> ] [ -o <additional_options> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See https://www.monitoring-plugins.org/doc/extra-opts.html
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -C, --cluster=<clustername>
   ESX or ESXi clustername.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -x, --exclude=<black_list>
   Specify black list
 -o, --options=<additional_options>
   Specify additional command options
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ means blank or not specified parameter) :
    Common options for VM, Host and DC :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info
        * io - shows disk io info
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            ^ all disk io info
        * runtime - shows runtime info
            + status - overall host status (gray/green/red/yellow)
            + issues - all issues for the host
            ^ all runtime info
    VM specific :
        * cpu - shows cpu info
            + wait - CPU wait time in ms
            + ready - CPU ready time in ms
        * mem - shows mem info
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + active - active mem usage in MB
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMWare Tools status
    Host specific :
        * net - shows net info
            + nic - makes sure all active NICs are plugged in
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status
            + maintenance - shows whether host is in maintenance mode
            + list(vm) - list of VMWare machines and their statuses
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
        * storage - shows Host storage info
            + adapter - list bus adapters
            + lun - list SCSI logical units
            + path - list logical unit paths
    DC specific :
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + list(vm) - list of VMWare machines and their statuses
            + listhost - list of VMWare esx host servers and their statuses
            + tools - VMWare Tools status
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations


Copyright (c) 2008 op5

Re: Don't send email when error is "Unknown Error"

Posted: Mon Feb 06, 2017 11:57 am
by rkennedy
To get what you're after you'll need to modify the plugin, since it defaults to CRITICAL when it reports an unknown error. Take a look at this part for example -

Code:

sub host_cpu_info
{
	my ($host, $np, $subcommand, $addopts) = @_;

	my $res = CRITICAL;
	my $output = 'HOST CPU Unknown error';
You would need to change $res and $result in certain places to be equal to OK, so that the plugin exits with the state you want. (This is just at a glance; I did not dive deeply into the code or do further testing, and this is a bit beyond what we generally provide support for.)

Re: Don't send email when error is "Unknown Error"

Posted: Wed Mar 01, 2017 2:14 pm
by tmcdonald
Just checking in since we have not heard from you in a while. Did @rkennedy's post clear things up or has the issue otherwise been resolved?