check_esx3 : unable to raise alert for health & IO

Post by **sac1472** » Tue Mar 27, 2018 11:24 am

Hello,
How do I define thresholds to raise an alert for ESX runtime status, disk I/O, and networking? The check just prints the detected issues with an OK status and never goes to a CRITICAL or WARNING state.

Post by **scottwilkerson** » Tue Mar 27, 2018 1:18 pm

You would need to run the subcommand.
For example, on the io command add:

Code: Select all

-s read

Also, for your warning and critical values, if you just use 1 it means >1, so it would not alert if the value == 1.
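To illustrate that rule: in the standard Nagios plugin threshold format, a bare number N is read as the range 0..N, and an alert is raised only when the value falls outside that range. The snippet is a minimal sketch for integer values only; `outside_simple_threshold` is a hypothetical helper written for this illustration, not something check_esx3.pl provides.

```shell
# Hypothetical helper (not part of check_esx3.pl).
# Succeeds (returns 0) when VALUE is outside the range 0..LIMIT, i.e. when a
# bare-number threshold such as "-c LIMIT" would raise an alert.
# Integers only; full range syntax (10:, ~:10, @10:20) is not handled.
outside_simple_threshold() {
    [ "$1" -lt 0 ] || [ "$1" -gt "$2" ]
}

# A latency of 1 ms against "-c 1" stays inside 0..1, so no alert.
if outside_simple_threshold 1 1; then echo "alert"; else echo "no alert"; fi

# A latency of 2 ms against "-c 1" is outside 0..1, so it alerts.
if outside_simple_threshold 2 1; then echo "alert"; else echo "no alert"; fi
```

So to alert when the value is exactly 1, the threshold would have to be 0.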

Code: Select all

Usage: check_esx3.pl -D <data_center> | -H <host_name> [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See https://nagios-plugins.org/doc/extra-opts.html
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ means blank or not specified parameter) :
    Common options for VM, Host and DC :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            ^ all mem info
        * net - shows net info
            + usage - overall network usage in KB/s
            + receive - receive in KB/s
            + send - send in KB/s
            ^ all net info
        * io - shows disk io info
            + read - read latency in ms
            + write - write latency in ms
            ^ all disk io info
        * runtime - shows runtime info
            + status - overall host status (gray/green/red/yellow)
            + issues - all issues for the host
            ^ all runtime info
    VM specific :
        * cpu - shows cpu info
            + wait - CPU wait in ms
        * mem - shows mem info
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + active - active mem usage in MB
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMWare Tools status
    Host specific :
        * net - shows net info
            + nic - makes sure all active NICs are plugged in
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status
            + maintenance - shows whether host is in maintenance mode
            + list(vm) - list of VMWare machines and their statuses
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
    DC specific :
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
        * vmfs - shows Datastore info
            + (name) - info for datastore with name (name)
            ^ all datastore info
        * runtime - shows runtime info
            + list(vm) - list of VMWare machines and their statuses
            + listhost - list of VMWare esx host servers and their statuses
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations


Copyright (c) 2008 op5

Post by **sac1472** » Wed Mar 28, 2018 7:34 am

But that way I would have to set up separate checks for disk read latency, write latency, and so on, which I don't want to configure.

Regarding runtime, I'm unable to generate an alert even if there is a health/config issue. It just prints the errors with an OK status.

esx1.PNG

Also, if I provide health as the sub-command, it prints the health issues with an UNKNOWN status, which should be CRITICAL.

esx.PNG

I want:
1) Disk I/O read and write latency in one check, which should raise an alert if any latency exceeds the given thresholds.
2) Runtime with the health/issues sub-command, which should generate a critical alert if any health or config issue is detected, rather than showing UNKNOWN/OK.

I think @lmiltchev has an idea about this. He solved my session file issue for check_esx3, so @lmiltchev, could you please help me?

Post by **lmiltchev** » Wed Mar 28, 2018 9:52 am

Hi @sac1472,
I will try to help you with both issues.

1) Disk I/O read and write latency in one check, which should raise an alert if any latency exceeds the given thresholds.

The check_esx3.pl plugin is not designed to accept two sub-commands at the same time, e.g. "-s read -s write", "-s read,write", etc. However, you could still set up one check for both (read and write) via check_multi:

https://exchange.nagios.org/directory/P ... ti/details

You could set up a simple config in the libexec directory, e.g. "multi.cfg", something like this:

Code: Select all

command [ IO_read ] = check_esx3.pl -D <ip address> -f /path/to/the/auth.txt -l IO -s read -w <warning value> -c <critical value>
command [ IO_write ] = check_esx3.pl -D <ip address> -f /path/to/the/auth.txt -l IO -s write -w <warning value> -c <critical value>

Then, you could run:

Code: Select all

/usr/local/nagios/libexec/check_multi -f multi.cfg

Example:

Code: Select all

[nagios@main-nagios-xi libexec]$ ./check_multi -f multi.cfg
CRITICAL - 2 plugins checked, 1 critical (IO_write), 1 ok
[ 1] IO_read CHECK_ESX3.PL OK - io read latency=0 ms
[ 2] IO_write CHECK_ESX3.PL CRITICAL - io write latency=8 ms |check_multi::check_multi::plugins=2 time=1.771847 IO_read::check_esx3.pl::io_read=0ms;;1 IO_write::check_esx3.pl::io_write=8ms;;1

If you are unsure how to set up your service check after testing the plugin from the command line, please follow the steps outlined in the document below:

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

2) Runtime with the health/issues sub-command, which should generate a critical alert if any health or config issue is detected, rather than showing UNKNOWN/OK.

Are you saying that you are seeing errors in the output, but you are still getting an OK state? UNKNOWN errors are usually caused by misconfiguration.
It is possible that you are receiving an "OK" because (according to the output you showed us) you have "1 health issue(s)" and your critical threshold is "-c 1". In that case you haven't exceeded the critical threshold, so the exit code must be "0". Can you run your command a few times with different thresholds (and without passing any thresholds), and show the exit code each time? I just want to make sure that the plugin exits with the correct codes for you.

Examples:

Code: Select all

./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime
echo $?
./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -c 1
echo $?
./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -c 0
echo $?
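If reading raw exit codes gets tedious, a small wrapper can print the state name next to each code. This is only a sketch: `state_name` and `run_check` are hypothetical helpers, and the mapping is the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).

```shell
# Hypothetical helper: map a plugin exit code to its Nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "INVALID($1)" ;;
    esac
}

# Hypothetical helper: run any plugin command line, then print its exit code
# together with the state name, and pass the exit code through.
run_check() {
    "$@"
    rc=$?
    echo "exit code $rc => $(state_name "$rc")"
    return "$rc"
}

# Example, reusing one of the command lines above:
# run_check ./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -c 1
```

Running it once per threshold variant makes it easy to compare what Nagios would see for each.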

Post by **sac1472** » Fri Apr 06, 2018 10:26 am

Thanks @lmiltchev for the detailed response.
1) I have configured check_multi and it's working fine.

2) I was unable to generate a config issue on ESX for testing, but I found that one ESX host is in a red state while check_esx3 still shows an OK state. See the screenshot below.

Post by **lmiltchev** » Fri Apr 06, 2018 1:41 pm

OK, it seems that the runtime check includes various "sub-checks", e.g. con, health, storagehealth, etc. The thresholds will not work unless you specify a sub-command.

Try:

Code: Select all

./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -s status
echo $?

You can also try passing some thresholds, for example: "-c 1", "-c 0" to see if the output will change.

Note: If you don't specify a sub-command, the check will return "all runtime info".

* runtime - shows runtime info
    + con - connection state
    + health - checks cpu/storage/memory/sensor status and propagates worst state
        o listitems - list all available sensors(use for listing purpose only)
        o blackregexpflag - whether to treat blacklist as regexp
        b - blacklist status objects
    + storagehealth - storage status check
        o blackregexpflag - whether to treat blacklist as regexp
        b - blacklist status objects
    + temperature - temperature sensors
        o blackregexpflag - whether to treat blacklist as regexp
        b - blacklist status objects
    + sensor - threshold specified sensor
    + maintenance - shows whether host is in maintenance mode
    + list(vm) - list of VMWare machines and their statuses
    + status - overall object status (gray/green/red/yellow)
    + issues - all issues for the host
        b - blacklist issues
    ^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
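Since thresholds only apply when a sub-command is given, one way to cover several runtime sub-checks in a single service is to run each one and exit with the worst result. The snippet is a rough sketch, not a tested integration: `severity` and `worst_code` are hypothetical helpers, and the ordering OK < UNKNOWN < WARNING < CRITICAL is an assumption chosen for this example.

```shell
# Hypothetical helper: rank an exit code by severity.
# Assumed ordering for this sketch: OK(0) < UNKNOWN(3) < WARNING(1) < CRITICAL(2).
severity() {
    case "$1" in
        0) echo 0 ;;
        3) echo 1 ;;
        1) echo 2 ;;
        2) echo 3 ;;
    esac
}

# Hypothetical helper: print the worst exit code among the arguments.
# Any code outside 0..3 is treated as UNKNOWN (3).
worst_code() {
    worst=0
    for rc in "$@"; do
        case "$rc" in
            0|1|2|3) ;;
            *) rc=3 ;;
        esac
        if [ "$(severity "$rc")" -gt "$(severity "$worst")" ]; then
            worst=$rc
        fi
    done
    echo "$worst"
}

# Possible usage with the commands from this thread (not run here):
# ./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -s status; a=$?
# ./check_esx3.pl.newest -D xxx -f /tmp/VC6 -H xxx -l runtime -s health; b=$?
# exit "$(worst_code "$a" "$b")"
```

check_multi does the same kind of aggregation with more features, so this only makes sense if you want to avoid the extra plugin.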

Post by **sac1472** » Mon Apr 09, 2018 10:36 am

Yes, I'm able to get alerts with the sub-checks.
Ideally, I would like it to raise an alert without specifying any sub-checks; that would be the best setup. Let me know if that's possible. Otherwise I will go with check_multi and different sub-commands.

Thanks for your help.

You can mark this topic as closed.

Nagios Support Forum
