VMware Monitoring...

Engineer1974 · Post by **Engineer1974** » Thu Jan 06, 2022 4:09 pm

Folks,

I'm looking for some guidance on the best way to do VMware host resource and datastore monitoring in our environment. We have a vCenter Server managing our systems, so we'd like to centralize the monitoring of datastores, clusters and other shared resources through vCenter to avoid duplicating alerts, and then monitor the VMs separately as their individual OSes dictate. I haven't had much luck with the vCenter side of it, since it only seems to be able to monitor individual hosts or connect to vCenter for VM monitoring, but not datastores and clusters. Any advice would be appreciated.

Thanks!

ssax · Post by **ssax** » Fri Jan 07, 2022 11:09 am

The VMware configuration wizard should be able to monitor datastores, etc. through vCenter or by connecting directly to the ESXi hosts; see the guide here:

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

You can see all the options that the plugin supports by passing -h:

Code: Select all

[root@xid ~]# /usr/local/nagios/libexec/check_vmware_api.pl -h
check_vmware_api.pl 0.7.1

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

VMware ESX/vSphere plugin

Usage: check_vmware_api.pl -D <data_center> | -H <host_name> [ -C <cluster_name> ] [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ] [ -T <timeshift> ] [ -i <interval> ]
    [ -x <black_list> ] [ -o <additional_options> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See https://nagios-plugins.org/doc/extra-opts.html
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -C, --cluster=<clustername>
   ESX or ESXi clustername.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -x, --exclude=<black_list>
   Specify black list
 -o, --options=<additional_options>
   Specify additional command options (quickstats, ...)
 -T, --timestamp=<timeshift>
   Timeshift in seconds that could fix issues with "Unknown error". Use values like 5, 10, 20, etc
 -i, --interval=<sampling period>
   Sampling Period in seconds. Basic historic intervals: 300, 1800, 7200 or 86400. See config for any changes.
   Supports literval values to autonegotiate interval value: r - realtime interval, h<number> - historical interval specified by position.
   Default value is 20 (realtime). Since cluster does not have realtime stats interval other than 20(default realtime) is mandatory.
 -M, --maxsamples=<max sample count>
   Maximum number of samples to retrieve. Max sample number is ignored for historic intervals.
   Default value is 1 (latest available sample).
 --trace=<level>
   Set verbosity level of vSphere API request/respond trace
 --generate_test=<file>
   Generate a test case script from the executed command/subcommand and write it to <file>.   If <file> is "stdout", the test case script is written to stdout instead.
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ - blank or not specified parameter, o - options, T - timeshift value, b - blacklist) :
    VM specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            + wait - CPU wait time in ms
            + ready - CPU ready time in ms
            ^ all cpu info(no thresholds)
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + active - active mem usage in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
            + read - read disk usage in MB/s
            + write - write disk usage in MB/s
            ^ all disk io info(no thresholds)
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + status - overall object status (gray/green/red/yellow)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMware Tools status
            + issues - all issues for the host
            ^ all runtime info(except con and no thresholds)
    Host specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            + nic - makes sure all active NICs are plugged in
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status and propagates worst state
                o listitems - list all available sensors(use for listing purpose only)
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + storagehealth - storage status check
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + temperature - temperature sensors
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + sensor - threshold specified sensor
            + maintenance - shows whether host is in maintenance mode
                o maintwarn - sets warning state when host is in maintenance mode
                o maintcrit - sets critical state when host is in maintenance mode
            + list(vm) - list of VMware machines and their statuses
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
        * storage - shows Host storage info
            + adapter - list bus adapters
                b - blacklist adapters
            + lun - list SCSI logical units
                b - blacklist LUN's
            + path - list logical unit paths
                b - blacklist paths
            ^ show all storage info
        * uptime - shows Host uptime
                o quickstats - switch for query either PerfCounter values or Runtime info
        * device - shows Host specific device info
            + cd/dvd - list vm's with attached cd/dvd drives
                o listall - list all available devices(use for listing purpose only)
    DC specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + list(vm) - list of VMware machines and their statuses
            + listhost - list of VMware esx host servers and their statuses
            + listcluster - list of VMware clusters and their statuses
            + tools - VMware Tools status
                b - blacklist VM's
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(except cluster and tools and no thresholds)
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations
    Cluster specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(plus overhead and no thresholds)
        * cluster - shows cluster services info
            + effectivecpu - total available cpu resources of all hosts within cluster
            + effectivemem - total amount of machine memory of all hosts in the cluster
            + failover - VMware HA number of failures that can be tolerated
            + cpufairness - fairness of distributed cpu resource allocation
            + memfairness - fairness of distributed mem resource allocation
            ^ only effectivecpu and effectivemem values for cluster services
        * runtime - shows runtime info
            + list(vm) - list of VMware machines in cluster and their statuses
            + listhost - list of VMware esx host servers in cluster and their statuses
            + status - overall cluster status (gray/green/red/yellow)
            + issues - all issues for the cluster
                b - blacklist issues
            ^ all cluster runtime info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh


Copyright (c) 2008-2013 op5

There's also this 3rd party one that requires a VMA server:

https://exchange.nagios.org/directory/P ... re/details
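Coming back to check_vmware_api.pl and the original question (datastores and clusters through vCenter rather than per host): going by the usage text above, that would mean pointing the plugin at vCenter with -D, picking the vmfs command with -l, and adding -C for cluster-level checks. Here is a minimal sketch of that, not a tested configuration. The vCenter name vcenter.example.com, the datastore datastore1, the cluster Prod-Cluster, the authfile path, and the helper name vcenter_check are all made up for illustration. It defaults to a dry run that only prints the command line.

```shell
# Sketch only: the vCenter, datastore and cluster names and the authfile path are hypothetical.
PLUGIN=/usr/local/nagios/libexec/check_vmware_api.pl
VCENTER=vcenter.example.com                 # hypothetical vCenter address
AUTHFILE=/usr/local/nagios/etc/vmware.auth  # hypothetical path; holds username=<login> and password=<password>

# Build (and optionally run) a check that goes through vCenter via -D.
# $1 = command (-l), $2 = subcommand (-s, may be empty), $3 = cluster (-C, may be empty)
vcenter_check() {
    cmd=$1; sub=${2:-}; cluster=${3:-}
    set -- "$PLUGIN" -D "$VCENTER" -f "$AUTHFILE" -l "$cmd"
    if [ -n "$sub" ]; then set -- "$@" -s "$sub"; fi
    if [ -n "$cluster" ]; then set -- "$@" -C "$cluster"; fi
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"        # dry run: print the command line only
    else
        "$@"             # real run: execute the plugin
    fi
}

vcenter_check vmfs                     # all datastores visible to vCenter
vcenter_check vmfs datastore1          # a single datastore by name
vcenter_check mem usage Prod-Cluster   # memory usage for one cluster
```

Set DRY_RUN=0 once the names match your environment. Thresholds can then be added with -w and -c using the range format linked in the help text; check the units against the plugin's actual output before settling on values.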

WillH · Post by **WillH** » Fri Jan 07, 2022 11:17 am

You'll need to download and install the VMware SDK (referenced in the doc Sean linked), and you may need to check the account permissions on some of the required Perl components, depending on how your Linux server has been configured.
It took me and my team the better part of a morning to hunt that down.
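One quick way to narrow down this kind of permission problem is to check whether the SDK's Perl module loads for the nagios user as well as for root. This is a minimal sketch, assuming the module the plugin needs is VMware::VIRuntime (the runtime module from the vSphere SDK for Perl). The helper name perl_module_loads is made up for illustration.

```shell
# Return 0 if the given Perl module loads for the current user, non-zero otherwise.
perl_module_loads() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Assumption: VMware::VIRuntime is the SDK module the plugin depends on.
if perl_module_loads VMware::VIRuntime; then
    echo "VMware::VIRuntime loads for $(id -un)"
else
    echo "VMware::VIRuntime does NOT load for $(id -un)"
fi
```

Run it once as root and once after `su - nagios`. If it loads for root but not for nagios, that points at file permissions or a per-user Perl setup rather than a missing SDK.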

ssax · Post by **ssax** » Fri Jan 07, 2022 2:24 pm

Thanks @WillH!

Permission issues are more likely if there is a custom Perl layout (PERL5LIB, etc.), if other perl/cpan/cpanm environment variables have been set so that modules install per-user, or if you've changed the system umask from the defaults.

Usually, the issue is caused by a per-user PERL5LIB or other custom Perl envars being set by a custom /etc/profile.d script, or modified in the user's ~/.bashrc or ~/.bash_profile, in /etc/bashrc, or in another location.

You can generally get around it by checking env for any custom PERL envars as the root AND nagios users:

Code: Select all

env
su - nagios
env

If there are any set (there shouldn't be by default), I unset them and then run through the guide to install everything; it usually works properly for all users when installing as the root user.
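To make the env checks above easier to read, the output can be filtered down to just the Perl-related variables. This is a rough sketch, assuming the variables of interest all start with PERL (PERL5LIB, PERL_LOCAL_LIB_ROOT, PERL_MB_OPT, PERL_MM_OPT and so on). The function names are made up for illustration.

```shell
# List Perl-related environment variables for the current user.
# Prints nothing when none are set, which is the expected default.
perl_envars() {
    env | grep -E '^PERL[A-Z0-9_]*=' || true
}

# Unset every variable reported by perl_envars in the current shell only.
unset_perl_envars() {
    for v in $(perl_envars | cut -d= -f1); do
        unset "$v"
    done
}

perl_envars
# To check the nagios user as well (run as root):
#   su - nagios -c "env | grep -E '^PERL'"
```

unset_perl_envars only affects the current session, so whichever profile script sets the variables still has to be cleaned up separately.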

But once the system is already in that per-user state, I generally have to remove the custom Perl envars from all of those profile files and remove these files, as they would also be per-user for cpan:

Code: Select all

/root/.cpan/CPAN/MyConfig.pm
/home/nagios/.cpan/CPAN/MyConfig.pm

Then on the next cpan run (when you have the envars unset) it will rebuild the cpan MyConfig.pm file again and not try to install per-user.
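If you'd rather not delete the per-user CPAN configs outright, a slightly more cautious variant is to move them aside with a timestamp so they can be restored. Note that this is a swap for the plain removal described above, not what the guide prescribes. The function name set_aside_cpan_config is made up for illustration.

```shell
# Move a per-user CPAN config aside (instead of deleting it) so it can be restored later.
# $1 = path to MyConfig.pm; does nothing if the file is absent.
set_aside_cpan_config() {
    if [ -f "$1" ]; then
        mv "$1" "$1.bak.$(date +%Y%m%d%H%M%S)"
    fi
}

# The two per-user config paths listed above; run as root.
for f in /root/.cpan/CPAN/MyConfig.pm /home/nagios/.cpan/CPAN/MyConfig.pm; do
    set_aside_cpan_config "$f"
done
```

Since the renamed file is no longer called MyConfig.pm, the next cpan run should regenerate its config in the same way as after a delete.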

A combination of those items generally resolves the issue on most systems I run into.

Engineer1974 · Post by **Engineer1974** » Mon Jan 10, 2022 4:31 pm

Thanks for the feedback everyone, I've got the VMware SDK installed and functional and pulling stats from our hosts. I'll be trying to configure some thresholds later today, and see how it goes....

gsmith · Post by **gsmith** » Mon Jan 10, 2022 4:52 pm

Hi

Sounds good. Thanks for keeping us updated.

WillH · Post by **WillH** » Tue Jan 11, 2022 12:14 pm

Good news!

Nagios Support Forum
