Monitor Network I/O Disk I/O and Latency for ESX

Posted: Thu Nov 26, 2020 10:57 am
by pratikmehta003
Hi There,

For ESX, is there any way to monitor network I/O, disk I/O, and latency?

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Tue Dec 01, 2020 1:45 pm
by ssax
You should be able to use Configure > Configuration Wizards > VMware and point it at your ESX host. It should monitor those by default; they are the NET and IO checks.

https://assets.nagios.com/downloads/nag ... ios-XI.pdf

To check Network:

[nagios@xid ~]# /usr/local/nagios/libexec/check_vmware_api.pl -H "192.168.X.X" -f "/usr/local/nagiosxi/etc/components/vmware/192_168_X_X_auth.txt" -l "NET"
CHECK_VMWARE_API.PL OK - net receive=3.00 KBps, send=0.00 KBps, all 1 NICs are connected | net_receive=3.00;; net_send=0.00;; OK_NICs=1;; Bad_NICs=0;;

To check Disk:

[nagios@xid ~]# /usr/local/nagios/libexec/check_vmware_api.pl -H "192.168.X.X" -f "/usr/local/nagiosxi/etc/components/vmware/192_168_X_X_auth.txt" -l "IO"
CHECK_VMWARE_API.PL OK - io commands aborted=0, io bus resets=0, io read latency=0 ms, write latency=0 ms, kernel latency=0 ms, device latency=0 ms, queue latency=0 ms | io_aborted=0;; io_busresets=0;; io_read=0ms;; io_write=0ms;; io_kernel=0ms;; io_device=0ms;; io_queue=0ms;;
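
If you want to pull a single number out of that output (for a quick script or a sanity check), here is a minimal sketch. `perfdata_value` is a hypothetical helper, not something shipped with the plugin, and it assumes the `label=value` pairs after the `|` are separated by spaces, as in the sample output above.

```shell
#!/bin/sh
# perfdata_value LABEL
# Reads one plugin output line on stdin and prints the numeric value of LABEL.
# Hypothetical helper for illustration; not part of check_vmware_api.pl.
perfdata_value() {
    # Drop everything up to the first "|" (the human-readable part),
    # put each label=value pair on its own line, then print only the
    # leading digits of the pair whose label matches.
    sed 's/^[^|]*|//' | tr ' ' '\n' |
        sed -n "s/^$1=\([0-9.]*\).*/\1/p"
}

# Sample line copied from the IO check output in this thread.
sample='CHECK_VMWARE_API.PL OK - io commands aborted=0, io bus resets=0, io read latency=0 ms, write latency=0 ms, kernel latency=0 ms, device latency=0 ms, queue latency=0 ms | io_aborted=0;; io_busresets=0;; io_read=0ms;; io_write=0ms;; io_kernel=0ms;; io_device=0ms;; io_queue=0ms;;'

# Prints the read latency value (the unit suffix "ms" is stripped).
printf '%s\n' "$sample" | perfdata_value io_read
```

The same helper should work for the NET output, for example with the label `net_receive`.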

To see everything that the plugin supports, do this:

[nagios@xid ~]# /usr/local/nagios/libexec/check_vmware_api.pl -h
check_vmware_api.pl 0.7.1

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

VMware ESX/vSphere plugin

Usage: check_vmware_api.pl -D <data_center> | -H <host_name> [ -C <cluster_name> ] [ -N <vm_name> ]
    -u <user> -p <pass> | -f <authfile>
    -l <command> [ -s <subcommand> ] [ -T <timeshift> ] [ -i <interval> ]
    [ -x <black_list> ] [ -o <additional_options> ]
    [ -t <timeout> ] [ -w <warn_range> ] [ -c <crit_range> ]
    [ -V ] [ -h ]

 -?, --usage
   Print usage information
 -h, --help
   Print detailed help screen
 -V, --version
   Print version information
 --extra-opts=[section][@file]
   Read options from an ini file. See https://nagios-plugins.org/doc/extra-opts.html
   for usage and examples.
 -H, --host=<hostname>
   ESX or ESXi hostname.
 -C, --cluster=<clustername>
   ESX or ESXi clustername.
 -D, --datacenter=<DCname>
   Datacenter hostname.
 -N, --name=<vmname>
   Virtual machine name.
 -u, --username=<username>
   Username to connect with.
 -p, --password=<password>
   Password to use with the username.
 -f, --authfile=<path>
   Authentication file with login and password. File syntax :
   username=<login>
   password=<password>
 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.
 -l, --command=COMMAND
   Specify command type (CPU, MEM, NET, IO, VMFS, RUNTIME, ...)
 -s, --subcommand=SUBCOMMAND
   Specify subcommand
 -S, --sessionfile=SESSIONFILE
   Specify a filename to store sessions for faster authentication
 -x, --exclude=<black_list>
   Specify black list
 -o, --options=<additional_options>
   Specify additional command options (quickstats, ...)
 -T, --timestamp=<timeshift>
   Timeshift in seconds that could fix issues with "Unknown error". Use values like 5, 10, 20, etc
 -i, --interval=<sampling period>
   Sampling Period in seconds. Basic historic intervals: 300, 1800, 7200 or 86400. See config for any changes.
   Supports literval values to autonegotiate interval value: r - realtime interval, h<number> - historical interval specified by position.
   Default value is 20 (realtime). Since cluster does not have realtime stats interval other than 20(default realtime) is mandatory.
 -M, --maxsamples=<max sample count>
   Maximum number of samples to retrieve. Max sample number is ignored for historic intervals.
   Default value is 1 (latest available sample).
 --trace=<level>
   Set verbosity level of vSphere API request/respond trace
 --generate_test=<file>
   Generate a test case script from the executed command/subcommand and write it to <file>.   If <file> is "stdout", the test case script is written to stdout instead.
 -t, --timeout=INTEGER
   Seconds before plugin times out (default: 30)
 -v, --verbose
   Show details for command-line debugging (can repeat up to 3 times)
Supported commands(^ - blank or not specified parameter, o - options, T - timeshift value, b - blacklist) :
    VM specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            + wait - CPU wait time in ms
            + ready - CPU ready time in ms
            ^ all cpu info(no thresholds)
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
            + swapin - swapin mem usage in MB
            + swapout - swapout mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + active - active mem usage in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk I/O info
            + usage - overall disk usage in MB/s
            + read - read disk usage in MB/s
            + write - write disk usage in MB/s
            ^ all disk io info(no thresholds)
        * runtime - shows runtime info
            + con - connection state
            + cpu - allocated CPU in MHz
            + mem - allocated mem in MB
            + state - virtual machine state (UP, DOWN, SUSPENDED)
            + status - overall object status (gray/green/red/yellow)
            + consoleconnections - console connections to VM
            + guest - guest OS status, needs VMware Tools
            + tools - VMware Tools status
            + issues - all issues for the host
            ^ all runtime info(except con and no thresholds)
    Host specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            + nic - makes sure all active NICs are plugged in
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + con - connection state
            + health - checks cpu/storage/memory/sensor status and propagates worst state
                o listitems - list all available sensors(use for listing purpose only)
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + storagehealth - storage status check
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + temperature - temperature sensors
                o blackregexpflag - whether to treat blacklist as regexp
                b - blacklist status objects
            + sensor - threshold specified sensor
            + maintenance - shows whether host is in maintenance mode
                o maintwarn - sets warning state when host is in maintenance mode
                o maintcrit - sets critical state when host is in maintenance mode
            + list(vm) - list of VMware machines and their statuses
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(health, storagehealth, temperature and sensor are represented as one value and no thresholds)
        * service - shows Host service info
            + (names) - check the state of one or several services specified by (names), syntax for (names):<service1>,<service2>,...,<serviceN>
            ^ show all services
        * storage - shows Host storage info
            + adapter - list bus adapters
                b - blacklist adapters
            + lun - list SCSI logical units
                b - blacklist LUN's
            + path - list logical unit paths
                b - blacklist paths
            ^ show all storage info
        * uptime - shows Host uptime
                o quickstats - switch for query either PerfCounter values or Runtime info
        * device - shows Host specific device info
            + cd/dvd - list vm's with attached cd/dvd drives
                o listall - list all available devices(use for listing purpose only)
    DC specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemhz - CPU usage in MHz
                o quickstats - switch for query either PerfCounter values or Runtime info
            ^ all cpu info
                o quickstats - switch for query either PerfCounter values or Runtime info
        * mem - shows mem info
            + usage - mem usage in percentage
                o quickstats - switch for query either PerfCounter values or Runtime info
            + usagemb - mem usage in MB
                o quickstats - switch for query either PerfCounter values or Runtime info
            + swap - swap mem usage in MB
            + overhead - additional mem used by VM Server in MB
            + overall - overall mem used by VM Server in MB
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
            ^ all mem info(except overall and no thresholds)
        * net - shows net info
            + usage - overall network usage in KBps(Kilobytes per Second)
            + receive - receive in KBps(Kilobytes per Second)
            + send - send in KBps(Kilobytes per Second)
            ^ all net info(except usage and no thresholds)
        * io - shows disk io info
            + aborted - aborted commands count
            + resets - bus resets count
            + read - read latency in ms (totalReadLatency.average)
            + write - write latency in ms (totalWriteLatency.average)
            + kernel - kernel latency in ms
            + device - device latency in ms
            + queue - queue latency in ms
            ^ all disk io info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
        * runtime - shows runtime info
            + list(vm) - list of VMware machines and their statuses
            + listhost - list of VMware esx host servers and their statuses
            + listcluster - list of VMware clusters and their statuses
            + tools - VMware Tools status
                b - blacklist VM's
            + status - overall object status (gray/green/red/yellow)
            + issues - all issues for the host
                b - blacklist issues
            ^ all runtime info(except cluster and tools and no thresholds)
        * recommendations - shows recommendations for cluster
            + (name) - recommendations for cluster with name (name)
            ^ all clusters recommendations
    Cluster specific :
        * cpu - shows cpu info
            + usage - CPU usage in percentage
            + usagemhz - CPU usage in MHz
            ^ all cpu info
        * mem - shows mem info
            + usage - mem usage in percentage
            + usagemb - mem usage in MB
            + swap - swap mem usage in MB
                o listvm - turn on/off output list of swapping VM's
            + memctl - mem used by VM memory control driver(vmmemctl) that controls ballooning
                o listvm - turn on/off output list of ballooning VM's
            ^ all mem info(plus overhead and no thresholds)
        * cluster - shows cluster services info
            + effectivecpu - total available cpu resources of all hosts within cluster
            + effectivemem - total amount of machine memory of all hosts in the cluster
            + failover - VMware HA number of failures that can be tolerated
            + cpufairness - fairness of distributed cpu resource allocation
            + memfairness - fairness of distributed mem resource allocation
            ^ only effectivecpu and effectivemem values for cluster services
        * runtime - shows runtime info
            + list(vm) - list of VMware machines in cluster and their statuses
            + listhost - list of VMware esx host servers in cluster and their statuses
            + status - overall cluster status (gray/green/red/yellow)
            + issues - all issues for the cluster
                b - blacklist issues
            ^ all cluster runtime info
        * vmfs - shows Datastore info
            + (name) - free space info for datastore with name (name)
                o used - output used space instead of free
                o brief - list only alerting volumes
                o regexp - whether to treat name as regexp
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh
            ^ all datastore info
                o used - output used space instead of free
                o brief - list only alerting volumes
                o blacklistregexp - whether to treat blacklist as regexp
                b - blacklist VMFS's
                T (value) - timeshift to detemine if we need to refresh


Copyright (c) 2008-2013 op5

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Wed Dec 02, 2020 2:19 am
by pratikmehta003
Thanks for sharing this.
So these metrics can be configured on the host and on the VMs separately as well, right?

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Wed Dec 02, 2020 4:05 pm
by ssax
The plugin supports multiple options:

VM specific
Host specific
DC specific
Cluster specific

Based on the plugin help, the NET/IO ones are only available when querying these:

VM specific
Host specific
DC specific
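
To make the host-versus-VM difference concrete, here is a rough sketch that only prints the two command lines rather than running them. `vmware_cmd` is a hypothetical helper, and the host, auth file path, and VM name are placeholders. The only plugin flags used are `-H`, `-f`, `-l`, and `-N` from the usage line in the help output.

```shell
#!/bin/sh
# Plugin path as used elsewhere in this thread.
PLUGIN=/usr/local/nagios/libexec/check_vmware_api.pl

# vmware_cmd ESX_HOST AUTHFILE COMMAND [VM_NAME]
# Hypothetical helper: prints the command line instead of executing it.
# Values are not quoted, so names containing spaces would need extra care.
vmware_cmd() {
    cmd="$PLUGIN -H $1 -f $2 -l $3"
    # Adding -N asks about one guest VM instead of the host itself.
    if [ -n "${4:-}" ]; then
        cmd="$cmd -N $4"
    fi
    printf '%s\n' "$cmd"
}

# Placeholder values; substitute your own host, auth file, and VM name.
vmware_cmd 192.168.X.X /path/to/auth.txt NET             # host-level network check
vmware_cmd 192.168.X.X /path/to/auth.txt IO example-vm   # VM-level disk I/O check
```

Note that, going by the help output, the `io` command reports latency values for a host but throughput in MB/s (usage, read, write) for a VM, so the two levels do not expose identical metrics.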

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Thu Dec 03, 2020 6:50 am
by pratikmehta003
Okay, let me check. I'm just testing with one ESX host for now.

Another question: when I select the option to monitor a guest VM, this will monitor the metrics of the guest VMs under the ESX host name, but does it also add the guest VMs separately?

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Thu Dec 03, 2020 5:52 pm
by ssax
Each guest VM you select will show up as a separate service under that single host.

If you want to monitor the OS metrics inside the VM (and have a host for it) you would set it up like one of your other Windows/Linux hosts and monitor it via one of the other wizards for that OS type.

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Mon Dec 07, 2020 10:09 am
by pratikmehta003
We are monitoring the OS separately anyway.

The customer wants to monitor network I/O, disk I/O, and latency at two levels:
1. on the ESX host itself
2. for the VMs residing on the ESX host

So should I select the "monitor a guest VM" option in the VMware wizard, or is there some other way?
I added our two ESX hosts with the first option, and I can see the network and latency services listed.

Apart from the above, I also want to know how to configure the thresholds or conditions for these parameters. Any insights on that would help.

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Mon Dec 07, 2020 6:37 pm
by ssax
When monitoring an ESX Host, choose Monitor the VMware host.

When monitoring a VM, choose Monitor a guest VM on the VMWare host.

Please see the help output of the plugin:

/usr/local/nagios/libexec/check_vmware_api.pl --help

It lists the threshold format here (linked from the help output):

http://nagiosplug.sourceforge.net/devel ... HOLDFORMAT

 -w, --warning=THRESHOLD
   Warning threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.
 -c, --critical=THRESHOLD
   Critical threshold. See
   http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
   for the threshold format. By default, no threshold is set.

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Wed Dec 09, 2020 6:23 am
by pratikmehta003
I need some help. I figured out the information, but can you assist with the threshold part? I have attached some images, and below is an example of what I'm trying to use.

For IO, if I take the device/read parameter, how do I need to set the threshold? If I set it as below:
-w 0 -c 1 -s 'read' then I get the output as OK (image attached). I get the same for the other IO subcommand as well.
Attachments: device latency screenshot.PNG, Read latency screenshot.PNG

Re: Monitor Network I/O Disk I/O and Latency for ESX

Posted: Wed Dec 09, 2020 6:33 pm
by ssax
Your formatting is correct; you just need to define what a valid latency would be for your system.

From your current output, it reports 0 ms for the time taken, which is good (the lower the latency/response in ms, the better). The check only compares that result against the -w and -c values you pass in.

It's really up to you to define what latency/response in ms would be considered high, impactful, or abnormal for your environment. I'd start with 25 ms/50 ms, but you may need to adjust that if it keeps hitting those values; it will be unique to your system. For example:

/usr/local/nagios/libexec/check_vmware_api.pl -H "10.1.44.107" -f "/usr/local/nagiosxi/etc/components/vmware/ucpresxihst02_VMware_Host_auth.txt" -l "IO" -s 'read' -w 25 -c 50

That will warn if it's above 25ms and go critical if greater than 50ms.

The same goes for the other command.
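
To show how the comparison works, here is a small sketch of the range rules from the threshold format page linked above. It is only an approximation for illustration, not the plugin's actual code, and `in_alert_range` and `classify` are hypothetical names. Under those rules a bare number N means the range 0..N, and the check alerts when the value falls outside it.

```shell
#!/bin/sh
# in_alert_range VALUE RANGE
# Exit status 0 if VALUE should raise an alert for RANGE, 1 otherwise.
# Hypothetical helper that follows the documented range rules.
in_alert_range() {
    awk -v v="$1" -v r="$2" 'BEGIN {
        inside = 0
        # A leading "@" inverts the rule: alert when the value is inside the range.
        if (substr(r, 1, 1) == "@") { inside = 1; r = substr(r, 2) }
        # A bare number N means the range 0..N.
        if (index(r, ":") == 0) { lo = 0; hi = r }
        else { lo = substr(r, 1, index(r, ":") - 1); hi = substr(r, index(r, ":") + 1) }
        below = (lo != "~" && v + 0 < lo + 0)   # "~" as start means no lower bound
        above = (hi != "" && v + 0 > hi + 0)    # an empty end means no upper bound
        outside = (below || above)
        alert = inside ? !outside : outside
        exit (alert ? 0 : 1)
    }'
}

# classify VALUE WARN CRIT: critical takes precedence over warning.
classify() {
    if in_alert_range "$1" "$3"; then echo CRITICAL
    elif in_alert_range "$1" "$2"; then echo WARNING
    else echo OK
    fi
}

classify 0 25 50    # 0 ms latency with -w 25 -c 50
classify 30 25 50   # above the warning range
classify 60 25 50   # above the critical range
classify 0 0 1      # the earlier -w 0 -c 1 attempt
```

The last call explains the earlier OK result: a latency of 0 ms is still inside the range 0..0, so -w 0 never fires until the value rises above 0.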