
Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Mon Dec 10, 2018 1:20 pm
by vignesha
Hello Team,
We are monitoring CPU, memory, and hardware status for switches and firewalls, but we frequently receive UNKNOWN alerts for these services even though we have increased the timeout to 60 mins.
Therefore, instead of monitoring through the Perl plugins, we have switched to monitoring by OID.
Please find the attachment; we have configured the checks using the OIDs mentioned in it.

URLs:
We set up the monitoring with the help of the URLs below:
CPU: https://www.cisco.com/c/en/us/support/d ... -snmp.html
Memory: https://www.cisco.com/c/en/us/support/d ... emory.html
Hardware: https://community.cisco.com/t5/network- ... -p/1731110

1. We need to combine Total, Free, and Used memory usage into a single service. Is this possible? We also need to set warning and critical limits for it (i.e., warning 80%, critical 90%).
2. We need to set warning and critical thresholds for CPU usage as well (i.e., warning 80%, critical 90%).

Please help us with this.

Re: Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Mon Dec 10, 2018 4:29 pm
by npolovenko
Hello, @vignesha. To set thresholds for these checks, you need a plugin with conditional logic. It should compare the SNMP output with your thresholds and return an exit code depending on whether the value is above or below them.
https://nagios-plugins.org/doc/guidelines.html
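
As a rough illustration of that conditional logic, here is a minimal Python sketch. The exit codes 0, 1, 2, and 3 (OK, WARNING, CRITICAL, UNKNOWN) come from the plugin guidelines linked above; the function names, the "CPU" label, and the "higher is worse" comparison are assumptions made for this example, not part of any existing plugin.

```python
# Nagios plugin exit codes, as defined in the plugin development guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Return the exit code for a reading where a higher value is worse."""
    if value is None:
        return UNKNOWN  # no usable reading, e.g. the SNMP query failed
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def status_line(label, value, warn, crit):
    """Return (exit_code, output line); performance data follows the '|'."""
    code = evaluate(value, warn, crit)
    if code == UNKNOWN:
        return code, "CPU UNKNOWN - no value returned"
    text = f"CPU {LABELS[code]} - {label} = {value}%"
    perfdata = f"{label}={value}%;{warn};{crit}"
    return code, f"{text} | {perfdata}"


# Hypothetical reading of 9% with warning 80 and critical 90.
code, line = status_line("cpu_5min", 9, 80, 90)
print(line)
# A real plugin would finish with sys.exit(code) so Nagios sees the state.
```

A real plugin would also have to fetch the value over SNMP and handle timeouts, which this sketch leaves out.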

Your best bet is to find a plugin on the Nagios Exchange that measures CPU and memory. Most of the plugins on the Exchange can alert on thresholds.
https://exchange.nagios.org/
https://exchange.nagios.org/index.php?o ... sco%20snmp

Re: Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Tue Dec 11, 2018 12:21 pm
by vignesha
Hi,

Actually, the port capacity on our end is 20Mbs, and because of that the checks keep triggering alerts in Nagios. Hence we have planned to monitor directly through OIDs, and the check_snmp plugin is helping us do that.

Therefore, instead of “connection time” errors, we will get only genuine alerts in Nagios.

Is there any other way we can add the conditional logic manually?

Re: Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Tue Dec 11, 2018 2:03 pm
by npolovenko
@vignesha
"even we have increase the timeout to 60Mins."
The timeout is measured in seconds, so it's 60 seconds. Try increasing the timeout to 120 seconds.
"We need to combine Total, Free, Used Memory usage, as a single service is it possible. In addition, we need warning and critical limit need to set for the same. (i.e) warning 80% critical 90%"
You could use the check_multi plugin to combine multiple checks into one.
https://exchange.nagios.org/directory/P ... ti/details

Or you can look for a custom plugin on the exchange that measures multiple memory parameters.
https://exchange.nagios.org/index.php?o ... sco%20snmp

"We need to set warning and critical threshold limit for the CPU Usage as well. (i.e) warning 80% critical 90%"
Have you tried adding -w and -c thresholds to the check_snmp CPU usage check?
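
For reference, the plugin development guidelines describe a range notation for -w and -c values: a bare number N alerts when the value is below 0 or above N, "N:" alerts when the value is below N, "~:N" alerts when it is above N, and a leading "@" inverts the test. The Python sketch below models that notation; the function names are made up for this example, and whether your check_snmp build honors every form is something to verify on your system.

```python
import math


def parse_range(spec):
    """Parse '10', '10:', '~:10', '10:20' or '@10:20' into (start, end, inside).

    inside=True (the '@' prefix) means "alert when the value is inside the range".
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        start_text, end_text = "0", spec  # a bare number N means 0..N
    if start_text == "~":
        start = -math.inf  # '~' stands for negative infinity
    elif start_text == "":
        start = 0.0  # an omitted start defaults to 0
    else:
        start = float(start_text)
    end = math.inf if end_text == "" else float(end_text)
    if start > end:
        raise ValueError(f"range start exceeds end: {spec!r}")
    return start, end, inside


def alerts(value, spec):
    """Return True when the value should raise an alert for this range."""
    start, end, inside = parse_range(spec)
    within = start <= value <= end
    return within if inside else not within


# Hypothetical examples: an upper bound for CPU, a lower bound for free memory.
for value, spec in [(9, "80"), (514817848, "463325752"), (514817848, "100000000:")]:
    print(value, spec, "ALERT" if alerts(value, spec) else "ok")
```

The "N:" form is the one that fits a free-memory check, where a low value is the problem rather than a high one.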

Another option is to look for a custom plugin on the exchange:
https://exchange.nagios.org/index.php?o ... snmp%20cpu
"Actually from our end port capacity is 20Mbs because of that it keep on triggering alert in Nagios, hence we have paned to monitor directly using through OID’s and were Check_snmp plugin is helping to monitor."
I'm not sure how a 20MBs bandwidth limitation can make SNMP checks time out. What plugin were you using originally? Can you demonstrate the command?

Re: Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Wed Dec 12, 2018 7:41 am
by vignesha
Hi Npolovenko,

As per your request: we had previously configured monitoring of CPU, memory, and hardware status with the attached plugins.
Since then we have configured the checks with the check_snmp plugin, but we are facing issues with CPU and memory usage.
The CPU usage check returns the 5 sec, 1 min, and 5 min values. We need to set the thresholds to 80% warning and 90% critical, but as the output below shows, even when I reduced the thresholds to 30% warning and 40% critical, the result is still OK.
CPU:
[nagios@ xxxxxxx ~]$ /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.2.1.56.0 -o 1.3.6.1.4.1.9.2.1.57.0 -o 1.3.6.1.4.1.9.2.1.58.0 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxx' -x des -m IF-MIB -w 80,70,60 -c 90,80,70
SNMP OK - 9 10 9 | SNMPv2-SMI::enterprises.9.2.1.56.0=9;80;90; SNMPv2-SMI::enterprises.9.2.1.57.0=10;70;80; SNMPv2-SMI::enterprises.9.2.1.58.0=9;60;70;
[root@ xxxxxxx libexec]# /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.2.1.56.0 -o 1.3.6.1.4.1.9.2.1.57.0 -o 1.3.6.1.4.1.9.2.1.58.0 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxxx' -x des -m IF-MIB -w 30,20,10 -c 40,30,10
SNMP OK - 8 8 8 | SNMPv2-SMI::enterprises.9.2.1.56.0=8;30;40; SNMPv2-SMI::enterprises.9.2.1.57.0=8;20;30; SNMPv2-SMI::enterprises.9.2.1.58.0=8;10;10;

The memory check returns Total, Used, and Free memory. Is there any way to combine all three into one with the check_snmp plugin? The used and free memory values are returned in octets, so we tried to set the threshold limits as follows:
Ex.
514817856 x 100 / 80% = 4118542848
514817856 x 100 / 90% = 4633257528
We first tried 411854284 as the warning and 463325752 as the critical threshold, but the check returned only CRITICAL. When we then used 4118542848 as the warning and 4633257528 as the critical threshold, it returned OK.

Actually, I do not understand how to set threshold limits for octet values. If I get help with this, the threshold issue will be resolved.
I also need help with one more thing: is it possible to combine TOTAL, USED, and FREE memory using check_snmp?
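
One way to reason about octet thresholds is to take a percentage of the pool total (total x 80 / 100) instead of scaling the free value, or to convert used and free into a percentage and compare that with 80 and 90. The Python sketch below does both with the used and free values that the check_snmp runs in this post report for pool index 1. It assumes that used plus free equals the pool total, and the function names are invented for this example.

```python
def used_percent(used_bytes, free_bytes):
    """Percentage of the pool in use, assuming total = used + free."""
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("pool total must be positive")
    return 100.0 * used_bytes / total


def byte_threshold(total_bytes, percent):
    """Convert a percentage threshold into bytes (integer, rounded down)."""
    return total_bytes * percent // 100


def memory_state(used_bytes, free_bytes, warn_pct=80, crit_pct=90):
    """Return 0 (OK), 1 (WARNING) or 2 (CRITICAL) based on used memory."""
    pct = used_percent(used_bytes, free_bytes)
    if pct >= crit_pct:
        return 2
    if pct >= warn_pct:
        return 1
    return 0


# Values reported for pool index 1 in this post.
used, free = 290384016, 514817856
total = used + free
print(f"used: {used_percent(used, free):.1f}%")
print("warning above", byte_threshold(total, 80), "bytes used")
print("critical above", byte_threshold(total, 90), "bytes used")
print("state:", memory_state(used, free))
```

Byte thresholds computed this way apply to the used-memory OID. For the free-memory OID the comparison has to be inverted, because a low value is the bad case.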

MEMORY:
TOTAL MEMORY:
[nagios@xxxxxxx ~]$ /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.3.6.6.0 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxx' -x des -m IF-MIB
SNMP OK - 805400576 | SNMPv2-SMI::enterprises.9.3.6.6.0=805400576
USED MEMORY:
[nagios@ xxxxxxx ~]$ /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.9.48.1.1.1.5.1 -o 1.3.6.1.4.1.9.9.48.1.1.1.5.2 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxx' -x des -m IF-MIB -w 2400000000,1600000 -c 2700000000,1800000
SNMP OK - 290384016 195072 | SNMPv2-SMI::enterprises.9.9.48.1.1.1.5.1=290384016;2400000000;2700000000; SNMPv2-SMI::enterprises.9.9.48.1.1.1.5.2=195072;1600000;1800000;
Free Memory:
[root@xxxxxxxxxx libexec]# /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.9.48.1.1.1.6.1 -o 1.3.6.1.4.1.9.9.48.1.1.1.6.2 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxx' -x des -m IF-MIB -w 4118451136 -c 4633257528
SNMP OK - 514817856 20771552 | SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1=514817856;4118451136;4633257528; SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.2=20771552
[root@xxxxxxxxx libexec]# /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o 1.3.6.1.4.1.9.9.48.1.1.1.6.1 -o 1.3.6.1.4.1.9.9.48.1.1.1.6.2 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxxx' -x des -m IF-MIB -w 411845113 -c 463325752
SNMP CRITICAL - *514817848* 20771552 | SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.1=514817848;411845113;463325752; SNMPv2-SMI::enterprises.9.9.48.1.1.1.6.2=20771552
HARDWARE STATUS:
[nagios@xxxxxxx ~]$ /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o enterprises.9.9.13.1.3.1.3.1 -o enterprises.9.9.13.1.3.1.3.2 -o enterprises.9.9.13.1.3.1.3.3 -o enterprises.9.9.13.1.3.1.3.11 -o enterprises.9.9.13.1.3.1.3.12 -o enterprises.9.9.13.1.3.1.3.13 -o enterprises.9.9.13.1.3.1.3.21 -o enterprises.9.9.13.1.3.1.3.22 -o enterprises.9.9.13.1.3.1.3.23 -o enterprises.9.9.13.1.3.1.3.24 -o enterprises.9.9.13.1.3.1.3.25 -o enterprises.9.9.13.1.3.1.3.26 -o enterprises.9.9.13.1.3.1.3.27 -o enterprises.9.9.13.1.3.1.3.41 -o enterprises.9.9.13.1.3.1.3.42 -o enterprises.9.9.13.1.5.1.2.1 -o enterprises.9.9.13.1.5.1.2.2 -o enterprises.9.9.13.1.4.1.2.1 -o enterprises.9.9.13.1.4.1.2.2 -o enterprises.9.9.13.1.4.1.2.3 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd='xxxxxx' -x des -m IF-MIB -c 50,50,50,50,50,50,50,50,50,50,50,50,50,50,50
SNMP OK - 23 31 27 22 30 25 28 22 37 44 38 43 27 24 22 "Power Supply 1" "Power Supply 2" "Chassis Fan Tray 1" "Power Supply 1 Fan" "Power Supply 2 Fan" | SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.1=23;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.2=31;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.3=27;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.11=22;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.12=30;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.13=25;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.21=28;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.22=22;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.23=37;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.24=44;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.25=38;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.26=43;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.27=27;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.41=24;;50; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.42=22;;50;
Critical Alert:
[root@ xxxxxx libexec]# /usr/local/nagios/libexec/check_snmp -H xx.xxx.xx.xxx -p 161 -o enterprises.9.9.13.1.3.1.3.1 -o enterprises.9.9.13.1.3.1.3.2 -o enterprises.9.9.13.1.3.1.3.3 -o enterprises.9.9.13.1.3.1.3.11 -o enterprises.9.9.13.1.3.1.3.12 -o enterprises.9.9.13.1.3.1.3.13 -o enterprises.9.9.13.1.3.1.3.21 -o enterprises.9.9.13.1.3.1.3.22 -o enterprises.9.9.13.1.3.1.3.23 -o enterprises.9.9.13.1.3.1.3.24 -o enterprises.9.9.13.1.3.1.3.25 -o enterprises.9.9.13.1.3.1.3.26 -o enterprises.9.9.13.1.3.1.3.27 -o enterprises.9.9.13.1.3.1.3.41 -o enterprises.9.9.13.1.3.1.3.42 -o enterprises.9.9.13.1.5.1.2.1 -o enterprises.9.9.13.1.5.1.2.2 -o enterprises.9.9.13.1.4.1.2.1 -o enterprises.9.9.13.1.4.1.2.2 -o enterprises.9.9.13.1.4.1.2.3 -P 3 --seclevel=authNoPriv --secname=nagios --authproto=MD5 --authpasswd=' xxxxxx ' -x des -m IF-MIB -c 30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
SNMP CRITICAL - 24 *32* 28 23 *31* 27 30 24 *39* *46* *39* *45* 29 26 24 "Power Supply 1" "Power Supply 2" "Chassis Fan Tray 1" "Power Supply 1 Fan" "Power Supply 2 Fan" | SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.1=24;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.2=32;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.3=28;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.11=23;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.12=31;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.13=27;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.21=30;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.22=24;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.23=39;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.24=46;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.25=39;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.26=45;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.27=29;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.41=26;;30; SNMPv2-SMI::enterprises.9.9.13.1.3.1.3.42=24;;30;

As you suggested, we increased the timeout to 120 seconds. Please find the resulting errors below.
Commands:
$USER1$/check_snmp_load.pl -H $HOSTADDRESS$ $ARG1$ -t 60
$USER1$/check_snmp_mem.pl -H $HOSTADDRESS$ $ARG1$ -t 60
$USER1$/check_snmp_environment.pl -H $HOSTADDRESS$ $ARG1$ $ARG2$ -t 60

[root@xxxxxxxx libexec]# /usr/local/nagios/libexec/check_snmp_load.pl -H xx.xxx.xx.xxx -l nagios -x xxxxxxx -L md5 -T cisco -w 80,70,60 -c 90,80,70 -f -t 120
Timeout must be >1 and <60 !
Usage: /usr/local/nagios/libexec/check_snmp_load.pl [-v] -H <host> -C <snmp_community> [-2] | (-l login -x passwd [-X pass -L <authp>,<privp>]) [-p <port>] -w <warn level> -c <crit level> -T=[stand|netsl|netsc|as400|cisco|cata|nsc|fg|bc|nokia|hp|lp|hpux] [-f] [-t <timeout>] [-V]
[root@xxxxxxx libexec]# /usr/local/nagios/libexec/check_snmp_mem.pl -H xx.xxx.xx.xxx -l nagios -x xxxxxxx -L md5 -I -w 80 -c 90 -f -t 120
Timeout must be >1 and <60 !
Usage: /usr/local/nagios/libexec/check_snmp_mem.pl [-v] -H <host> -C <snmp_community> [-2] | (-l login -x passwd [-X pass -L <authp>,<privp>]) [-p <port>] -w <warn level> -c <crit level> [-I|-N|-E] [-f] [-m] [-t <timeout>] [-V]

Thanks for the URLs. From them we found a Python plugin, but it can monitor hardware status only over SNMP v2. If possible, please suggest a Python plugin that supports v3.

Please provide an update on CPU and memory as well.

Re: Network Device CPU/HARDWARE STATUS/MEMORY Monitoring

Posted: Thu Dec 13, 2018 12:06 pm
by npolovenko
@vignesha, the check_snmp plugin is a generic plugin, and its settings need to be adjusted to the output of the OID that you're checking. For example, your CPU check returns "9 10 9" for the load, but your warning and critical thresholds are 80,70,60 and 90,80,70. Since every value is below its threshold, this will show an OK state.
I suggest adding one of the following flags to the check_snmp command in order to try to convert the rate:
--rate
    Enable rate calculation. See 'Rate Calculation' below
--rate-multiplier
    Converts rate per second. For example, set to 60 to convert to per minute
You will have to experiment with these options until the output is on the same scale as your thresholds.

However, if you used the check_snmp_load.pl plugin, it would take care of this calculation for you.
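
To make the OK result for the CPU check easier to follow, here is a small Python sketch of how comma-separated thresholds line up with the OIDs by position. It is a simplified model (alert when a value is strictly above its threshold), not the plugin's actual code, and the function name is made up for this example.

```python
def worst_state(values, warns, crits):
    """Pair each value with its positional thresholds; return the worst state.

    0 = OK, 1 = WARNING, 2 = CRITICAL. Simplified: alert when value > threshold.
    """
    state = 0
    for value, warn, crit in zip(values, warns, crits):
        if value > crit:
            state = max(state, 2)
        elif value > warn:
            state = max(state, 1)
    return state


# The two CPU runs from the earlier post: every value is below its threshold.
print(worst_state([9, 10, 9], [80, 70, 60], [90, 80, 70]))
print(worst_state([8, 8, 8], [30, 20, 10], [40, 30, 10]))
```

Under this model both runs stay OK, because none of the readings exceeds the threshold in the same position.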

To combine memory checks you need to use the check_snmp_mem.pl plugin. I see that it says you can't set the timeout to 120 but you could try changing the source code. Change this line:
if (!defined($o_timeout)) {$o_timeout=5;}
To:
if (!defined($o_timeout)) {$o_timeout=120;}
This will change the default timeout value. Don't add the -t argument to the command.
Also, the Net::SNMP Perl module has a default maximum timeout of 60 seconds. Let's increase that to 120. Edit this file:
/usr/share/perl5/Net/SNMP/Transport.pm
Find this line:
sub TIMEOUT_MAXIMUM() { 60.0 }
And change it to:
sub TIMEOUT_MAXIMUM() { 120.0 }
That will set the max timeout to 120 seconds.

Additionally, you need to edit the /usr/local/nagios/etc/nagios.cfg file and change:
service_check_timeout=60
to
service_check_timeout=120
And then run:
service nagios restart