Monitoring F5 Load balancers

Post by **aap** » Fri Jul 15, 2016 11:58 am

Thanks for your help.

see attached

Post by **tgriep** » Fri Jul 15, 2016 12:32 pm

Can you log in to the XI server as root, run the following commands, and post the output?

/usr/local/nagios/libexec/check_f5.pl -H 10.127.220.150 -m fan -t 30 -C <Communityname>
/usr/local/nagios/libexec/check_f5.pl --help

Replace <Communityname> with the SNMP community string configured on your device.
If the above works, you will need to add the community string to your service check.

Post by **aap** » Fri Jul 15, 2016 12:45 pm

See both results below

Average fan speed 10546 rpm.|fan_speed=10546;;;0;

Usage: /usr/local/nagios/libexec/check_f5.pl -H host [-C community] [-v][-d][-h][-M][-t timout]
[--no-perf] [--no-optimise]
[-f cachefile] [-x cacheexpiretime]
[-m modules]

-H --host : Specify F5 hostname (f5)
Can be used multiple times; the first to answer is used.
-C --community : Specify SNMP community string (public)
-d --debug : Debug mode. Use multiple times for more detail
-h --help : Show help
-M --mrtg : MRTG mode. Also --no-mrtg for Nagios mode.
-t --timeout : SNMP query timeout (10)
-f --cache-file : File basename for SNMP cache (/tmp/f5.cache)
-x --cache-expire : Seconds before cache becomes invalid (290)
--no-perf : Disable perfstats in Nagios output
--no-optimise : Retrieve entire SNMP tree for tables (use in conjunction
with cache if many separate server() checks being done)
-m --modules : List modules to enable. Space separated. Can be used
multiple times if required. See below.
-v --verbose : verbose logging

MRTG mode
In MRTG mode, only the first module to provide a metric will be output.

Available modules:
cpu[:n][(warnpc,critpc)] MRTG: user and idle percent (80,90)
mem[ory][:percent][(warnpc,critpc)] MRTG: used and total bytes (or %) (80,90)
temp[erature][(warn,crit)]
fan No MRTG output
psu No MRTG output
health Same as 'cpu mem temp fan psu'
ssl[:server][(activewarn,activecrit)] MRTG: active and total SSL (1400,1000)
traffic[:server] MRTG: bytes in/out. No Nagios.
server:name[(actvw,actvc)] For virt server name (1400,1000)
server:ipaddr[:port][(actvw,actvc)] For virt server ipaddr:port
server[(actvw,actvc)] Over ALL virtual servers (140000,100000)
cert[ificate][:certname][([warn,]crit)] Check certificate days left, no MRTG
conn[ections][(actvw,actvc)] same as 'server'
group Failover health. No MRTG output.
cm Same as 'group'
cache No output; prepare SNMP cache

Server checks:
For server checks, the number of active connections will be thresholded.
Next, the availability of the Server will be checked and will return WARN if
not all the active pool members are available.

If the same module is used multiple times, only the first one will be used.

Examples:
/usr/local/nagios/libexec/check_f5.pl -H myf5 -C public -m 'cpu(80,90) server:/Production/foobar'
/usr/local/nagios/libexec/check_f5.pl -H myf5 -m 'health certificates(14,7) conn(1500,2000)'

Post by **tgriep** » Fri Jul 15, 2016 12:54 pm

You need to edit the services for the fxtplbdevp1 host and add the missing option

-C <community>

to all of the SNMP checks. That should resolve the timeouts you are seeing.

Post by **aap** » Fri Jul 15, 2016 1:02 pm

Done.

I will monitor and confirm. Thanks for looking into it.

Post by **tgriep** » Fri Jul 15, 2016 1:03 pm

You're welcome. Glad to help.

Post by **aap** » Mon Jul 18, 2016 2:41 am

Unfortunately, the issue persists.

I've had 57 false Critical alerts in the last 24 hours because of service check timeouts. I am not experiencing the same problem with any other services, which leads me to believe it may be the check script itself or the device.

Is there anything else we can check?

Post by **tgriep** » Mon Jul 18, 2016 10:09 am

It looks like the plugin has a default timeout of 10 seconds which might not be long enough in your environment.
To change this to a longer timeout, edit the command in Core Configuration Manager and add the following to it.

-t 60

That will increase the timeout and hopefully fix the issue.

Post by **rkennedy** » Mon Jul 18, 2016 10:13 am

Is the load high on the F5 device by any chance? I wonder if it's just getting overloaded. Are you using this check_f5 plugin for any other load balancers? Just trying to narrow it down to see if we can figure out the root cause.

What value are you currently using for the timeout? I've seen SNMP take 2-3 minutes to respond before. Since this sounds isolated to this device, I wonder if it would work if you increased it to -t 120 or -t 180.
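To help pick a timeout, it may be worth measuring how long the check actually takes from the command line. Below is a rough sketch, not a tested fix: `time_check` is a hypothetical helper name, and the host and community string in the usage comment are placeholders taken from earlier in this thread.

```shell
# Hypothetical helper: run any command, discard its output, and report
# its exit status and elapsed wall-clock time in whole seconds.
time_check() {
    tc_start=$(date +%s)
    tc_status=0
    "$@" >/dev/null 2>&1 || tc_status=$?
    tc_end=$(date +%s)
    echo "exit=${tc_status} elapsed=$((tc_end - tc_start))s"
}

# Usage (placeholders: substitute your own host and community string):
# time_check /usr/local/nagios/libexec/check_f5.pl -H 10.127.220.150 -m health -t 60 -C <Communityname>
```

Running it a few times during the periods when the alerts fire should show whether the check is genuinely slow or failing for another reason.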

Post by **aap** » Tue Jul 19, 2016 4:50 am

Unfortunately, the highest timeout the script accepts is 60s. See the output below:

COMMAND: /usr/local/nagios/libexec/check_f5.pl -H 10.xxx.xxx.xxx -m health -t 180 -C public
OUTPUT: SNMP Error: The timeout value 180 is out of range (1..60)
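If the timeout is being passed in from elsewhere (a macro or a wrapper script, for example), one way to avoid this error is to clamp the value into the 1..60 range reported above before calling the plugin. This is only a sketch: `clamp_timeout` is a hypothetical helper, it assumes an integer argument, and it does not make the device answer any faster.

```shell
# Hypothetical helper: force a requested timeout (integer seconds)
# into the 1..60 range that the error message above reports.
clamp_timeout() {
    ct_value=$1
    if [ "$ct_value" -lt 1 ]; then
        ct_value=1
    elif [ "$ct_value" -gt 60 ]; then
        ct_value=60
    fi
    echo "$ct_value"
}

# Usage (host and community string are placeholders):
# /usr/local/nagios/libexec/check_f5.pl -H 10.xxx.xxx.xxx -m health -t "$(clamp_timeout 180)" -C public
```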
