Check_load

npolite
Posts: 7
Joined: Fri Sep 02, 2016 9:39 am

Post by npolite »

Hi All,

I have set up check_load alerts to fire when the load reaches about 75% of the total CPU/HT core count (since a hyperthread is not a true core). We have hired a small MSP to do some of our DBA work, and of course they don't want to look at every alert, even though I told them we should be watching for a series of alerts on the same server.

I believe check_load will alert on any of the 1-, 5-, or 15-minute averages (correct me if I'm wrong). What I really want is to be notified only once the load has exceeded 75% of the total cores for 15 minutes. How would I set this up? If I were to run the check every 20 minutes, the issue might already have cleared by the time it runs. What is the best solution?

Thanks
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Check_load

Post by mcapra »

Which version of check_load are you using?

Here's the man page for the check_load plugin:
http://nagios-plugins.org/doc/man/check_load.html

     -w, --warning=WLOAD1,WLOAD5,WLOAD15
        Exit with WARNING status if load average exceeds WLOADn
     -c, --critical=CLOAD1,CLOAD5,CLOAD15
        Exit with CRITICAL status if load average exceed CLOADn
        the load average format is the same used by "uptime" and "w"
     -r, --percpu
        Divide the load averages by the number of CPUs (when possible)

npolite wrote:I really want to be notified after 15 minutes if the load exceeds 75% of the total cores. So how would I set this up?

check_load is not a good fit for this use case, then. It requires thresholds for all three of the 1-, 5-, and 15-minute averages, so you still have to define some behavior for the 1- and 5-minute averages even if all you really care about is the 15-minute one.

Though as a hacky workaround, I think you can pass "impossible" load levels to check_load for the 1- and 5-minute averages if you want to ignore them for alerting purposes:

# ./check_load -w 99,99,1.2 -c 99,99,1.5
OK - load average: 0.03, 0.06, 0.05|load1=0.030;99.000;99.000;0; load5=0.060;99.000;99.000;0; load15=0.050;0.600;0.750;0;
Since a load of 2.0 corresponds to roughly 100% CPU usage across the 2 cores on this machine, the 1- and 5-minute averages should never realistically reach 99.
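
To adapt the workaround above to a machine with a different core count, the threshold strings could be generated rather than typed by hand. The following is only a sketch: load15_args is a hypothetical helper (not part of the plugin), the 60%/75% defaults are assumptions chosen to match the 1.2/1.5 values above, and 99 is just the "impossible" placeholder from the example.

```shell
#!/bin/sh
# Hypothetical helper: print check_load -w/-c arguments that only react
# to the 15-minute average. The 1- and 5-minute slots get the placeholder
# value 99 so that they are effectively ignored.
#   $1 = number of cores
#   $2 = warning percentage of total cores  (assumed default: 60)
#   $3 = critical percentage of total cores (assumed default: 75)
load15_args() {
    cores="$1"
    warn_pct="${2:-60}"
    crit_pct="${3:-75}"
    awk -v n="$cores" -v w="$warn_pct" -v c="$crit_pct" 'BEGIN {
        # threshold = cores * percentage / 100, printed with two decimals
        printf "-w 99,99,%.2f -c 99,99,%.2f\n", n * w / 100, n * c / 100
    }'
}

# Example for the 2-core machine discussed above
load15_args 2
```

For 2 cores this prints "-w 99,99,1.20 -c 99,99,1.50". On a system with GNU coreutils it could be combined with the plugin as ./check_load $(load15_args "$(nproc)"), keeping in mind that nproc counts hyperthreads as CPUs. Alternatively, since -r divides the load averages by the number of CPUs, the same intent could probably be expressed with per-core thresholds such as 0.60 and 0.75 together with -r.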
Former Nagios employee
https://www.mcapra.com/