Hi All,
Currently I have the check_load alerts set up to notify us when the load hits around 75% of the total cores, counting both CPUs and hyperthreads (since not all hyperthreads are actually a true core). We have hired a small MSP to do some of our DBA work, and of course they don't want to look at each alert, even though I mentioned to them that we should look for a series of alerts on the same server.
I believe check_load will alert on any of the 1-, 5-, or 15-minute averages (correct me if I am wrong on this). I really want to be notified after 15 minutes if the load exceeds 75% of the total cores. So how would I set this up? If I were to set up a check for 20 minutes, the issue may already have cleared. What is the best solution to this?
Thanks
Check_load
Re: Check_load
Which version of check_load are you using?
Here's the man page for the check_load plugin:
http://nagios-plugins.org/doc/man/check_load.html
Code:
-w, --warning=WLOAD1,WLOAD5,WLOAD15
    Exit with WARNING status if load average exceeds WLOADn
-c, --critical=CLOAD1,CLOAD5,CLOAD15
    Exit with CRITICAL status if load average exceed CLOADn
    the load average format is the same used by "uptime" and "w"
-r, --percpu
    Divide the load averages by the number of CPUs (when possible)
npolite wrote: I really want to be notified after 15 minutes if the load exceeds 75% of the total cores. So how would I set this up?
check_load is not a good fit for this use case, then. It requires you to define thresholds for the 1-, 5-, and 15-minute averages, so you'd still need to have some sort of behavior defined for the 1- and 5-minute averages even if all you really care about is the 15-minute average.
Though as a hacky work-around, I think you can pass "impossible" load levels to check_load for the 1- and 5-minute averages if you want to ignore them for alerting purposes:
Code:
# ./check_load -w 99,99,1.2 -c 99,99,1.5
OK - load average: 0.03, 0.06, 0.05|load1=0.030;99.000;99.000;0; load5=0.060;99.000;99.000;0; load15=0.050;0.600;0.750;0;
Where 2.0 is 100% CPU usage across the 2 cores on this machine, the 1- and 5-minute averages will realistically never reach 99.
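If you'd rather not hard-code the 15-minute thresholds per host, something like the following might work as a rough sketch. It scales a percentage by the core count and fills the 1- and 5-minute slots with the same 99 placeholder used above. The helper names `scale_load` and `build_args` are made up for illustration, and the 60%/75% split is just an assumption matching the 1.2/1.5 values on a 2-core box.

```shell
#!/bin/sh
# Sketch only: build check_load arguments that effectively alert on the
# 15-minute average alone. scale_load and build_args are hypothetical helpers,
# not part of the plugin.

# scale_load CORES PERCENT -> prints CORES * PERCENT / 100 with two decimals
scale_load() {
    awk -v c="$1" -v p="$2" 'BEGIN { printf "%.2f", c * p / 100 }'
}

# build_args CORES WARN_PCT CRIT_PCT -> prints the -w/-c arguments,
# using 99 as the "never reached" value for the 1- and 5-minute averages
build_args() {
    warn=$(scale_load "$1" "$2")
    crit=$(scale_load "$1" "$3")
    printf '%s\n' "-w 99,99,$warn -c 99,99,$crit"
}

# Example for a 2-core machine: warn at 60%, critical at 75% of the cores.
build_args 2 60 75
# prints: -w 99,99,1.20 -c 99,99,1.50
```

On a Linux host with coreutils you could presumably feed in the core count with `./check_load $(build_args "$(nproc)" 60 75)`. The `-r` option from the help text above may be another route, since it divides the load averages by the number of CPUs so the thresholds can be written as per-core fractions, but I haven't verified how it behaves on every plugin version.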
Former Nagios employee
https://www.mcapra.com/