threshold values for CPU Load Check

Posted: Thu Dec 20, 2018 8:08 am
by vlakshman
Team,

I am trying to set warning and critical CPU load thresholds for a Nagios server (which currently polls about 1000 services at a 1-minute interval) using the check_load Nagios plugin.

Threshold Calculation Formula:
y = c * p / 100
where y is the Nagios threshold value, c is the number of CPU cores, and p is the desired threshold in percent.

My Nagios server has 8 processors, each with 8 CPU cores, and hyper-threading is enabled.
For WARN_Load = 0.8, 0.8, 0.75 and CRIT_Load = 0.9, 0.9, 0.85 I calculated the load limits as -w 51.2, 51.2, 48 -c 57.6, 57.6, 54.4
But the load check still goes critical!

Any thoughts on how to handle this would be highly appreciated!
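
The arithmetic in the formula above can be reproduced with a short calculation. This is only a sketch: the helper name load_threshold is hypothetical, and the core count assumes 8 processors with 8 cores each, without counting hyper-threads as extra cores.

```python
def load_threshold(cores: int, percent: float) -> float:
    """Return y = c * p / 100, rounded to one decimal place."""
    return round(cores * percent / 100, 1)


# Assumption: 8 processors x 8 cores each; hyper-threads are not counted.
CORES = 8 * 8

warn = [load_threshold(CORES, p) for p in (80, 80, 75)]
crit = [load_threshold(CORES, p) for p in (90, 90, 85)]

# Join the 1, 5 and 15 minute values with commas, as in the plugin's usage line.
print("-w", ",".join(str(v) for v in warn), "-c", ",".join(str(v) for v in crit))
```

If the operating system reports hyper-threads as separate CPUs, the logical CPU count would be 128 rather than 64, and the same formula would give thresholds twice as large.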

Re: threshold values for CPU Load Check

Posted: Thu Dec 20, 2018 10:41 am
by bolson
Hello vlakshman,

It appears that you are entering the warning and critical thresholds as decimal numbers, but the check command is expecting integers.

That is, you entered 0.9 where 90 is expected. Try running the service check with integer percent values and see if you get the expected result.

Thank you for visiting the Nagios Support Forum.

Re: threshold values for CPU Load Check

Posted: Mon Dec 24, 2018 9:21 am
by vlakshman
Hi bolson,

Thanks for your feedback.

I am using the check_load Nagios plugin to check CPU load.
The manual page suggests that it supports both integer and decimal values (I get results when specifying either an integer or a float):
https://nagios-plugins.org/doc/man/check_load.html

From the article linked below, I understand the following:

https://support.nagios.com/kb/article/l ... s-771.html

1) If we are checking the CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.
2) If we want to set a threshold for the entire server's CPU load, the threshold can fall anywhere between 0 and infinity.

I am using a c5.xlarge EC2 instance, which has 8 GB of memory and 4 vCPUs with hyper-threading enabled.
https://www.ec2instances.info/?filter=c ... =c5.xlarge

(sudo cat /proc/cpuinfo shows processors 0, 1, 2 and 3, each reporting 2 CPU cores)

Check_load plugin vs uptime result mismatch:
The thresholds below are set to 90% for WARN and 95% for CRITICAL (1, 5 and 15 min).

Code:

/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;

Code:

uptime
Output: 14:10:06 up 2 days, 23:03, 1 user, load average: 5.98, 5.67, 5.65

Questions:

1) The check_load output doesn't match the uptime result.
2) Am I setting the right thresholds?
3) What do the 4 values shown for each of load1, load5 and load15 mean? Are they 4 samples of each average?

Re: threshold values for CPU Load Check

Posted: Wed Dec 26, 2018 5:06 pm
by bolson
Hello vlakshman,

To your question 1, I would suggest that your check_load and uptime results match as closely as one would expect when the checks aren't performed at precisely the same time. And as is the case in your example, if the 1-minute average is down, the 5- and 15-minute averages would also be down, but by a smaller amount.
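
To put numbers on how close the two results are, the load averages can be pulled out of both output lines and subtracted. This is a rough sketch; the helper name parse_load_averages is hypothetical, and the two input strings are copied from the outputs posted above.

```python
import re


def parse_load_averages(text):
    """Extract the three numbers that follow 'load average:' in a line of output."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", text)
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(group) for group in match.groups())


plugin = parse_load_averages("OK - load average: 6.06, 5.68, 5.66")
uptime = parse_load_averages(
    "14:10:06 up 2 days, 23:03, 1 user, load average: 5.98, 5.67, 5.65"
)

# Difference per window (1, 5 and 15 minutes), rounded to two decimals.
diffs = [round(p - u, 2) for p, u in zip(plugin, uptime)]
print(diffs)
```

The 1-minute value differs the most and the 5- and 15-minute values barely move, which fits two readings taken a few moments apart.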

To question 2, the "correct" thresholds are based on what you expect the load averages to be on your host. This can best be determined by comparing the load average to a CPU utilization check run at a frequent interval, e.g. every minute. Additionally, there is a wealth of information on Linux load average on the internet. I've included a link to my favorite document on the subject.

https://www.teamquest.com/import/pdfs/w ... ldavg1.pdf

To question 3, the four values for each average are 1) the value returned by the check, 2) the warning threshold, 3) the critical threshold, and 4) the minimum possible value (0), which can be ignored here.

Let us know if this answers your questions on this topic. Thank you!

Re: threshold values for CPU Load Check

Posted: Wed Dec 26, 2018 5:10 pm
by npolovenko
@vlakshman,
1) check_load doesn't match with uptime result
These outputs look almost identical:
/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
OK - load average: 6.06, 5.68, 5.66
uptime
load average: 5.98, 5.67, 5.65
1) If we are checking the CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.
1. Ideally, yes, because for a single-core CPU a load of 1 means it is running at full capacity. Technically, though, the load can go over 1, which means the core is overloaded.
http://blog.scoutapp.com/articles/2009/ ... d-averages

If we want to set a threshold for the entire server's CPU load, the threshold can fall anywhere between 0 and infinity.
2. Correct. But normally you should calculate the threshold from the number of CPUs on the server. For example, with 4 CPUs a load of 4 would mean that all cores are working at full capacity. You can set the threshold higher than 4, but the server would already be overloaded at that point.
2) Am I setting the right threshold?
-w 90,90,90 -c 95,95,95
For an 8-core CPU I'd use -w 7,7,7 -c 8,8,8
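
To see how the different threshold choices in this thread would classify the posted load averages, here is a simplified model. It is not the plugin's actual code: the function name evaluate is hypothetical, and the assumption is that a state is raised when any of the three averages exceeds its threshold.

```python
def evaluate(loads, warn, crit):
    """Return 'CRITICAL', 'WARNING' or 'OK' for the 1, 5 and 15 minute averages."""
    # Assumption: any single average above its threshold raises that state.
    if any(load > limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


loads = (6.06, 5.68, 5.66)  # values from the check_load output above

# Thresholds entered as percentages: a load of 90 is never reached here.
print(evaluate(loads, warn=(90, 90, 90), crit=(95, 95, 95)))

# Thresholds suggested for an 8-core CPU.
print(evaluate(loads, warn=(7, 7, 7), crit=(8, 8, 8)))

# 90% and 95% of 4 CPUs using y = c * p / 100, i.e. 3.6 and 3.8.
print(evaluate(loads, warn=(3.6, 3.6, 3.6), crit=(3.8, 3.8, 3.8)))
```

Per its manual page, check_load also has a -r (--percpu) option that divides the load averages by the number of CPUs, which makes per-core thresholds in the 0 to 1 range usable on a multi-core host.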

3) What do the 4 values shown for each of load1, load5 and load15 mean? Are they 4 samples of each average?
In this output:
Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;
Everything after the | sign is performance data, used internally by Nagios to build performance graphs. Once you import this check into Nagios XI, you will no longer see the values after the "|".
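
The performance data after the | can be split into named fields. The sketch below assumes the usual Nagios plugin layout of label=value;warn;crit;min;max and that the values carry no unit suffix, as in the output above; the function name parse_perfdata is hypothetical.

```python
def parse_perfdata(perfdata):
    """Split performance data into {label: {field name: number}}."""
    fields = ("value", "warn", "crit", "min", "max")
    result = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        parts = rest.split(";")
        # Empty trailing fields (such as a missing max) are skipped.
        result[label] = {
            name: float(part) for name, part in zip(fields, parts) if part != ""
        }
    return result


output = (
    "OK - load average: 6.06, 5.68, 5.66|"
    "load1=6.060;90.000;95.000;0; "
    "load5=5.680;90.000;95.000;0; "
    "load15=5.660;90.000;95.000;0;"
)
status_text, _, perfdata = output.partition("|")
metrics = parse_perfdata(perfdata)
print(metrics["load1"])
```

Read this way, the four numbers for each average are the measured value, the warning threshold, the critical threshold and the minimum.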