threshold values for CPU Load Check

Post by vlakshman » Thu Dec 20, 2018 8:08 am

Team,

I am trying to set warning and critical CPU load thresholds for my Nagios server (which currently polls about 1000 services at a 1-minute interval) using the check_load Nagios plugin.

Threshold Calculation Formula:
y = c * p /100
where y --> Nagios Value, c --> Number of CPU cores, p --> Percent Threshold limit expected

My Nagios server has 8 processors with 8 CPU cores each (64 cores in total), and hyper-threading is enabled.
For WARN_Load = 0.8, 0.8, 0.75 and CRIT_Load = 0.9, 0.9, 0.85 (that is, p = 80, 80, 75 and p = 90, 90, 85), I calculated the load limits as -w 51.2,51.2,48 -c 57.6,57.6,54.4
But the load check still goes Critical!
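
The calculation above can be sketched in Python as follows. This is only an illustration of the formula y = c * p / 100; the function names load_threshold and check_load_args are hypothetical, and the 64 cores come from the 8 processors with 8 cores each described in the post (hyper-threading is not counted here).

```python
def load_threshold(cores: int, percent: float) -> float:
    """Return the load-average threshold y = c * p / 100."""
    return round(cores * percent / 100, 2)


def check_load_args(cores, warn_percents, crit_percents):
    """Build -w/-c arguments as comma-separated triplets (1, 5, 15 min)."""
    warn = ",".join(f"{load_threshold(cores, p):g}" for p in warn_percents)
    crit = ",".join(f"{load_threshold(cores, p):g}" for p in crit_percents)
    return f"-w {warn} -c {crit}"


if __name__ == "__main__":
    # 64 cores, 80/80/75 percent warning, 90/90/85 percent critical
    print(check_load_args(64, (80, 80, 75), (90, 90, 85)))
    # prints: -w 51.2,51.2,48 -c 57.6,57.6,54.4
```

The output matches the limits calculated by hand, so the arithmetic itself is not what makes the check go Critical.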

Any thoughts on how to handle would be highly appreciated!
vlakshman
 

Re: threshold values for CPU Load Check

Post by bolson » Thu Dec 20, 2018 10:41 am

Hello vlakshman,

It appears that you are entering the warning and critical thresholds as decimal numbers, but the check command is expecting an integer.

That is, you are passing 0.9 instead of 90. Try running the service check with integer percent values and see if you get the expected result.

Thank you for visiting the Nagios Support Forum.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Brian Olson
Nagios Support Team
---
Nagios Enterprises, LLC
Email: customersupport@nagios.com
Web: https://www.nagios.com/

Re: threshold values for CPU Load Check

Post by vlakshman » Mon Dec 24, 2018 9:21 am

Hi bolson,

Thanks for your feedback.

I am using the check_load Nagios plugin to check CPU load.
According to the manual page, it appears to support both integer and decimal values (I get results when specifying either an integer or a float value):
https://nagios-plugins.org/doc/man/check_load.html

Following the link below, I understand the following:

https://support.nagios.com/kb/article/load-checks-771.html

1) If we are checking CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.
2) If we want to set a threshold for the entire server's CPU load, then the threshold can fall between 0 and infinity.

I am using a c5.xlarge EC2 instance, which has 8 GB of memory and hyper-threaded vCPUs.
https://www.ec2instances.info/?filter=c5&selected=c5.xlarge

(sudo cat /proc/cpuinfo shows processors 0, 1, 2 and 3, each with 2 CPU cores)

check_load plugin vs uptime result mismatch:
The following thresholds are set for 90% WARN (1, 5 and 15 min) and 95% CRITICAL (1, 5 and 15 min).

/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95

Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;

uptime

Output: 14:10:06 up 2 days, 23:03, 1 user, load average: 5.98, 5.67, 5.65

Questions:

1) The check_load output doesn't match the uptime result.
2) Am I setting the right thresholds?
3) What do the 4 values shown for each of load1, load5 and load15 mean? Are they 4 samples of each average?
vlakshman
 

Re: threshold values for CPU Load Check

Post by bolson » Wed Dec 26, 2018 5:06 pm

Hello vlakshman,

To your question 1, I would suggest that your check_load and uptime load averages match as closely as one would expect when the two commands aren't run at precisely the same time. As is the case in your example, if the 1-minute average is down, the 5- and 15-minute averages will also be down, but by a smaller amount.

To question 2, the "correct" thresholds are based on what you expect the load averages to be on your host. This can best be determined by comparing the load average to a CPU utilization check that runs at a frequent interval, e.g. every minute. Additionally, there is a wealth of information on Linux load averages on the internet. I've included a link to my favorite document on the subject.

https://www.teamquest.com/import/pdfs/whitepaper/ldavg1.pdf

To question 3, the four values for each average are 1) the value returned by the check, 2) the warning threshold, 3) the critical threshold, and 4) the minimum possible value (0 for a load average), which can be ignored here.
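
As an illustration of how those semicolon-separated fields line up, here is a minimal Python sketch that parses the output posted earlier in the thread. It assumes the usual Nagios performance data layout of label=value;warn;crit;min;max, and it is deliberately simplified: it expects no units on the values and no spaces in the labels. The function name parse_perfdata is hypothetical.

```python
def parse_perfdata(output: str) -> dict:
    """Split plugin output at '|' and parse each label=value;warn;crit;min entry."""
    _, _, perf = output.partition("|")
    fields = ("value", "warn", "crit", "min", "max")
    parsed = {}
    for item in perf.split():
        label, _, data = item.partition("=")
        parts = data.split(";")
        # Empty trailing fields (such as a missing max) are skipped.
        parsed[label] = {
            name: float(part) for name, part in zip(fields, parts) if part != ""
        }
    return parsed


if __name__ == "__main__":
    sample = (
        "OK - load average: 6.06, 5.68, 5.66|"
        "load1=6.060;90.000;95.000;0; "
        "load5=5.680;90.000;95.000;0; "
        "load15=5.660;90.000;95.000;0;"
    )
    for label, values in parse_perfdata(sample).items():
        print(label, values)
```

For load1 this yields a value of 6.06, a warning threshold of 90, a critical threshold of 95 and a minimum of 0.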

Let us know if this answers your questions on this topic. Thank you!

Re: threshold values for CPU Load Check

Post by npolovenko » Wed Dec 26, 2018 5:10 pm

@vlakshman,
1) The check_load output doesn't match the uptime result.

These outputs look almost identical:
/usr/lib64/nagios/plugins/check_load -w 90,90,90 -c 95,95,95
OK - load average: 6.06, 5.68, 5.66

uptime
load average: 5.98, 5.67, 5.65


1) If we are checking CPU load per CPU core (in a multi-core environment), the threshold will fall between 0 and 1.

1. Ideally, yes. For a single-core CPU, a load of 1 means the core is running at full capacity. But technically the load can go over 1, which means the core is overloaded.
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages


2) If we want to set a threshold for the entire server's CPU load, then the threshold can fall between 0 and infinity.

2. Correct. But normally you should calculate the threshold from the number of CPUs on the server. For example, with 4 CPUs a threshold of 4 means that all cores are working at full capacity. You can set the threshold higher than 4, but the server would already be overloaded at that point.

2) Am I setting the right threshold?
-w 90,90,90 -c 95,95,95

For an 8-core CPU I'd use -w 7,7,7 -c 8,8,8
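
To make the per-core reasoning concrete, here is a small Python sketch that divides load averages by the core count and classifies them against whole-server thresholds like the ones suggested above. The function names are hypothetical, and the strict "greater than" comparison is an assumption about how such a check behaves, not a statement about the plugin's internals. The check_load manual page also lists a -r/--percpu option that divides the load averages by the number of CPUs; this sketch only illustrates the arithmetic.

```python
import os


def per_core_load(load_averages, cores):
    """Divide each load average by the number of cores."""
    return tuple(round(load / cores, 2) for load in load_averages)


def classify(load_averages, warn, crit):
    """Return OK, WARNING or CRITICAL (assumes a strict > comparison)."""
    state = "OK"
    for load, w, c in zip(load_averages, warn, crit):
        if load > c:
            return "CRITICAL"
        if load > w:
            state = "WARNING"
    return state


if __name__ == "__main__":
    # Values posted in the thread, on an assumed 8-core host
    posted = (6.06, 5.68, 5.66)
    print(per_core_load(posted, 8))
    print(classify(posted, (7, 7, 7), (8, 8, 8)))

    # Live values from the local machine (Unix only)
    cores = os.cpu_count() or 1
    print(per_core_load(os.getloadavg(), cores))
```

With the posted values, each core is at roughly 0.71 to 0.76 of its capacity, which is below both the 7 and 8 thresholds.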

3) What do the 4 values shown for each of load1, load5 and load15 mean? Are they 4 samples of each average?
In this output:
Output: OK - load average: 6.06, 5.68, 5.66|load1=6.060;90.000;95.000;0; load5=5.680;90.000;95.000;0; load15=5.660;90.000;95.000;0;

Everything after the | sign is performance data that Nagios uses internally to build performance graphs. Once you import this check into Nagios XI, the values after the "|" will no longer be shown.
npolovenko