
CPU_load

Posted: Mon Jan 15, 2024 11:33 am
by kabamaru
Hello

I'm puzzled by the following.

I'm trying to monitor the CPU load of a server and get a WARNING when the average CPU load is over 80% and a CRITICAL when it's over 90%. To do this, I added the following line to the host's nrpe.cfg:

command[check_load]=/usr/local/nagios/libexec/check_load -r -w 38.4,38.4,38.4 -c 43.2,43.2,43.2

I got these values using the formula y = c * p / 100, where c is the number of CPUs and p is the desired threshold percentage.
The server has 48 CPUs, so for 80% I would use 48*(8/10)=38.4, and for 90%, 48*(9/10)=43.2.
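As a quick cross-check of that arithmetic, here is a small Python sketch (the function names are made up for illustration) that applies y = c * p / 100 for 48 CPUs and builds the -w/-c arguments, repeating each value for the 1, 5 and 15 minute averages as in the nrpe.cfg line above:

```python
# Sketch: derive check_load -w/-c values from a CPU count and target percentages
# using y = c * p / 100. Function names are illustrative, not part of any plugin.

def load_threshold(cpus: int, percent: float) -> float:
    """Raw load-average threshold corresponding to `percent` % of `cpus` CPUs."""
    return cpus * percent / 100


def check_load_args(cpus: int, warn_pct: float, crit_pct: float) -> str:
    """Build the -w/-c arguments, one value each for the 1, 5 and 15 minute averages."""
    w = load_threshold(cpus, warn_pct)
    c = load_threshold(cpus, crit_pct)
    warn = ",".join([f"{w:g}"] * 3)
    crit = ",".join([f"{c:g}"] * 3)
    return f"-w {warn} -c {crit}"


if __name__ == "__main__":
    print(check_load_args(48, 80, 90))  # prints -w 38.4,38.4,38.4 -c 43.2,43.2,43.2
```

The printed arguments match the ones in the nrpe.cfg line, so the formula itself isn't the source of the discrepancy.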

Nagios is reporting this:

CPU OK 01-15-2024 16:19:27 0d 1h 7m 11s 1/3 OK - load average per CPU: 1.02, 1.02, 1.02

But when I run uptime on the server, I see load average: 49.65, 49.12, 48.91

I'm confused; there's something I don't fully understand here. Shouldn't the load average match the one displayed by uptime?

Many thanks for your help.

Re: CPU_load

Posted: Thu Jan 18, 2024 11:36 am
by jsimon
Hi @kabamaru,

Looking at the documentation for the check_load plugin, it looks like the "-r" parameter you are using tells the command to "Divide the load averages by the number of CPUs (when possible)". Given that you have 48 CPUs, is it possible that this accounts for the discrepancy?
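To see how that would explain the numbers, here is a rough Python sketch that divides the uptime figures from the first post by 48, which is what -r is documented to do. The per-CPU threshold helper reflects an assumption worth checking against the plugin's help output: that with -r the thresholds are compared to the divided values, so 80%/90% would be written as 0.8 and 0.9 rather than 38.4 and 43.2.

```python
# Rough illustration of the -r (per-CPU) scaling described in the check_load docs.
# Load values are the uptime figures from the first post; names are illustrative.

def per_cpu(loads, cpus):
    """Divide each of the 1, 5 and 15 minute load averages by the CPU count."""
    return [load / cpus for load in loads]


def per_cpu_threshold(percent):
    """Assumed per-CPU threshold for a target percentage when -r is used."""
    return percent / 100


if __name__ == "__main__":
    scaled = per_cpu([49.65, 49.12, 48.91], 48)
    print(", ".join(f"{v:.2f}" for v in scaled))  # prints 1.03, 1.02, 1.02
    # Per-CPU values around 1.0 never reach thresholds sized for raw load.
    print(any(v > 38.4 for v in scaled))  # prints False
    print(per_cpu_threshold(80), per_cpu_threshold(90))  # prints 0.8 0.9
```

The scaled values land right next to the 1.02, 1.02, 1.02 that Nagios reported, which is consistent with -r being the cause.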

Re: CPU_load

Posted: Wed Feb 07, 2024 10:49 am
by kabamaru
Hi jsimon

Thank you for your time. You are right.
I removed -r, and now the values Nagios reports are very close to the ones uptime outputs.
They can't be exactly the same because the load is constantly changing.
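With -r removed, the 38.4/43.2 thresholds are compared against the raw load averages. The Python sketch below uses a simplified model of the check (an assumption, not check_load's actual source: a value strictly above a threshold triggers that state, and CRITICAL outranks WARNING) and applies it to the uptime figures from the first post, plus whatever os.getloadavg() reports on the machine running it:

```python
import os

# Simplified model of a load check (assumption, not check_load's actual code):
# any average above its critical threshold -> CRITICAL,
# otherwise any average above its warning threshold -> WARNING.
WARN = (38.4, 38.4, 38.4)
CRIT = (43.2, 43.2, 43.2)


def load_state(loads, warn=WARN, crit=CRIT):
    """Return OK, WARNING or CRITICAL for the 1, 5 and 15 minute averages."""
    state = "OK"
    for load, w, c in zip(loads, warn, crit):
        if load > c:
            return "CRITICAL"
        if load > w:
            state = "WARNING"
    return state


if __name__ == "__main__":
    print(load_state([49.65, 49.12, 48.91]))  # prints CRITICAL
    try:
        # os.getloadavg() returns the 1, 5 and 15 minute averages uptime shows (Unix only).
        print(load_state(os.getloadavg()))
    except (AttributeError, OSError):
        print("load averages not available on this platform")
```

If that model holds, a load like the one in the first post would now put the check in CRITICAL, which is what the 43.2 threshold was chosen to catch.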

Thank you so much for your help.
All the best

Re: CPU_load

Posted: Mon Feb 12, 2024 12:01 pm
by jsimon
I'm glad we were able to figure that out for you! I'll go ahead and lock the issue then.