Request Advice on monitoring CPU load

ocolin · Post by **ocolin** » Sun Jan 18, 2015 6:16 pm

Hello!

I have just started using Nagios Core (management won't fund XI, at least not yet) and will probably be asking a lot of questions over the next few days. I apologize in advance, and I will do my best to search for answers beforehand. So far it works pretty well, but there are a few details I am still working out.

I am using the cpu_load check via NRPE on many servers. What I don't understand is how to work with the 1-minute, 5-minute, and 15-minute averages all at the same time. I feel like I only need the shortest average, but I also feel that the other averages would not be part of the check without good reason.

As I see it, either the short average is above the 15-minute average, in which case the check goes critical anyway, or it is below the 15-minute average, in which case the load is going down and I don't need it to be considered critical. For now I have set the thresholds for the two longer averages to 999 so that only the shortest average determines the status.

Even though this seems to work for my needs, I fear I may be missing something by bypassing the two longer averages. Can anyone offer advice on this? Are there situations where this could get me in trouble? I am curious how most people handle this: without my workaround I sometimes get a notification while the load is already dropping, but if I require more check attempts before notifying, I worry the alert could come too late in some situations.
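To make the effect of the 999 workaround concrete, here is a minimal Python sketch of how a three-threshold load check might decide its status. It assumes the three values are the 1-, 5-, and 15-minute load averages (as the standard check_load plugin reports them) and that an average must be strictly greater than its threshold to trip it. The function name `load_status` and the threshold numbers 5 and 10 are made up for illustration; this is not the plugin's actual code.

```python
OK, WARNING, CRITICAL = 0, 1, 2

def load_status(averages, warn, crit):
    """Return the worst status across the three load averages.

    Assumption: an average trips a threshold only when it is strictly
    greater than it; the real plugin's boundary behavior may differ.
    """
    status = OK
    for value, w, c in zip(averages, warn, crit):
        if value > c:
            return CRITICAL
        if value > w:
            status = WARNING
    return status

# Hypothetical thresholds: only the first average has realistic limits;
# 999 effectively disables the checks on the other two.
warn = (5.0, 999.0, 999.0)
crit = (10.0, 999.0, 999.0)

# A sudden spike: the short average is high, the long ones have not caught up.
spike = load_status((12.0, 4.0, 2.0), warn, crit)

# Load dropping after a long busy period: the long averages are still high.
falling = load_status((4.0, 9.0, 9.0), warn, crit)

print(spike, falling)
```

With these numbers the spike is reported as critical, while the falling load is reported as OK even though the longer averages are still at 9. That is the intended effect of the workaround, and also the trade-off: a long busy period stops alerting as soon as the short average recovers.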

Thanks for any advice!

slansing · Post by **slansing** » Mon Jan 19, 2015 4:59 pm

Well, in truth it is likely quite a bit simpler than that. I believe the main reason those values are included in the check is that they are what the typical admin expects to see when looking at the system's load through a tool such as top. So, realistically, you can ignore whichever thresholds you do not feel you need to keep an eye on.

ocolin · Post by **ocolin** » Mon Jan 19, 2015 6:18 pm

That's kind of what I was thinking, but I worry that my inexperience may come back to bite me with things I don't know to look for. So far, ignoring the other two values hasn't been a problem, but I will need to test for longer to be sure. Thank you!

tmcdonald · Post by **tmcdonald** » Tue Jan 20, 2015 10:47 am

It also has to do with "averaging the averages". If you have a "10, 8, 6" critical load threshold for the 1-min, 5-min, and 15-min averages respectively, then a load of 9 for a minute or so may not pose a real problem, since backups might be running. However, that same load of 9 sustained for 5 or 15 minutes could indicate a problem.
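A rough way to see this in numbers is to simulate the three averages for a constant load of 9. The Python sketch below assumes the thresholds apply to the 1-, 5-, and 15-minute averages and models each average as an exponentially damped value updated every 5 seconds, which is roughly how Linux computes them. The function names are hypothetical and the model is simplified, so treat the output as illustrative only.

```python
import math

def simulate_load_averages(run_queue, seconds, step=5):
    """Approximate 1-, 5-, and 15-minute load averages after `seconds`
    of a constant run-queue length, starting from an idle system."""
    windows = (60, 300, 900)  # averaging windows in seconds
    averages = [0.0, 0.0, 0.0]
    for _ in range(0, seconds, step):
        for i, window in enumerate(windows):
            decay = math.exp(-step / window)
            averages[i] = averages[i] * decay + run_queue * (1.0 - decay)
    return tuple(averages)

# Critical thresholds from the example above: 10, 8, 6.
CRIT = (10.0, 8.0, 6.0)

def breached(averages, thresholds=CRIT):
    """Flag which of the three averages exceed their critical threshold."""
    return [avg > limit for avg, limit in zip(averages, thresholds)]

# A load of 9 that lasts 5 minutes versus one that lasts 20 minutes.
short_burst = breached(simulate_load_averages(9, 5 * 60))
long_run = breached(simulate_load_averages(9, 20 * 60))

print(short_burst)
print(long_run)
```

In this model the 5-minute burst trips none of the three thresholds, while the 20-minute run trips the 5- and 15-minute thresholds even though the 1-minute average never reaches 10. The lower thresholds on the longer averages are what catch a sustained load that the first threshold alone would miss.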

Nagios Support Forum
