Request Advice on monitoring CPU load
Posted: Sun Jan 18, 2015 6:16 pm
Hello!
I recently started using Nagios Core (management won't fund XI, at least not yet) and will probably be asking a lot of advice questions over the next few days. I apologize in advance and will do my best to search for answers beforehand. So far it works pretty well, but there are a few details I am still working out.
I am using the cpu_load check via NRPE on many servers. What I don't understand is how to work with the 1-, 5-, and 15-minute load averages all at the same time. I feel like I only need the 1-minute average, but I also suspect the other averages would not be part of the check without good reason.
As I see it, either the 1-minute average is above the 15-minute average, in which case it will go critical anyway, or it is below the 15-minute average, in which case the load is falling and I don't need it to be critical. For now I have set the 5- and 15-minute thresholds to 999 so that only the 1-minute average determines the status.
Even though this seems to work for my needs, I fear I may be missing something by circumventing the 5- and 15-minute readings. Can anyone offer advice on this? Are there situations where this would get me in trouble? I am curious how most people use this check. Without my workaround I sometimes get a notification while the load is already dropping, but if I require more check attempts before notification I worry it could be too late in some situations.
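To make the question concrete, here is a rough Python sketch of how I understand the check to evaluate its thresholds. It assumes the three values are the usual 1-, 5-, and 15-minute load averages and that the check reports the worst state any of them reaches. The function name `load_status` and all threshold numbers are made up for illustration; this is not the plugin's actual code.

```python
# Nagios-style return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def load_status(loads, warn, crit):
    """Return the worst state across the (1, 5, 15)-minute load averages.

    Assumption: a value must exceed its threshold to trip it.
    """
    status = OK
    for load, w, c in zip(loads, warn, crit):
        if load > c:
            return CRITICAL
        if load > w:
            status = WARNING
    return status


# Hypothetical thresholds with all three averages in play.
full_warn, full_crit = (4.0, 3.0, 2.0), (8.0, 6.0, 4.0)

# My workaround: 999 on the 5- and 15-minute slots, so only the
# 1-minute average can ever change the state.
one_warn, one_crit = (4.0, 999.0, 999.0), (8.0, 999.0, 999.0)

# Load is falling after a long overload: 1-minute is fine, 15-minute is not.
falling = (3.5, 6.0, 9.0)

print(load_status(falling, full_warn, full_crit))  # 2 (CRITICAL)
print(load_status(falling, one_warn, one_crit))    # 0 (OK)
```

The first call is the notification I get while the load is already dropping. The second is what my 999 workaround turns it into, and that gap is what I am unsure about.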
Thanks for any advice!