Hello, all!
We had an interesting thing happen with Nagios recently. We have Nagios Core 3.2.3 monitoring a number of Linux servers, a mix of virtual and physical machines spread across two separate datacenters.
After doing some maintenance on the server, Nagios suddenly reported critically high load values for all the servers it tracks: physical and virtual, in both datacenters. We checked the servers themselves and found no load problems. The reported loads came back down as Nagios re-polled the servers and replaced the bogus values with real ones.
What would cause Nagios to show all these values? Since they clearly did not exist on the actual machines, I can only conclude that the maintenance in some way affected Nagios. However, the maintenance was in relation to a completely different user and application.
Any ideas on what would cause Nagios to panic like this?
Thanks to all!
Nagios reported false load values
Re: Nagios reported false load values
How are you monitoring the load? Passive checks? Active checks? NRPE? Something else?
I'm guessing that you looked at load with the "uptime" command, which shows load averaged over time (1, 5, and 15 minutes). Nagios's check_load also checks load averages. Depending on how you are actually obtaining the load, the instantaneous load may be quite a bit higher than the average: a short-term spike in CPU usage shows up in tools that look at current load rather than load averaged over time.
This may or may not have been the result of your maintenance. It's very hard to tell without knowing what was done and how you're monitoring.
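To make the load-average point concrete, here is a rough Python sketch of the kind of comparison a load check performs: take the 1-, 5-, and 15-minute averages and compare each against its own warning and critical threshold. This is not check_load's actual source. The function names, the sample line, and the thresholds (5,4,3 and 10,8,6) are made up for illustration; only the /proc/loadavg field layout and the 0/1/2 plugin exit codes are standard.

```python
from typing import Sequence, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_loadavg(line: str) -> Tuple[float, float, float]:
    """Return the 1-, 5-, and 15-minute averages from a /proc/loadavg-style line."""
    fields = line.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def evaluate_load(loads: Sequence[float],
                  warn: Sequence[float],
                  crit: Sequence[float]) -> int:
    """Compare each average with its own threshold; the worst state wins.

    Hypothetical helper: the exact comparison is an assumption, not a copy
    of what check_load does internally.
    """
    if any(load > limit for load, limit in zip(loads, crit)):
        return CRITICAL
    if any(load > limit for load, limit in zip(loads, warn)):
        return WARNING
    return OK


def read_local_loadavg() -> Tuple[float, float, float]:
    """Read the averages on a Linux host (not called in the demo below)."""
    with open("/proc/loadavg") as handle:
        return parse_loadavg(handle.read())


# Made-up sample line in /proc/loadavg format, with illustrative thresholds.
sample = "0.42 0.38 0.35 1/123 4567"
state = evaluate_load(parse_loadavg(sample), warn=(5, 4, 3), crit=(10, 8, 6))
print(state)
```

Because all three inputs are averages, a spike lasting a few seconds barely moves them, which is why a tool sampling current CPU usage can disagree with both "uptime" and check_load.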
Re: Nagios reported false load values
Hello, Eric!
I'm just using the plain-vanilla check_load command in Nagios. I checked the monitored servers themselves with top, so I did get an average load. Most loads were 0.XX, and a couple were 1.0X, on the actual servers.
Re: Nagios reported false load values
Are you checking with passive checks or some sort of active check (NRPE, check_by_ssh, something else)?
And again, without knowing the nature of the maintenance or what your load results were, it's very hard to offer any advice.
Regardless, I'm guessing that what you saw was a very transient result of your maintenance, and if it "ain't broke" now, it's not going to be easy to fix, either.