Nagios reported false load values

Post by **gregg_hughes_ISC** » Mon Aug 11, 2014 10:16 am

Hello, all!

We had an interesting thing happen with Nagios recently. We run Nagios Core 3.2.3 to monitor a number of Linux servers, a mix of virtual and physical machines spread across two separate datacenters.

After doing some maintenance on the server, Nagios suddenly reported critically high load values for all the servers it tracks - physical and virtual, both datacenters. We checked the actual servers and there were no problems with server loads. The loads came back down as Nagios polled the servers and replaced the bogus loads with real ones.

What would cause Nagios to show all these values? Since they clearly did not exist on the actual machines, I can only conclude that the maintenance in some way affected Nagios. However, the maintenance involved a completely different user and application.

Any ideas on what would cause Nagios to panic like this?

Thanks to all!

Post by **eloyd** » Mon Aug 11, 2014 10:26 am

How are you monitoring the load? Passive checks? Active checks? NRPE? Something else?

I'm guessing that you looked at load with the "uptime" command. This shows average load over time. Nagios's check_load also checks load averages. Depending on how you are actually obtaining the load, the instantaneous load may be considerably higher than the average: a short-term spike in CPU usage that some tools will see because they look at current load rather than load averaged over time.

This may have been the result of your maintenance, it may not. It's very hard to tell without knowing what was done and how you're monitoring.
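To make the averaging point concrete, here is a minimal Python sketch of the kind of comparison a load check performs: it reads the 1, 5, and 15 minute load averages and compares each one against a warning and a critical threshold. The function names `parse_loadavg` and `classify`, the threshold values, and the use of `>=` for the comparison are all assumptions made for illustration; this is not the check_load source.

```python
import os


def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg-style line
    into (1-minute, 5-minute, 15-minute) load averages."""
    one, five, fifteen = (float(field) for field in text.split()[:3])
    return one, five, fifteen


def classify(loads, warn, crit):
    """Return 'CRITICAL', 'WARNING', or 'OK'.

    Each load average is compared with the threshold in the same
    position. The '>=' comparison is an assumption for this sketch.
    """
    if any(load >= limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    if any(load >= limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only for illustration.
    warn = (5.0, 4.0, 3.0)
    crit = (10.0, 6.0, 4.0)
    loads = os.getloadavg()  # Unix only: 1, 5, 15 minute averages
    print(loads, classify(loads, warn, crit))
```

Because the 1-minute figure reacts fastest, a brief spike can push only that value over its threshold while the 5 and 15 minute figures stay low, which is enough for this sketch to report a non-OK state.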

Post by **gregg_hughes_ISC** » Mon Aug 11, 2014 11:02 am

Hello, Eric!

I'm just using the plain-vanilla check_load command in Nagios. I checked against the monitored servers with top, so I did get an average load. The loads were 0.XX, with a couple at 1.0X, on the actual servers.

Post by **eloyd** » Mon Aug 11, 2014 11:34 am

Are you checking with passive checks or some sort of active check (NRPE, check_by_ssh, something else)?

And again, without knowing the nature of the maintenance or what your load results were, it's very hard to offer any advice.

Regardless, I'm guessing that what you saw was very transient as a result of your maintenance and "if it ain't broke" now, then it's not going to be easy to fix, either.

Nagios Support Forum
