Nagios reported false load values

gregg_hughes_ISC
Posts: 18
Joined: Fri Aug 08, 2014 3:03 pm

Post by gregg_hughes_ISC »

Hello, all!

We had an interesting thing happen with Nagios recently. We have Nagios Core 3.2.3 monitoring a number of Linux servers, a mix of virtual and physical machines spread across two separate datacenters.

After doing some maintenance on the server, Nagios suddenly reported critically high load values for all the servers it tracks - physical and virtual, both datacenters. We checked the actual servers and there were no problems with server loads. The loads came back down as Nagios polled the servers and replaced the bogus loads with real ones.

What would cause Nagios to show all these values? Since they clearly did not exist on the actual machines, I can only conclude that the maintenance in some way affected Nagios. However, the maintenance was in relation to a completely different user and application.

Any ideas on what would cause Nagios to panic like this?

Thanks to all!
eloyd
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Nagios reported false load values

Post by eloyd »

How are you monitoring the load? Passive checks? Active checks? NRPE? Something else?

I'm guessing that you looked at load with the "uptime" command, which shows load averaged over time (the last 1, 5, and 15 minutes). Nagios's check_load also checks load averages. Depending on how you are actually obtaining the load, though, your instantaneous load may be quite a bit higher: a short-term spike in CPU usage that some tools will see because they look at current load rather than load averaged over time.
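To make the averaged-load comparison concrete, here is a minimal Python sketch of the kind of test a load check performs. It is not the check_load plugin's source: the function name classify_load, the threshold triples, and the strict "greater than" comparison are all assumptions chosen for illustration. Only os.getloadavg(), which returns the 1-, 5-, and 15-minute load averages on Unix-like systems, is a real standard-library call.

```python
import os

# Nagios-style status codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical thresholds for the 1-, 5-, and 15-minute averages.
# These are illustrative values, not defaults taken from any plugin.
WARN = (5.0, 4.0, 3.0)
CRIT = (10.0, 6.0, 4.0)


def classify_load(loads, warn=WARN, crit=CRIT):
    """Map a (1, 5, 15)-minute load triple to a Nagios-style status.

    Assumption: a status is raised as soon as any one of the three
    averages exceeds its matching threshold.
    """
    if any(load > limit for load, limit in zip(loads, crit)):
        return CRITICAL
    if any(load > limit for load, limit in zip(loads, warn)):
        return WARNING
    return OK


if __name__ == "__main__":
    # The same averaged figures that uptime and top report.
    loads = os.getloadavg()
    status = classify_load(loads)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    print(f"{label} - load average: {loads[0]:.2f}, {loads[1]:.2f}, {loads[2]:.2f}")
```

Running this on a monitored host at the moment an alert fires, and comparing its output with what Nagios recorded, is one way to tell whether the averaged load really was high or whether the value Nagios displayed came from somewhere else.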

This may or may not have been the result of your maintenance. It's very hard to tell without knowing what was done and how you're monitoring.
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd • I'm a Nagios Fanatic!
gregg_hughes_ISC

Re: Nagios reported false load values

Post by gregg_hughes_ISC »

Hello, Eric!

I'm just using the plain-vanilla check_load command in Nagios. I checked against the monitored servers with top, so I did get an average load. The loads were 0.XX and a couple 1.0X on the actual servers.
eloyd

Re: Nagios reported false load values

Post by eloyd »

Are you checking with passive checks or some sort of active check (NRPE, check_by_ssh, something else)?

And again, without knowing the nature of the maintenance or what your load results were, it's very hard to offer any advice.

Regardless, I'm guessing that what you saw was very transient as a result of your maintenance and "if it ain't broke" now, then it's not going to be easy to fix, either. :-)