Nagios 4 Load issues

liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Nagios 4 Load issues

Post by liquidcool »

All,

I am currently working on a project to upgrade all our Nagios hosts from 3.5 to 4.0.6.

I have created a new instance of Nagios 4.0.6 and set it up to run exactly the same checks as the server it will be replacing. The only difference is that the new server runs the checks using Python, while the old 3.x server runs the checks in Perl.

The specs of each server are exactly the same: quad processor with 4 GB RAM (both VMs are in the same farm with the same processor specs). There is no contention on the VM farm at all.

I have noticed a massive difference between the two instances. The 3.5 server runs at an extremely low load average, and CPU utilization sits constant at 5%. The 4.0.6 server, on the other hand, runs a lot higher and more erratically. Let me explain:
The 4.0.6 server's CPU runs at a constant average of 15%, and its load average is correspondingly higher. What is even more disturbing is that after the service has been running for a couple of hours, its load starts spiking every one and a half to two hours. I have uploaded a screenshot of the load graph. In the graph you can see where I started the service at about 4 or 5 pm. It is stable for a while and then goes doolally.

The following tweaks are on both servers :
use_retained_scheduling_info = 0
use_large_installation_tweaks = 1
max_service_check_spread = 60
max_host_check_spread = 60
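
Since the comparison only holds if both servers really carry the same tuning, here is a minimal sketch for diffing two main config files. It assumes plain `key=value` lines with `#` comments; the function names and the file paths in the usage note are hypothetical, not anything Nagios ships.

```python
def parse_main_cfg(text):
    """Parse nagios.cfg-style 'key=value' text into {key: [values]}."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Directives such as cfg_file can repeat, so keep every value.
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


def diff_settings(old, new):
    """Return {key: (old_values, new_values)} for every key that differs."""
    diffs = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            diffs[key] = (old.get(key), new.get(key))
    return diffs
```

To use it, read each server's main config into a string (for example a copy saved as `old-nagios.cfg` and `new-nagios.cfg`, both hypothetical names), pass each through `parse_main_cfg`, and print the result of `diff_settings`. A key present on only one side shows up with `None` for the other.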

Both servers are running about 3100 checks over 150 hosts, so really not a lot. Actually, the newer server is running fewer checks, as we have refined what needs to be checked. There are no checks with a check interval of an hour or more; all check intervals are between 1 and 10 minutes.

Has anyone seen this type of behavior before? Is there something I can try or look at to get a better idea of what is happening? I am not going to agree to this going into production until this is resolved.

Thanks
Attachments
4.0.6 Load
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 4 Load issues

Post by abrist »

This is actually counter-intuitive and at odds with the results we have seen on our internal boxes. Core 4 should offer better performance.
You mentioned that you moved a bunch of checks from perl plugins to python ones. Could this be the cause of the load spike? Maybe there are issues with the new plugins on the core 4 server?
If you run top on both systems, what are the top 3 or 4 processes with the most cpu time?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Nagios 4 Load issues

Post by slansing »

Have you tried any other Core 4 versions or just 4.0.6? It sounds like this is just a bare Core server correct? With the minimally required packages/dependencies/plugins installed? No tweaks beyond that such as in house custom VM images?

Could you grab the output of "top" when you hit one of these spikes? And do the spikes stay fairly consistent in their load range after a while, or are they continually growing even beyond the point of the graph you showed?
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios 4 Load issues

Post by tmcdonald »

Can you clarify one point? When you say the old server used perl and the new server uses python, do you mean you converted all the perl plugins to their python counterparts? And was the old server using the embedded perl option?
Former Nagios employee
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

tmcdonald:
Correct. I have rewritten all the Perl scripts in Python. The previous version did have the embedded Perl option.

slansing:
I have just gone and installed 4.0.5 to see if the issue is in that also. It is just a bare Core server, pnp4nagios, and nrpe. No tweaks besides the ones I mentioned.
I could grab top, but CPU does not spike, and no processes are hogging CPU (all the ones at the top are at either 0% or 1% usage). There is no swap usage. Disk activity looks constant before, during and after the spike, so there are no disk-related issues (as I would expect: this is a SAN-attached device, and if it had issues I would see them across the board for all my VMs). The spikes stay in the same range: the load fluctuates but does not keep growing.

abrist:
I agree with you. I also read the specs of Core 4 and agree it should offer better performance, but I am seeing a noticeable difference in the CPU averages: 5% compared to 15%. Losing the embedded Perl in the move from 3.x to 4 might account for that, so it does not worry me too much.
I can't see how the Python scripts would cause the load spikes. If they were badly written (which I am confident they are not), I would see a constant high(ish) load. But the load can go down to almost 0 while all the scripts/checks are still running, and then suddenly spike every 1 hour and 45 minutes (I have pretty much timed them now). That makes me think it is not the scripts, as no checks have that interval. Nagios should spread the checks out, which I am confident it is doing, and setting max_service_check_spread = 60 and max_host_check_spread = 60 should give it more to work with in terms of timing.
If I run top, there is nothing specific taking the top 3 spots. You see nagios in there a little, a check script here or there, maybe even the HTTP daemon, but they are all at 0% or 1%. Nothing is constantly smacking the CPU.
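
One thing that could help narrow this down: on Linux the load average counts tasks that are runnable and tasks in uninterruptible sleep (state D), so load can climb while CPU usage stays near zero. Below is a minimal sketch, assuming a Linux host with `/proc` mounted, that records the load figures together with every process currently in state R or D. The function names are made up for illustration, and it only looks at each process's main thread, not individual threads.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total": int(total),
    }


def state_from_stat(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may contain spaces or ')',
    # so split on the last ')' and take the first field after it.
    return stat_text.rsplit(")", 1)[1].split()[0]


def snapshot(proc_root="/proc"):
    """Return (load dict, [(pid, state, command)]) for tasks in state R or D."""
    with open(os.path.join(proc_root, "loadavg")) as handle:
        load = parse_loadavg(handle.read())
    busy = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                stat_text = handle.read()
        except (IOError, OSError):
            continue  # the process exited between listdir() and open()
        state = state_from_stat(stat_text)
        if state in ("R", "D"):
            comm = stat_text[stat_text.index("(") + 1:stat_text.rindex(")")]
            busy.append((int(entry), state, comm))
    return load, busy
```

Calling `snapshot()` once a minute from cron or a loop and logging the result whenever `load1` goes above its normal level would show whether the spikes line up with a pile of runnable check processes or with tasks stuck in D state.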
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

A little update on this.

I think I may have found the cause of the higher CPU usage.

As a result of using Python, I am also using the netsnmp Python module. When you compile it, it creates a Python egg file as the module. This file is basically a zip file, so every time a check runs, the file has to be unzipped into a cache folder and the modules in that cache folder are used to run the check.

What I did was unzip the egg file and rename the original egg file to something else. As soon as I did that, the CPU utilization dropped by more than half. It is now almost comparable with 3.x; still slightly higher, but I think I can accept that.
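
For anyone wanting to check whether one of their own modules is being loaded from a zipped egg, here is a rough sketch. It assumes the module's `__file__` path runs through the `.egg` archive, which is how zipped eggs normally show up; the helper name is made up for illustration.

```python
import os


def zipped_egg_path(module_file):
    """Return the zipped .egg that module_file is loaded from, or None."""
    path = os.path.abspath(module_file)
    while True:
        # A zipped egg is a regular file; an unpacked egg is a directory.
        if path.endswith(".egg") and os.path.isfile(path):
            return path
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            return None
        path = parent
```

Usage would be something like `import netsnmp` followed by `print(zipped_egg_path(netsnmp.__file__))`. A path means the package is still coming out of the zip file; `None` means it is being imported from ordinary files on disk.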

I am still waiting to see whether the load spikes after an hour and 45 minutes. If it does not, I will recompile 4.0.6, install that, and see what happens then.

Will keep you posted.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios 4 Load issues

Post by tmcdonald »

Thanks for the update. Being more of a perl guy myself I don't know a lot about the python stack, but that makes sense. Looking forward to your results.
Former Nagios employee
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

OK, here is a further update.

After finding that little issue with the netsnmp Python module, I am still seeing load spikes every 1 hour and 45 minutes, though these spikes are not as high as before: the load average probably peaks at about 2 or 3. I have built a duplicate system at our office in the US. They are running pretty much the same setup, virtualized and all, and I am now seeing the exact same issue there, though there the load spikes to over 50. The only difference is that they are running SNMPv3 while we use SNMPv2. I have uploaded a recent graph.

Has anyone got any ideas or suggestions I can try to get this working?

Thanks
Attachments
US Nag Load
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios 4 Load issues

Post by abrist »

Have you identified the process that is the main contributor to the load spike?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
liquidcool
Posts: 59
Joined: Tue Feb 21, 2012 6:08 am

Re: Nagios 4 Load issues

Post by liquidcool »

No, I have not. And I am seeing that this is more of a version 4 issue. I installed 3.5.1 and ran the same checks with pretty much the same config, and I DON'T get these spikes at all. The load stays constant (it fluctuates a little, as you would expect), but overall there are no massive spikes like these.

There is definitely something that version 4 is doing. Maybe a cleanup; I don't really know. Maybe there is an issue with how it works with Python 2.6.8.

I don't know where to go from here, besides downgrading everything and sticking with 3.5.1, as that seems a lot more stable.

If someone at Nagios has seen this post, I would be more than willing to help out in finding the root cause of this.