Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
fleish wrote: FWIW - I experienced similar behavior when upgrading from 4.4.3 -> 4.4.5. Downgrading back to 4.4.3 fixed it before I found this thread: https://i.imgur.com/SOUtJmX.jpg
The graph looks pretty similar to mine, even the peaks and troughs. As long as you apply the fix of setting max_concurrent_checks to 15, or whatever value allows your checks to be spread evenly, you should be fine on 4.4.5. I've had no problems with spikes since doing that. The problem should only come back if you stop Nagios for more than 5 minutes, which causes the checks to bunch up again.
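For reference, a minimal sketch of that setting in nagios.cfg. The value 15 is simply the one that worked here; treat it as a starting point and adjust it until your checks spread evenly.

```cfg
# nagios.cfg
# Cap the number of service checks allowed to run in parallel.
# 0 means no limit. 15 is the value mentioned above, not a general recommendation.
max_concurrent_checks=15
```

Nagios needs a restart to pick up the change.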
I see you have livestatus enabled, I'm not sure if it could be causing any issue, but would it be possible to disable the livestatus module in the nagios.cfg to see if the problem persists?
scottwilkerson wrote: I see you have livestatus enabled, I'm not sure if it could be causing any issue, but would it be possible to disable the livestatus module in the nagios.cfg to see if the problem persists?
I disabled 'livestatus' and tested again. Same issue: 80% of checks were rescheduled to run at the same time, with the rest spaced over the 8 seconds after that. Very high load was recorded, as usual.
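For anyone repeating the test, disabling the module should just be a matter of commenting out its broker_module line in nagios.cfg and restarting Nagios. A rough sketch follows; both paths are hypothetical and will depend on how livestatus was installed on your system.

```cfg
# nagios.cfg
# Livestatus is loaded through a broker_module line. Commenting it out disables the module.
# Both paths below are placeholders, not taken from this thread.
#broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live
```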
I'm on the move at the moment due to the holidays but will get this tested tomorrow evening and let you know.
regards,
Aidan
I've finally got round to testing the auto rescheduling options. I set them as per above and this has resolved the issue. I tested as before, stopping Nagios for over 5 minutes to let all the checks bunch up. After starting Nagios, it showed the usual 80% of checks scheduled to run in the same second. However, about 30-40 seconds before they were due to run, the auto rescheduling kicked in and spread them out evenly over the next 5 minutes, avoiding the huge CPU spike.
I have left Nagios running with the auto rescheduling options in place and will let you know if I notice any performance hit. Host and service check latency is low, so it looks like it is working fine.
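For anyone landing here later, this is roughly what the auto rescheduling section of nagios.cfg looks like. The three option names are the standard Nagios Core ones. The values are illustrative assumptions chosen to match the behaviour described above (a pass about every 30 seconds, checks spread over 5 minutes), not necessarily the exact values suggested earlier in the thread.

```cfg
# nagios.cfg
# Turn on automatic rescheduling of active checks.
# The Nagios documentation marks this feature as experimental and warns it can affect performance.
auto_reschedule_checks=1

# How often, in seconds, Nagios attempts a rescheduling pass (assumed value).
auto_rescheduling_interval=30

# How far ahead, in seconds, Nagios looks when picking checks to smooth out (assumed value).
auto_rescheduling_window=300
```

Restart Nagios after changing these, and keep an eye on host and service check latency afterwards.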