Nagios 4 Load issues

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Wilb
Posts: 9
Joined: Wed Oct 19, 2016 3:19 am

Nagios 4 Load issues

Post by Wilb

Moderator Edit: This thread has been split from another - https://support.nagios.com/forum/viewtopic.php?t=27068
In the future, please create a new thread and link to the old one instead of adding on.


Sorry to bump an old thread, but I'm experiencing very similar behaviour on some Nagios hosts I've recently built. Initially I built a RHEL7 host using the Nagios 4.0.8 that comes with the distro, but I've since built an identical instance using 4.2.1 from source, in the hope that the problem was a bug that had since been fixed. That instance shows the same behaviour. In fact, every host I've built across our test, stage and production environments shows the same regular, spiky load pattern. The hosts seem to perform fine even during these load spikes.

The auto rescheduler is already disabled:

$ grep auto nagios.cfg
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180

I've attached a sample of load graphs from my hosts. The busiest of these servers has around 140 hosts with just shy of 4000 service checks.

I noticed the following support article, which refers to nom checkpoints. However, this seems to be a Nagios XI thing (although the article does refer to Nagios Core 4): https://support.nagios.com/kb/article.php?id=150
Attachments
4.0.8_3
4.0.8_2
4.0.8_1
4.0.8.png
Last edited by tmcdonald on Fri Oct 21, 2016 9:09 am, edited 1 time in total.
Reason: Please create a new thread and link to the old one instead of adding on - we try to keep threads to one per user, as this reduces confusion.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Nagios 4 Load issues

Post by rkennedy

It looks like it's coming in waves. Do you know which process is using up the CPU? When it peaks again, could you run ps -eo pcpu,args --sort=-%cpu and post the result for us?

4K checks shouldn't cause issues on their own, but depending on what's running in the backend, it could be just a few checks causing this. I have seen it happen on a system with quite a few failing SNMP checks.
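To make that ps output easier to read at a glance, something like the following Python sketch could group it by program name, so that many short-lived plugin processes (for example a batch of check_snmp runs) add up to one line. The function names summarise_ps and capture_ps are hypothetical, and the sketch assumes the procps version of ps that ships with RHEL 7.

```python
from __future__ import annotations

import os
import subprocess
from collections import defaultdict


def summarise_ps(text: str, top_n: int = 10) -> list[tuple[str, float, int]]:
    """Group `ps -eo pcpu,args` output by program and sum its %CPU.

    Returns (program, total %CPU, process count), highest CPU first.
    """
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for line in text.splitlines()[1:]:  # first line is the %CPU/COMMAND header
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # blank or malformed line
        try:
            pcpu = float(parts[0])
        except ValueError:
            continue
        first = parts[1].split()[0]
        # Kernel threads look like "[kworker/0:1]"; keep those verbatim.
        program = first if first.startswith("[") else os.path.basename(first)
        totals[program] += pcpu
        counts[program] += 1
    ranked = sorted(totals, key=totals.__getitem__, reverse=True)
    return [(p, totals[p], counts[p]) for p in ranked[:top_n]]


def capture_ps() -> str:
    """Run the ps command suggested above (procps ps assumed) and return stdout."""
    return subprocess.run(
        ["ps", "-eo", "pcpu,args", "--sort=-%cpu"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling summarise_ps(capture_ps()) during a spike would return the busiest programs first. Two caveats: ps reports %CPU averaged over each process's lifetime rather than as an instantaneous figure, and a plugin run through an interpreter (perl, python) is counted under the interpreter's name rather than the plugin's.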
Former Nagios Employee
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios 4 Load issues

Post by avandemore

Drilling down into issues like this can be quite tedious, but the first step is to isolate where the problem is. To start, I would like some details about the system and its processes during both a peak and a trough. During those periods, can you capture the output of top -bcn1? If needed, you can use Event Handlers to automate this.
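One way the event-handler idea could look is sketched below in Python. It assumes a service on the Nagios server that checks its own load (for example with check_load), event handlers enabled globally (enable_event_handlers=1) and on that service, and a command definition passing the standard $SERVICESTATE$ and $SERVICESTATETYPE$ macros. The script name capture_top.py, the command name capture-top and the output directory are all hypothetical. Because Nagios runs event handlers on state changes, a WARNING/CRITICAL snapshot approximates a peak and the OK snapshot taken on recovery approximates the start of a trough.

```python
#!/usr/bin/env python3
"""Hypothetical event handler: save `top -bcn1` output on load-check state changes.

Assumed wiring (command name and paths are illustrative only):
    define command {
        command_name  capture-top
        command_line  /usr/local/nagios/libexec/capture_top.py $SERVICESTATE$ $SERVICESTATETYPE$
    }
plus `event_handler capture-top` on the Nagios server's own load service.
"""
from __future__ import annotations

import subprocess
import sys
import time
from pathlib import Path

OUTPUT_DIR = Path("/tmp/nagios-top")  # hypothetical; choose somewhere persistent


def should_capture(state: str) -> bool:
    """WARNING/CRITICAL approximate a peak; OK (a recovery) approximates a trough."""
    return state in {"OK", "WARNING", "CRITICAL"}


def snapshot_name(state: str, state_type: str, when: time.struct_time) -> str:
    """Build a sortable file name such as top-20161021-091500-CRITICAL-HARD.txt."""
    return f"top-{time.strftime('%Y%m%d-%H%M%S', when)}-{state}-{state_type}.txt"


def main(argv: list[str]) -> None:
    if len(argv) < 2:
        print("usage: capture_top.py SERVICESTATE SERVICESTATETYPE", file=sys.stderr)
        return
    state, state_type = argv[0], argv[1]
    if not should_capture(state):
        return  # UNKNOWN usually means the plugin itself failed
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # -b batch mode, -c full command lines, -n1 a single iteration
    top = subprocess.run(["top", "-bcn1"], capture_output=True, text=True)
    path = OUTPUT_DIR / snapshot_name(state, state_type, time.localtime())
    path.write_text(top.stdout)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Naming each file by timestamp, state and state type keeps the snapshots sortable, so the peak and trough captures from the same episode sit next to each other.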

Also, if you are I/O bound, a RAM disk may help; see "Utilizing a RAM Disk in Nagios XI". That document refers to XI but applies to Core as well.
Previous Nagios employee
Wilb

Re: Nagios 4 Load issues

Post by Wilb

I sat down and did some analysis last week while one of the boxes was at a load peak, and my findings were essentially "not a great deal". The boxes aren't I/O bound, nor are they obviously CPU bound. I wondered whether the load figure was simply being skewed by the sheer volume of checks, but if that were the case I would expect a steadier load over time, rather than a pattern that settles back down to 0, since virtually all of these checks run on the same default schedule.

I'll grab some process output next week and report back. The other thing I did today was enable auto_reschedule_checks on the 4.2.1 host, just to see how the load pattern looks with it enabled. Interestingly, that host has sat at a slightly higher but steady load all afternoon and is currently at 0.25, 0.35, 0.34, whereas the 4.0.8 host is back at 0.00, 0.04, 0.06 after a lunchtime spike. I'll be interested to see how this progresses over the weekend.
Wilb

Re: Nagios 4 Load issues

Post by Wilb

Just grabbed a couple of outputs from one of our boxes that's just hitting spike-o-clock. Image attached, outputs to follow.
Attachments
top.png
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: Nagios 4 Load issues

Post by dwhitfield

top -bcn1 (batch mode, full command lines, a single iteration) will actually show different output from the interactive view. Could you post that? Thanks!
Wilb

Re: Nagios 4 Load issues

Post by Wilb

Yep, I've got more detailed output on a machine in the office; I just need to obfuscate any sensitive data before posting :-)
dwhitfield

Re: Nagios 4 Load issues

Post by dwhitfield

Wilb wrote: Yep I've got more detailed output on machine in the office, I just need to obfuscate any sensitive data before posting :-)
Sounds good. Smart move!
Wilb

Re: Nagios 4 Load issues

Post by Wilb

Apologies, I haven't had a chance to post that data today. However, attached is a screenshot of the load on the 4.2.1 host where I set auto_reschedule_checks=1 on Friday afternoon. The load now looks perfect. I'll try this on the 4.0.8 hosts tomorrow and see how that progresses.
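One plausible reading of why this helped (not confirmed in this thread): the Linux load average is an exponentially damped count of runnable tasks, so when many checks fall due in the same few seconds, their plugin processes start together, the 1-minute figure jumps, and then it decays. As described in the nagios.cfg documentation, auto rescheduling periodically looks at checks due within auto_rescheduling_window seconds and tries to spread them out (the sample nagios.cfg comments also flag the option as experimental). The Python toy model below is not Nagios's scheduler code. It only counts check starts per 10-second bucket before and after evenly respreading a cluster across a 180-second window, and the function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter

WINDOW = 180.0  # seconds; mirrors auto_rescheduling_window=180 in the thread
BUCKET = 10.0   # histogram bucket width in seconds (arbitrary choice)


def respread(times: list[float], window: float = WINDOW) -> list[float]:
    """Toy rescheduler: space checks due within [0, window) evenly across it.

    Checks due outside the window are left where they are.
    """
    due = [t for t in times if 0 <= t < window]
    outside = [t for t in times if not 0 <= t < window]
    if not due:
        return outside
    step = window / len(due)
    return [i * step for i in range(len(due))] + outside


def peak_per_bucket(times: list[float], bucket: float = BUCKET) -> int:
    """Largest number of check starts falling into any one bucket."""
    counts = Counter(int(t // bucket) for t in times)
    return max(counts.values(), default=0)


# 360 checks all due within the first 10 seconds, before and after respreading.
clustered = [i / 36 for i in range(360)]
print(peak_per_bucket(clustered), peak_per_bucket(respread(clustered)))  # prints: 360 20
```

The same total work is done in both cases; only the peak number of simultaneous starts changes, which matches the "slightly higher but steady" load reported above.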
Attachments
Screenshot_2016-10-24_20-15-40.png
dwhitfield

Re: Nagios 4 Load issues

Post by dwhitfield

Good to hear things are working on 4.2.1. We await word on 4.0.8.
Locked