Load Spikes on 7 Hour Intervals

avandemore · Post by **avandemore** » Thu Dec 01, 2016 5:58 pm

During these restarts, does the information show up in Core?

drcentner · Post by **drcentner** » Thu Dec 01, 2016 6:48 pm

avandemore wrote:have you considered offloading the db?

Considered that, but we try to avoid introducing new dependencies into our production monitoring environment as much as possible. Offloading the DB would introduce additional network and storage dependencies. If we were to offload the DB, could you provide guidance on recommended specs for a DB host for a monitoring environment of our size (2,038 hosts, 14,740 services, 14,000 active checks every 5 minutes)?
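As a rough sizing input for that question, the figures quoted above can be turned into an average check rate. This is only a sketch of the arithmetic: the function name is hypothetical, and the result is the mean active-check rate, not a measurement of database write load.

```python
def checks_per_second(active_checks: int, interval_minutes: float) -> float:
    """Average active checks per second over one check interval."""
    return active_checks / (interval_minutes * 60)


# Figures from the post: 14,000 active checks every 5 minutes.
rate = checks_per_second(14_000, 5)
print(f"{rate:.1f} checks/s")  # prints "46.7 checks/s"
```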

avandemore wrote:Also you could consider setting the values in XI > Admin > Performance Settings > Database > NDOUtils to the minimum useful setting.

I have made the following adjustments:

Code:

Nagios XI Database
Increased Optimize Interval: from 60 to 240

NDOUtils Database
Reduced Max External Commands Age: from 7 to 1
Reduced Max Log Entries Age: from 15 to 3
Reduced Max Notifications Age: from 15 to 3
Increased Optimize Interval: from 60 to 240

NagiosQL Database
Reduced Max Logbook Age: from 2880 to 480
Increased Optimize Interval: from 60 to 240

avandemore wrote:Enabling XI > Admin > Performance Settings > Backend Cache may also help, but be sure to understand the ramifications of turning that on. It could easily be unsuitable for your environment.

We require realtime data so this is not an option.

avandemore wrote:Also, how many cores are on this system? Run lscpu | grep -i socket and multiply the values.

12

Code:

lscpu |grep -i socket
Core(s) per socket:    6
Socket(s):             2

avandemore wrote:The recommended value for load_threshold in /usr/local/nagios/etc/pnp/npcd.cfg is 4 * the above value.

I increased the load_threshold to 48 as recommended and restarted the npcd service.
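The calculation above (cores per socket times sockets, then times 4) can be sketched as follows. This is a minimal illustration, not part of Nagios XI or npcd: the function names are hypothetical, and the sample text is the lscpu output quoted in this post.

```python
def cores_from_lscpu(lscpu_output: str) -> int:
    """Multiply 'Core(s) per socket' by 'Socket(s)' from lscpu output."""
    values = {}
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return int(values["Core(s) per socket"]) * int(values["Socket(s)"])


def recommended_load_threshold(cores: int, factor: int = 4) -> int:
    """Apply the 4x-cores rule of thumb quoted in the thread."""
    return factor * cores


# Output of `lscpu | grep -i socket` as posted above.
LSCPU_SAMPLE = """Core(s) per socket:    6
Socket(s):             2
"""

cores = cores_from_lscpu(LSCPU_SAMPLE)
print(cores, recommended_load_threshold(cores))  # prints "12 48"
```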

avandemore wrote:Also I assume you've got through this document in the past? https://assets.nagios.com/downloads/nag ... ios-XI.pdf

Correct. However, I reviewed it again and made the following additional changes:
Disabled:

Listener For Unconfigured Objects
Subsystem Logging

The changes described above were made yesterday afternoon, and we are still experiencing load spikes. There were two today, lasting 1 hour 20 minutes and 1 hour 13 minutes.

avandemore wrote:During these restarts, does the information show up in Core?

We don't ever use the Core UI to monitor, but I just applied a config in CCM while watching Core to test. Here are the results:

Time from receiving the "Configuration applied successfully" message until the Operations Center component stopped displaying alerts for hosts/services that were acknowledged/downtimed: 3 minutes, 10 seconds
The Nagios Core "Host Information" and "Host Status Details For All Host Groups" web pages accurately reflected acknowledgements and downtimes for the entire duration of the 3:10.

Post by **WillemDH** » Fri Dec 02, 2016 10:12 am

As the apply configuration issue is a different problem than the 7-hour spike interval, I suggest continuing that discussion in this thread:

https://support.nagios.com/forum/viewto ... figuration

avandemore · Post by **avandemore** » Fri Dec 02, 2016 1:06 pm

drcentner wrote:We don't ever use the Core UI to monitor, but I just applied a config in CCM while watching Core to test. Here are the results:
Time from receiving the "Configuration applied successfully" message until the Operations Center component stopped displaying alerts for hosts/services that were acknowledged/downtimed: 3 minutes, 10 seconds
The Nagios Core "Host Information" and "Host Status Details For All Host Groups" web pages accurately reflected acknowledgements and downtimes for the entire duration of the 3:10.

This info is just for systems that do a LARGE amount of checks. This behavior makes sense. Nagios Core itself isn't dependent on the DB while XI and many of its components are.

Essentially, the reason you are seeing this is that NDO is restarted during an Apply Config. NDO startup with a lot of services is extremely disk intensive. This is going to be compounded if your Nagios system is low on RAM. You're already set up with a ramdisk which is tmpfs-backed (<-- very important point).

I don't have details on if and when your system is swapping/paging, but if it is, the tmpfs-backed ramdisk is not helping and is even hurting performance to some extent. tmpfs will page to disk under memory pressure if swap space is available. This would likely be evident during your Selenium runs. To avoid this, you can disable swap on the system or use ramfs. The latter is definitely not recommended; the former comes with caveats that a Linux sysadmin could explain.

In short, in order to speed up NDO startup you'll need to make IO really fast, both disk and network. Even if you don't want to offload MySQL, you could at least leave it on the same system with a different mount point and its own set of fast disks. This would also help NDO.

drcentner · Post by **drcentner** » Fri Dec 02, 2016 11:47 pm

The perfdata directory ( /usr/local/nagios/share/perfdata/ ) is mounted from separate physical disks (15K) from the rest of the OS and DB.

There is plenty of available RAM.
The average amount of free physical memory over the past 30 days is 8670MB. The lowest amount of available free physical memory during the same time span was 4951 MB.
Swap is configured to be used if necessary, but is typically unused. The highest amount of swap used over the past 30 days is less than 150MB.
I checked perfdata graphs of the last 3 load spikes and confirmed that 0MB of swap was used during that time.

If we are still experiencing these issues after migrating Selenium to another host, I will look into getting some additional storage for the DB.

avandemore · Post by **avandemore** » Mon Dec 05, 2016 11:00 am

15k disks don't really tell us much relevant information. What's important is IOPS, and if you go through a large Nagios install and profile it, you'll notice a very large number of IOPS is needed for quick startup. Another way to get more IOPS is a dedicated physical RAMdisk. IO is the startup bottleneck on large installs.
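One way to put a number on IOPS is to compare two snapshots of /proc/diskstats taken a known number of seconds apart. The sketch below assumes the standard Linux diskstats layout (device name in the third column, reads completed in the fourth, writes completed in the eighth). The function names, the device name sda, and the snapshot values are made up for illustration.

```python
def completed_ios(diskstats_text: str, device: str) -> int:
    """Return reads completed + writes completed for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[3]) + int(fields[7])
    raise ValueError(f"device {device!r} not found")


def average_iops(before: str, after: str, device: str, seconds: float) -> float:
    """Average IOPS between two /proc/diskstats snapshots."""
    delta = completed_ios(after, device) - completed_ios(before, device)
    return delta / seconds


# Hypothetical snapshots taken 10 seconds apart (not real measurements).
BEFORE = "   8       0 sda 1000 0 8000 50 2000 0 16000 90 0 100 140"
AFTER = "   8       0 sda 1600 0 12800 80 3400 0 27200 150 0 160 230"

print(average_iops(BEFORE, AFTER, "sda", 10.0))  # prints 200.0
```

On a live system the two snapshots would come from reading /proc/diskstats before and after an Apply Config, which shows how many IOPS the restart actually demands.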

Current free RAM is also a mostly useless metric; swap usage, however, is very important. When your system is under memory pressure, the first thing getting paged to disk is your tmpfs RAM disks.
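Swap usage can be read from the SwapTotal and SwapFree lines of /proc/meminfo. The following is a minimal sketch under that assumption. The function name is hypothetical and the sample values are invented, not taken from the system discussed here.

```python
def swap_used_mb(meminfo_text: str) -> float:
    """Return swap in use (MB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])  # values are in kB
    return (values["SwapTotal"] - values["SwapFree"]) / 1024


# Invented sample in the /proc/meminfo format.
MEMINFO_SAMPLE = """MemTotal:       32780000 kB
MemFree:         8878080 kB
SwapTotal:       4194300 kB
SwapFree:        4040700 kB
"""

print(swap_used_mb(MEMINFO_SAMPLE))  # prints 150.0
```

Sampling this during a load spike, rather than averaging over a month, shows whether the tmpfs ramdisk is being paged out at the moment it matters.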

There are several YouTube videos on large Nagios installs:

https://www.youtube.com/watch?v=bnws1fYSdKM
https://www.youtube.com/watch?v=LFCArPsIOrg&t=172s
https://www.youtube.com/watch?v=6WlZrG-_sAI&t=8s

drcentner · Post by **drcentner** » Tue Jan 10, 2017 5:12 pm

In spite of all the troubleshooting performed, we were still having the regular spikes as recently as Friday afternoon. We upgraded to 5.4.0 on Friday evening. Since that time, we have not had any load spikes. So it appears that the upgrade resolved our load spikes.

Monday morning, we experienced the behavior described at https://support.nagios.com/forum/viewto ... =6&t=41817. At least one reboot and multiple service restarts occurred, without issue, between the completion of the upgrade Friday and the service failure Monday morning. To resolve, we reverted to the pre-upgrade backup from Friday (5.3.2), upgraded to 5.3.4, then upgraded to 5.4.0. It's been about 28 hours since the revert and incremental upgrade, and the service is still running OK so far, with no load spikes. Hopefully that will continue.

avandemore · Post by **avandemore** » Tue Jan 10, 2017 5:42 pm

Good to hear. We'll keep an eye on the segfault issue. So far we have not been able to reproduce it internally.
