Load Spikes on 7 Hour Intervals

drcentner · Post by **drcentner** » Tue Nov 15, 2016 4:26 pm

avandemore wrote:You should be able to resolve this part by setting retain_status_information=1:

It appears that we already have that set unless I'm missing something:

grep -r retain_status_information /usr/local/nagios/etc/
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1

Code: Select all

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|     2010 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|    14543 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.03 sec)

Does anything need to be changed in /usr/local/nagios/etc/nagios.cfg?

Code: Select all

grep retain /usr/local/nagios/etc/nagios.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1

avandemore · Post by **avandemore** » Tue Nov 15, 2016 5:04 pm

Is state_retention_file defined?

Also, I'm pretty sure settings like retained_host_attribute_mask=0 would be better left at their defaults, i.e. not present in the configuration.

EDIT: Upon further checking that setting is fine.

drcentner · Post by **drcentner** » Tue Nov 15, 2016 5:13 pm

avandemore wrote:Is state_retention_file defined?

Yes

Code: Select all

grep retention /usr/local/nagios/etc/nagios.cfg
retention_update_interval=60
state_retention_file=/usr/local/nagios/var/retention.dat

avandemore wrote:Also pretty sure setting things like retained_host_attribute_mask=0 would be better off to be left as default.

I have commented out all the definitions starting with "retained_" and ending in "_mask".

Code: Select all

grep retain /usr/local/nagios/etc/nagios.cfg
#retained_contact_host_attribute_mask=0
#retained_contact_service_attribute_mask=0
#retained_host_attribute_mask=0
#retained_process_host_attribute_mask=0
#retained_process_service_attribute_mask=0
#retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1

avandemore · Post by **avandemore** » Tue Nov 15, 2016 5:16 pm

I edited my last post, but seemingly too late.

EDIT: Upon further checking that setting is fine.

drcentner · Post by **drcentner** » Tue Nov 15, 2016 5:28 pm

avandemore wrote:I edited my last post, but seemingly too late.
EDIT: Upon further checking that setting is fine.

No worries. After seeing your initial comment, I read up on those in the Nagios documentation, and it probably is best to leave them commented out, unless you actually recommend enabling them.

avandemore · Post by **avandemore** » Tue Nov 15, 2016 5:45 pm

I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being some bug with it, but I can't find it ATM.

drcentner · Post by **drcentner** » Tue Nov 29, 2016 8:14 pm

rkennedy wrote:Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with the following contents. It will produce a log file that shows us the processes using the most CPU:

Code: Select all

#!/bin/bash
# Append a timestamped snapshot of the top CPU consumers to a log file.
date=$(date)
echo -e "$date" >> /tmp/checktopcpu.txt
# head keeps the ps header line plus the nine busiest processes.
ps -eo pcpu,args --sort=-%cpu | head >> /tmp/checktopcpu.txt
echo -e "\n" >> /tmp/checktopcpu.txt

I have attached the resulting checktopcpu.txt file to this post (after redacting confidential information). The process that most often tops the list is MySQL.
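For anyone who wants to tally a log like this themselves, here is a minimal sketch. It assumes the file has the layout the event handler above produces: a date line, the ps header line starting with %CPU, then the process lines. The function name top_cpu_offenders is hypothetical, and it only counts which command sits first in each sample; it does not average CPU usage.

```shell
#!/bin/bash
# Sketch only: count how often each command is the top CPU consumer
# in a checktopcpu.txt-style log read from stdin.
# Assumes each sample contains a "%CPU COMMAND" header from ps,
# immediately followed by the busiest process.
top_cpu_offenders() {
    awk '
        /^ *%CPU/ { want = 1; next }             # header: next line is the top process
        want      { count[$2]++; want = 0 }      # $2 is the first word of the command line
        END       { for (cmd in count) print count[cmd], cmd }
    ' | sort -rn                                  # most frequent first
}
```

Usage would be something like `top_cpu_offenders < /tmp/checktopcpu.txt`, which prints one line per command with the number of samples it led.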

avandemore wrote:I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being some bug with it, but I can't find it ATM.

Were you able to find anything out regarding retention status? I have started the process of migrating our Selenium checks to a new server, but that will take some time due to internal procedures. No need to keep this thread open while waiting on that to be completed, but if you have any additional suggestions regarding the load spikes and config application latency prior to locking the thread, that would be great.

avandemore · Post by **avandemore** » Wed Nov 30, 2016 11:59 am

That file gives a bit more insight. My best guess at this point is that the sheer volume of DB activity is causing the load spikes. Since you seem to have quite a large installation, have you considered offloading the DB? Here is a document on it:

https://assets.nagios.com/downloads/nag ... Server.pdf

This would give Nagios a bit more breathing room. It will increase DB latency somewhat, but will not generate as much load on the Nagios box. If you do choose to offload the DB, that system should have plenty of memory and fast disks. Backups of that new system also need to be accounted for. It generally works pretty well as long as you keep it on the standard port.

Also you could consider setting the values in XI > Admin > Performance Settings > Database > NDOUtils to the minimum useful setting.

Enabling XI > Admin > Performance Settings > Backend Cache may also help, but be sure to understand the ramifications of turning that on. It could easily be unsuitable for your environment.

Also, how many cores are on this system? Run lscpu | grep -i socket and multiply the values.

The recommended value for load_threshold in /usr/local/nagios/etc/pnp/npcd.cfg is 4 * the above value.
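As a rough sketch of that arithmetic, the multiplication can be scripted, assuming lscpu prints the English labels "Socket(s)" and "Core(s) per socket". The function name suggest_load_threshold is made up for illustration, and it only prints a number; it does not touch npcd.cfg.

```shell
#!/bin/bash
# Sketch only: read lscpu-style text on stdin and print
# 4 * sockets * cores-per-socket as a candidate load_threshold.
# Assumes English lscpu labels; localized output would not match.
suggest_load_threshold() {
    awk -F: '
        /^Core\(s\) per socket/ { cores = $2 + 0 }     # "+ 0" strips padding, forces a number
        /^Socket\(s\)/          { sockets = $2 + 0 }
        END {
            if (cores > 0 && sockets > 0) print 4 * cores * sockets
            else exit 1                                # labels not found: signal failure
        }'
}
```

On a live system this would be run as `lscpu | suggest_load_threshold`. A box reporting 2 sockets with 4 cores per socket would print 32.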

Also, I assume you've gone through this document in the past? https://assets.nagios.com/downloads/nag ... ios-XI.pdf

Post by **WillemDH** » Thu Dec 01, 2016 4:23 pm

FYI, I'm seeing the same symptoms during an apply. Hosts and services suddenly show up even though they are in downtime or acknowledged. I've mentioned this a few times over the last years in previous threads, but no solution was ever found. As I said before, IMHO this is a big issue, as people no longer trust Nagios statuses. Is it a real problem? Or is someone just applying a configuration....?
I suggested a better way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution.... https://github.com/NagiosEnterprises/na ... issues/176

Post by **BanditBBS** » Thu Dec 01, 2016 5:45 pm

WillemDH wrote:FYI, I'm seeing the same symptoms during an apply. Hosts and services suddenly show up even though they are in downtime or acknowledged. I've mentioned this a few times over the last years in previous threads, but no solution was ever found. As I said before, IMHO this is a big issue, as people no longer trust Nagios statuses. Is it a real problem? Or is someone just applying a configuration....?
I suggested a better way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution.... https://github.com/NagiosEnterprises/na ... issues/176

+1

Nagios Support Forum