Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 4:26 pm
by drcentner
avandemore wrote:You should be able to resolve this part by setting retain_status_information=1:
It appears that we already have that set unless I'm missing something:

Code:

grep -r retain_status_information /usr/local/nagios/etc/
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1

Code:

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|     2010 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|    14543 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.03 sec)
Does anything need to be changed or modified in /usr/local/nagios/etc/nagios.cfg?

Code:

grep retain /usr/local/nagios/etc/nagios.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 5:04 pm
by avandemore
Is state_retention_file defined?

Also, I'm pretty sure settings like retained_host_attribute_mask=0 would be better left at their defaults, i.e. not present in the configuration.

EDIT: Upon further checking that setting is fine.

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 5:13 pm
by drcentner
avandemore wrote:Is state_retention_file defined?
Yes

Code:

grep retention /usr/local/nagios/etc/nagios.cfg
retention_update_interval=60
state_retention_file=/usr/local/nagios/var/retention.dat
avandemore wrote:Also, I'm pretty sure settings like retained_host_attribute_mask=0 would be better left at their defaults.
I have commented out all the definitions starting with "retained_" and ending in "_mask".

Code:

grep retain /usr/local/nagios/etc/nagios.cfg
#retained_contact_host_attribute_mask=0
#retained_contact_service_attribute_mask=0
#retained_host_attribute_mask=0
#retained_process_host_attribute_mask=0
#retained_process_service_attribute_mask=0
#retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
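
For anyone wanting to repeat that step, here is a minimal sketch of commenting out those directives with sed instead of by hand. The function name comment_out_masks is only an illustration, and it prints the result to stdout rather than editing nagios.cfg in place, so review the output and keep a backup before replacing the real file.

```shell
#!/bin/bash
# Hypothetical helper: print a copy of the given config file with every
# active "retained_*_mask=" directive commented out.
# The file itself is not modified; lines already starting with "#" are left alone.
comment_out_masks() {
    sed -E 's/^(retained_[a-z_]+_mask=)/#\1/' "$1"
}

# Example (path taken from the posts above):
# comment_out_masks /usr/local/nagios/etc/nagios.cfg | grep retain
```

Writing to stdout first makes it easy to diff the result against the live file before committing to the change.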

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 5:16 pm
by avandemore
I edited my last post, but seemingly too late.
EDIT: Upon further checking that setting is fine.

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 5:28 pm
by drcentner
avandemore wrote:I edited my last post, but seemingly too late.
EDIT: Upon further checking that setting is fine.
No worries. After seeing your initial comment, I read up on those in the Nagios documentation, and it probably is best to leave them commented out, unless you actually recommend enabling them.

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 15, 2016 5:45 pm
by avandemore
I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being some bug with it, but I can't find it ATM.

Re: Load Spikes on 7 Hour Intervals

Posted: Tue Nov 29, 2016 8:14 pm
by drcentner
rkennedy wrote:Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with this as the contents and this will produce a log file which shows us the highest spiking CPU processes -
Code:
#!/bin/bash
date=$(date)
echo -e "$date" >> /tmp/checktopcpu.txt
ps -eo pcpu,args --sort=-%cpu|head >> /tmp/checktopcpu.txt
echo -e "\n" >> /tmp/checktopcpu.txt
I have attached the resulting checktopcpu.txt file to this post (after redacting confidential information). The process most commonly using the most CPU is MySQL.
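
In case it helps anyone reading along, this is roughly how a log like that can be tallied. It is a sketch, assuming the file has the layout the event handler above produces (a date line, the "%CPU COMMAND" header from ps, then process lines sorted by CPU). The function name top_cpu_tally is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: for each sample in the log, take the first process line
# after the "%CPU COMMAND" header (the top CPU consumer, since ps sorted by
# -%cpu) and count how often each command appears in that position.
top_cpu_tally() {
    awk '/^ *%CPU/ {
             if ((getline line) > 0) {
                 split(line, f, " ")   # f[1] = %CPU value, f[2] = command
                 print f[2]
             }
         }' "$1" | sort | uniq -c | sort -rn
}

# Example: top_cpu_tally /tmp/checktopcpu.txt
```

Only the first word of the command is counted, so different argument lists for the same binary are grouped together.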


avandemore wrote:I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being some bug with it, but I can't find it ATM.
Were you able to find anything out regarding retention status? I have started the process of migrating our Selenium checks to a new server, but that will take some time due to internal procedures. No need to keep this thread open while waiting on that to be completed, but if you have any additional suggestions regarding the load spikes and config application latency prior to locking the thread, that would be great.

Re: Load Spikes on 7 Hour Intervals

Posted: Wed Nov 30, 2016 11:59 am
by avandemore
That file gives a bit more insight. My best guess at this point is that the sheer volume of DB activity is causing the load spikes. Since you seem to have quite a large installation, have you considered offloading the DB? Here is a document on it:

https://assets.nagios.com/downloads/nag ... Server.pdf

This would allow Nagios a bit more breathing room. It will increase DB latency somewhat, but it will not generate as much load on the Nagios box. If you do choose to offload the DB, that system should have plenty of memory and fast disks. Backups of that new system also need to be accounted for. It generally works pretty well as long as you keep it on the standard port.

You could also consider setting the values in XI > Admin > Performance Settings > Database > NDOUtils to the minimum useful setting.

Enabling XI > Admin > Performance Settings > Backend Cache may also help, but be sure to understand the ramifications of turning that on. It could easily be unsuitable for your environment.

Also, how many cores are on this system? Run lscpu | grep -i socket and multiply the values.

The recommended value for load_threshold in /usr/local/nagios/etc/pnp/npcd.cfg is 4 * the above value.
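
As a rough sketch of that arithmetic, the function below reads lscpu output and prints 4 * (cores per socket * sockets). The function name recommended_load_threshold is hypothetical, and it assumes lscpu prints the English "Core(s) per socket" and "Socket(s)" labels, hence the LANG=C in the example.

```shell
#!/bin/bash
# Hypothetical helper: read lscpu output on stdin and print 4 * total cores,
# where total cores = "Core(s) per socket" * "Socket(s)".
# Assumes the English (LANG=C) field labels.
recommended_load_threshold() {
    awk -F: '
        /^Core\(s\) per socket/ { cores = $2 + 0 }
        /^Socket\(s\)/          { sockets = $2 + 0 }
        END                     { print 4 * cores * sockets }
    '
}

# Example: LANG=C lscpu | recommended_load_threshold
```

The printed number is only the suggested starting point from the rule of thumb above; compare it against the load_threshold currently set in npcd.cfg before changing anything.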

Also, I assume you've gone through this document in the past? https://assets.nagios.com/downloads/nag ... ios-XI.pdf

Re: Load Spikes on 7 Hour Intervals

Posted: Thu Dec 01, 2016 4:23 pm
by WillemDH
FYI, I'm seeing the same symptoms during an apply. Hosts and services suddenly show up even though they are in downtime or acknowledged. I've mentioned this a few times over the last few years in previous threads, but no solution was ever found. As I said before, IMHO this is a big issue, as people no longer trust Nagios statuses. Is it a real problem? Or is someone just applying a configuration....?
I suggested a better way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution.... https://github.com/NagiosEnterprises/na ... issues/176

Re: Load Spikes on 7 Hour Intervals

Posted: Thu Dec 01, 2016 5:45 pm
by BanditBBS
WillemDH wrote:FYI, I'm seeing the same symptoms during an apply. Hosts and services suddenly show up even though they are in downtime or acknowledged. I've mentioned this a few times over the last few years in previous threads, but no solution was ever found. As I said before, IMHO this is a big issue, as people no longer trust Nagios statuses. Is it a real problem? Or is someone just applying a configuration....?
I suggested a better way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution.... https://github.com/NagiosEnterprises/na ... issues/176
+1