Load Spikes on 7 Hour Intervals

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
drcentner
Posts: 20
Joined: Tue Jun 14, 2016 11:39 am
Location: Virginia, USA

Re: Load Spikes on 7 Hour Intervals

Post by drcentner »

avandemore wrote: You should be able to resolve this part by setting retain_status_information=1:
It appears that we already have that set, unless I'm missing something:

Code:

grep -r retain_status_information /usr/local/nagios/etc/
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/hosttemplates.cfg:       retain_status_information                1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1
/usr/local/nagios/etc/servicetemplates.cfg:       retain_status_information                     1

Code:

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|     2010 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_hosts WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 1;
+----------+
| count(*) |
+----------+
|    14543 |
+----------+
1 row in set (0.02 sec)

mysql> SELECT count(*) FROM nagios.nagios_services WHERE retain_status_information = 0;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.03 sec)
Does anything need to be changed in /usr/local/nagios/etc/nagios.cfg?

Code:

grep retain /usr/local/nagios/etc/nagios.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Load Spikes on 7 Hour Intervals

Post by avandemore »

Is state_retention_file defined?

Also, I'm pretty sure settings like retained_host_attribute_mask=0 would be better left at their defaults, i.e. not present in the configuration.

EDIT: Upon further checking, that setting is fine.
Previous Nagios employee
drcentner

Re: Load Spikes on 7 Hour Intervals

Post by drcentner »

avandemore wrote: Is state_retention_file defined?
Yes

Code:

grep retention /usr/local/nagios/etc/nagios.cfg
retention_update_interval=60
state_retention_file=/usr/local/nagios/var/retention.dat
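Having state_retention_file defined only shows that retention is configured. A minimal sketch like the one below could also confirm that the file is actually being rewritten, by comparing its modification time against retention_update_interval (which Nagios reads in minutes). The helper names cfg_value and retention_is_fresh are hypothetical, and GNU stat (stat -c %Y) is assumed.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Nagios XI): check that the retention file
# named in nagios.cfg exists and was written within retention_update_interval.
# Assumes GNU coreutils for stat -c %Y.

# Print the value of a key=value directive from a Nagios config file.
# Commented-out lines are ignored; the last definition wins if repeated.
cfg_value() {
    local cfg=$1 key=$2
    awk -F= -v k="$key" '$1 == k { v = substr($0, length(k) + 2) } END { print v }' "$cfg"
}

# Return 0 if the retention file exists and is younger than the update interval.
retention_is_fresh() {
    local cfg=$1 file interval age
    file=$(cfg_value "$cfg" state_retention_file)
    interval=$(cfg_value "$cfg" retention_update_interval)
    [[ -n $file && -f $file ]] || { echo "retention file missing: '$file'" >&2; return 1; }
    # retention_update_interval is in minutes; fall back to 60 if unset.
    interval=${interval:-60}
    age=$(( $(date +%s) - $(stat -c %Y "$file") ))
    echo "$file last written ${age}s ago (interval ${interval}m)"
    (( age <= interval * 60 ))
}
```

After sourcing the functions, running retention_is_fresh /usr/local/nagios/etc/nagios.cfg returns non-zero if the file is missing or older than the configured interval.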
avandemore wrote: Also, I'm pretty sure settings like retained_host_attribute_mask=0 would be better left at their defaults.
I have commented out all the definitions starting with "retained_" and ending in "_mask".

Code:

grep retain /usr/local/nagios/etc/nagios.cfg
#retained_contact_host_attribute_mask=0
#retained_contact_service_attribute_mask=0
#retained_host_attribute_mask=0
#retained_process_host_attribute_mask=0
#retained_process_service_attribute_mask=0
#retained_service_attribute_mask=0
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
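Commenting out the mask directives by hand works fine. The sketch below does the same thing with a sed one-liner wrapped in a hypothetical comment_out_masks function, assuming GNU sed (-i with a backup suffix, so a .bak copy is kept). Nagios needs a restart before the change takes effect.

```shell
#!/bin/bash
# Hypothetical helper: comment out every active retained_*_mask directive so
# Nagios falls back to its built-in defaults. Keeps a .bak copy (GNU sed -i).
comment_out_masks() {
    local cfg=$1
    # Only matches uncommented lines of the form retained_..._mask=VALUE;
    # lines such as retain_state_information=1 are left untouched.
    sed -i.bak -E 's/^(retained_[a-z_]+_mask=)/#\1/' "$cfg"
}
```

Usage would be comment_out_masks /usr/local/nagios/etc/nagios.cfg, followed by a grep for retain to confirm the result matches the output above.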
avandemore

Re: Load Spikes on 7 Hour Intervals

Post by avandemore »

I edited my last post, but seemingly too late.
EDIT: Upon further checking, that setting is fine.
drcentner

Re: Load Spikes on 7 Hour Intervals

Post by drcentner »

avandemore wrote: I edited my last post, but seemingly too late.
EDIT: Upon further checking, that setting is fine.
No worries. After seeing your initial comment, I read up on those settings in the Nagios documentation, and it's probably best to leave them commented out unless you actually recommend enabling them.
avandemore

Re: Load Spikes on 7 Hour Intervals

Post by avandemore »

I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being a bug related to it, but I can't find it at the moment.
drcentner

Re: Load Spikes on 7 Hour Intervals

Post by drcentner »

rkennedy wrote: Do you have a local check running for CPU on the localhost machine? If so, please apply an event handler with this as the contents; it will produce a log file showing the processes with the highest CPU usage:
Code:
#!/bin/bash
# Append a timestamp to the log
date=$(date)
echo -e "$date" >> /tmp/checktopcpu.txt
# %CPU and full command line, sorted by CPU usage (head keeps the header plus 9 rows)
ps -eo pcpu,args --sort=-%cpu | head >> /tmp/checktopcpu.txt
# Blank lines to separate samples
echo -e "\n" >> /tmp/checktopcpu.txt
I have attached the resulting checktopcpu.txt file to this post (after redacting confidential information). The process most often using the most CPU is MySQL.
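To put a number on "MySQL is most often at the top", a short sketch like the following could tally which command heads each sample in /tmp/checktopcpu.txt. It assumes the log layout produced by the event handler script quoted above (a date line, the ps header containing %CPU, then processes sorted by CPU); top_cpu_summary is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: count how often each command is the top CPU consumer
# across all samples in a checktopcpu.txt-style log.
top_cpu_summary() {
    local log=$1
    awk '
        /%CPU/ { grab = 1; next }   # ps header: the next line is the top process
        grab && NF >= 2 {
            n = split($2, parts, "/")   # strip the path, keep the binary name
            print parts[n]
            grab = 0
        }
    ' "$log" | sort | uniq -c | sort -rn
}
```

Running top_cpu_summary /tmp/checktopcpu.txt prints one line per command, most frequent first, for example a count followed by mysqld.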


avandemore wrote: I'm not sure what's going on here, then. I will need to do some more testing to figure out the retention status. I have a faint memory of there being a bug related to it, but I can't find it at the moment.
Were you able to find anything out regarding retention status? I have started the process of migrating our Selenium checks to a new server, but that will take some time due to internal procedures. No need to keep this thread open while waiting on that to be completed, but if you have any additional suggestions regarding the load spikes and config application latency prior to locking the thread, that would be great.
avandemore

Re: Load Spikes on 7 Hour Intervals

Post by avandemore »

That file gives a bit more insight. My best guess at this point is that the sheer volume of DB activity is causing the load spikes. Since you seem to have a fairly large installation, have you considered offloading the DB? Here is a document on it:

https://assets.nagios.com/downloads/nag ... Server.pdf

This would give Nagios a bit more breathing room. It will increase DB latency somewhat, but it will not generate as much load on the Nagios box. If you do choose to offload the DB, that system should have plenty of memory and fast disks, and backups of the new system also need to be accounted for. It generally works pretty well as long as you keep it on the standard port.

You could also consider setting the values in XI > Admin > Performance Settings > Database > NDOUtils to the minimum useful settings.

Enabling XI > Admin > Performance Settings > Backend Cache may also help, but be sure to understand the ramifications of turning that on. It could easily be unsuitable for your environment.

Also, how many cores does this system have? Run lscpu | grep -i socket and multiply the values (cores per socket times sockets).

The recommended value for load_threshold in /usr/local/nagios/etc/pnp/npcd.cfg is 4 times that number.
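The core count and the resulting load_threshold can be computed in one step. The sketch below is a hypothetical helper that reads lscpu output on stdin, multiplies "Core(s) per socket" by "Socket(s)", then multiplies by 4. It assumes English lscpu field names and ignores hyperthreads (Thread(s) per core), matching the "cores" wording above.

```shell
#!/bin/bash
# Hypothetical helper: print 4 x (cores per socket x sockets) from lscpu output.
recommended_load_threshold() {
    awk -F: '
        /^Core\(s\) per socket/ { cores = $2 + 0 }
        /^Socket\(s\)/          { sockets = $2 + 0 }
        END { print 4 * cores * sockets }
    '
}
```

For example, lscpu | recommended_load_threshold gives the suggested value, which can be compared with the current setting from grep '^load_threshold' /usr/local/nagios/etc/pnp/npcd.cfg.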

Also, I assume you've gone through this document in the past? https://assets.nagios.com/downloads/nag ... ios-XI.pdf
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Re: Load Spikes on 7 Hour Intervals

Post by WillemDH »

FYI, I'm seeing the same symptoms during an apply: hosts and services that are in downtime or acknowledged suddenly show up anyway. I've mentioned this several times over the last few years in previous threads, but no solution was ever found. As I've said before, IMHO this is a big issue, because people no longer trust the statuses Nagios shows. Is it a real problem, or is someone just applying a configuration?
I suggested a way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution... https://github.com/NagiosEnterprises/na ... issues/176
Nagios XI 5.8.1
https://outsideit.net
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Load Spikes on 7 Hour Intervals

Post by BanditBBS »

WillemDH wrote: FYI, I'm seeing the same symptoms during an apply: hosts and services that are in downtime or acknowledged suddenly show up anyway. I've mentioned this several times over the last few years in previous threads, but no solution was ever found. As I've said before, IMHO this is a big issue, because people no longer trust the statuses Nagios shows. Is it a real problem, or is someone just applying a configuration?
I suggested a way to make the applies go faster with a forked Nagios process, but it needs to be a Nagios Core solution... https://github.com/NagiosEnterprises/na ... issues/176
+1
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github