Performance data stops collecting.

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
lee.krause
Posts: 86
Joined: Wed Jan 20, 2016 8:38 am

Performance data stops collecting.

Post by lee.krause »

I've noticed that the performance data stops intermittently for several hours.
As you can see from the graph, the service stops collecting/showing data in these gaps. It doesn't happen at a specific time, and it is happening across the board, not just on one server.
[Attachment: loaddata.png]
-Linux Distribution and version? Red Hat Enterprise Linux Server release 6.8 (Santiago)
-32 or 64bit? 64bit
-VMware Image or Manual Install of XI? Manual install
-Nagios Version: XI 5.2.9


Thanks
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Performance data stops collecting.

Post by ssax »

Please follow this KB article; it should show you which issue you're hitting:

https://support.nagios.com/kb/article.php?id=9

Let us know the results.

Thank you
lee.krause
Posts: 86
Joined: Wed Jan 20, 2016 8:38 am

Re: Performance data stops collecting.

Post by lee.krause »

# ls /usr/local/nagios/var/spool/perfdata/ | wc -l
5
# ls /usr/local/nagios/var/spool/xidpe/ | wc -l
2

Looks like nothing excessive.

I did see this in perfdata.log:
# tail -f /usr/local/nagios/var/perfdata.log
2016-07-14 10:14:24 [7997] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509241.perfdata.host-PID-7997 deleted
2016-07-14 10:14:24 [7997] [0] *** Timeout while processing Host: "xxxxxxxxxxx" Service: "_HOST_"
2016-07-14 10:14:24 [7997] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:14:24 [7998] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509315.perfdata.service-PID-8852 deleted
2016-07-14 10:15:38 [8852] [0] *** Timeout while processing Host: "xxxxxxxxxxxxx" Service: "Ping"
2016-07-14 10:15:38 [8852] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509330.perfdata.service-PID-9049 deleted
2016-07-14 10:15:58 [9049] [0] *** Timeout while processing Host: "xxxxxxxxxxxxxxx" Service: "Uptime"
2016-07-14 10:15:58 [9049] [0] *** process_perfdata.pl terminated on signal ALRM
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Performance data stops collecting.

Post by ssax »

Did you increase the timeout in /usr/local/nagios/etc/pnp/process_perfdata.cfg?

Code:

TIMEOUT = 20

Were you seeing any load warnings in your /usr/local/nagios/var/npcd.log?

Thank you
lee.krause
Posts: 86
Joined: Wed Jan 20, 2016 8:38 am

Re: Performance data stops collecting.

Post by lee.krause »

I did change the timeout to 20.

From the npcd.log:
# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510638.perfdata.service'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Performance data stops collecting.

Post by lmiltchev »

Did you restart npcd after changing the timeout?

Code:

service npcd restart

Are you still seeing the "Executed command exits with return code '7'" errors in the log AFTER restarting npcd?

What is the load on the system?

Code:

uptime

Be sure to check out our Knowledgebase for helpful articles and solutions!
lee.krause
Posts: 86
Joined: Wed Jan 20, 2016 8:38 am

Re: Performance data stops collecting.

Post by lee.krause »

# uptime
13:06:55 up 7 days, 22:33, 1 user, load average: 2.06, 2.02, 1.97

Looks like the messages have stopped. I restarted at 13:06 (3 minutes ago) and nothing in the log since.

# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
[07-14-2016 13:06:00] NPCD: Caught Termination Signal - Hasta la vista... baby
[07-14-2016 13:06:00] NPCD: npcd Daemon (0.4.14) started with PID=29695
[07-14-2016 13:06:00] NPCD: Please have a look at 'npcd -V' to get license information
[07-14-2016 13:06:00] NPCD: HINT: load_threshold is enabled - ('10.000000')
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Performance data stops collecting.

Post by lmiltchev »

lee.krause wrote:
Looks like the messages have stopped. I restarted at 13:06 (3 minutes ago) and nothing in the log since.

Keep an eye on the performance data processing, and let us know if it stops again. Your load is not too high, at least not at the moment. Keep an eye on that too. If the load goes above the "load_threshold" value (as defined in /usr/local/nagios/etc/pnp/npcd.cfg), npcd will stop processing performance data.
We will keep the thread open for a while in case you have more questions/issues.
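
For anyone who wants to check this by hand, here is a rough sketch that reads load_threshold from npcd.cfg and compares it with the current 1-minute load average from /proc/loadavg. The helper names (read_load_threshold, load_exceeds) and the NPCD_CFG variable are made up for illustration, and the sketch assumes the 1-minute average is the figure that matters.

```shell
#!/bin/sh
# Path taken from this thread; override NPCD_CFG if your install differs.
NPCD_CFG="${NPCD_CFG:-/usr/local/nagios/etc/pnp/npcd.cfg}"

# Print the load_threshold value from an npcd.cfg-style file ($1).
read_load_threshold() {
    awk -F= '/^[[:space:]]*load_threshold[[:space:]]*=/ {
        gsub(/[[:space:]]/, "", $2); print $2; exit
    }' "$1"
}

# Exit 0 when the load ($1) is above the threshold ($2), non-zero otherwise.
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Only run the live check when both files are readable on this machine.
if [ -r "$NPCD_CFG" ] && [ -r /proc/loadavg ]; then
    threshold=$(read_load_threshold "$NPCD_CFG")
    read -r load1 rest < /proc/loadavg
    if [ -z "$threshold" ]; then
        echo "no load_threshold found in $NPCD_CFG"
    elif load_exceeds "$load1" "$threshold"; then
        echo "1-min load $load1 is above load_threshold $threshold"
    else
        echo "1-min load $load1 is within load_threshold $threshold"
    fi
fi
```

Running it from cron every few minutes and logging the output would be one way to see whether the gaps line up with load spikes.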
lee.krause
Posts: 86
Joined: Wed Jan 20, 2016 8:38 am

Re: Performance data stops collecting.

Post by lee.krause »

After looking over the log from the weekend, it looks like the max load threshold was reached for some reason:
[07-18-2016 01:04:23] NPCD: WARN: MAX load reached: load 10.460000/10.000000 at i=7
[07-18-2016 01:04:38] NPCD: WARN: MAX load reached: load 10.010000/10.000000 at i=7

After that, there were several of these messages:
[07-18-2016 01:37:36] NPCD: ERROR: Executed command exits with return code '7'
[07-18-2016 01:37:36] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468823794.perfdata.service'

Then the load drops back to acceptable levels and the errors stop.
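
A quick way to line these events up is sketched below: it counts the "MAX load reached" warnings and the return code '7' errors in npcd.log and groups the warnings by hour. The function names and the NPCD_LOG variable are illustrative only, and the timestamp parsing assumes the "[MM-DD-YYYY HH:MM:SS]" prefix shown in the log lines above.

```shell
#!/bin/sh
# Path taken from this thread; override NPCD_LOG if yours differs.
NPCD_LOG="${NPCD_LOG:-/usr/local/nagios/var/npcd.log}"

# Count lines in file $2 that contain the fixed string $1.
count_lines() {
    grep -cF -- "$1" "$2" || true
}

# Print "count date hour:00" for each hour that has MAX load warnings.
max_load_by_hour() {
    grep -F -- 'MAX load reached' "$1" |
        awk '{ print substr($1, 2) " " substr($2, 1, 2) ":00" }' |
        sort | uniq -c
}

# Only run the live summary when the log is readable on this machine.
if [ -r "$NPCD_LOG" ]; then
    echo "MAX load warnings: $(count_lines 'MAX load reached' "$NPCD_LOG")"
    echo "return code 7 errors: $(count_lines "return code '7'" "$NPCD_LOG")"
    max_load_by_hour "$NPCD_LOG"
fi
```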

I'm not sure what happened at 1:04 system time. We are looking into it now.

Should I up the threshold?

What should the "/etc/audit/audit.rules" be set to?
Current:
# Increase kernel buffer size
-b 16384

We are getting a lot of these messages in /var/log/messages:
Jul 18 06:00:48 REDSA0ELPV016 kernel: audit: audit_backlog=16385 > audit_backlog_limit=16384
Jul 18 06:00:49 REDSA0ELPV016 kernel: audit: audit_lost=8854080 audit_rate_limit=0 audit_backlog_limit=16384
Jul 18 06:00:50 REDSA0ELPV016 kernel: audit: backlog limit exceeded
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Performance data stops collecting.

Post by lmiltchev »

If you have the resources, you should raise the threshold. What is the output of the following command?

Code:

lscpu

The "load_threshold = 10.0" setting is for a single-CPU machine. With a dual core, you could double it:

Code:

load_threshold = 20.0

With a quad core, you could use:

Code:

load_threshold = 40.0

You will need to restart npcd for the changes to take effect.

Code:

service npcd restart
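
The scaling above can be sketched as a small script, assuming a flat 10.0 per core as suggested in this thread. suggest_load_threshold is a hypothetical helper, not part of npcd; the script only prints a candidate line and does not edit npcd.cfg.

```shell
#!/bin/sh
# Print a load_threshold value for $1 CPU cores, using the
# "10.0 per core" rule of thumb from this thread (an assumption,
# not an official formula).
suggest_load_threshold() {
    awk -v cores="$1" 'BEGIN { printf "%.1f\n", cores * 10.0 }'
}

# Number of online processors on this machine.
cores=$(getconf _NPROCESSORS_ONLN)
echo "load_threshold = $(suggest_load_threshold "$cores")"
```

Compare the printed value with what is currently in /usr/local/nagios/etc/pnp/npcd.cfg before changing anything, and remember to restart npcd afterwards.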