Performance data stops collecting.

Posted: Thu Jul 14, 2016 12:58 am
by lee.krause
I've noticed that the performance data stops intermittently for several hours.
As you can see from the graph, the service stops collecting/showing data in these gaps. It doesn't happen at a specific time, and it is happening across the board, not just on one server.
[Attachment: loaddata.png]
-Linux Distribution and version? Red Hat Enterprise Linux Server release 6.8 (Santiago)
-32 or 64bit? 64bit
-VMware Image or Manual Install of XI? Manual install
-Nagios Version: XI 5.2.9


Thanks

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 9:09 am
by ssax
Please follow this KB article, it should show you what issue you're hitting:

https://support.nagios.com/kb/article.php?id=9

Let us know the results.

Thank you

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 10:17 am
by lee.krause
# ls /usr/local/nagios/var/spool/perfdata/ | wc -l
5
# ls /usr/local/nagios/var/spool/xidpe/ | wc -l
2

Looks like nothing excessive.

I did see this in the perfdata.log
# tail -f /usr/local/nagios/var/perfdata.log
2016-07-14 10:14:24 [7997] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509241.perfdata.host-PID-7997 deleted
2016-07-14 10:14:24 [7997] [0] *** Timeout while processing Host: "xxxxxxxxxxx" Service: "_HOST_"
2016-07-14 10:14:24 [7997] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:14:24 [7998] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509315.perfdata.service-PID-8852 deleted
2016-07-14 10:15:38 [8852] [0] *** Timeout while processing Host: "xxxxxxxxxxxxx" Service: "Ping"
2016-07-14 10:15:38 [8852] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509330.perfdata.service-PID-9049 deleted
2016-07-14 10:15:58 [9049] [0] *** Timeout while processing Host: "xxxxxxxxxxxxxxx" Service: "Uptime"
2016-07-14 10:15:58 [9049] [0] *** process_perfdata.pl terminated on signal ALRM
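
To see whether the timeouts cluster on particular hosts or services, the log can be summarized. This is only a rough sketch: the function name count_perfdata_timeouts is made up for illustration, and it assumes the log lines look exactly like the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count "Timeout while processing" entries per
# host/service pair. Reads perfdata.log text on stdin and assumes the
# 'Host: "..." Service: "..."' layout shown in the excerpt above.
count_perfdata_timeouts() {
    awk -F'"' '/Timeout while processing Host:/ { n[$2 " / " $4]++ }
               END { for (k in n) print n[k], k }' | sort -rn
}

# Example use (path taken from the post above):
#   count_perfdata_timeouts < /usr/local/nagios/var/perfdata.log
```

Each output line is a count followed by "host / service", highest count first.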

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 11:42 am
by ssax
Did you increase the timeout in /usr/local/nagios/etc/pnp/process_perfdata.cfg?

TIMEOUT = 20

Were you seeing any load warnings in your /usr/local/nagios/var/npcd.log?

Thank you
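
A quick way to confirm which timeout is actually set is to read it back from the file. The sketch below assumes the setting is written as "TIMEOUT = <seconds>" on its own line, as in the snippet above; the function name get_pnp_timeout is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the active TIMEOUT value from a PNP config
# file given as $1. Commented-out lines are ignored; if the setting
# appears more than once, the last occurrence is printed.
get_pnp_timeout() {
    awk -F'=' '/^[[:space:]]*TIMEOUT[[:space:]]*=/ {
        gsub(/[[:space:]]/, "", $2); value = $2
    } END { if (value != "") print value }' "$1"
}

# Example use (path taken from the post above):
#   get_pnp_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg
```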

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 12:11 pm
by lee.krause
I did change the timeout to 20.

From the npcd.log:
# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510638.perfdata.service'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 12:21 pm
by lmiltchev
Did you restart npcd after changing the timeout?

service npcd restart

Are you still seeing the "Executed command exits with return code '7'" errors in the log AFTER restarting npcd?

What is the load on the system?

uptime
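
If it helps, the 1-minute load can also be compared against a threshold in a script. This is a minimal sketch, assuming a Linux system with /proc/loadavg; the function name load_exceeds_threshold is made up here, and the threshold has to be supplied by hand (for example the load_threshold value from npcd.cfg).

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if load $1 is strictly greater
# than threshold $2. awk is used because sh cannot compare decimals.
load_exceeds_threshold() {
    awk -v load="$1" -v max="$2" 'BEGIN { exit !(load + 0 > max + 0) }'
}

# Example use on Linux (first field of /proc/loadavg is the 1-minute load):
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   load_exceeds_threshold "$load" 10.0 && echo "load $load is above 10.0"
```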

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 1:10 pm
by lee.krause
# uptime
13:06:55 up 7 days, 22:33, 1 user, load average: 2.06, 2.02, 1.97

Looks like the messages have stopped. I restarted at 13:06(3 minutes ago) and nothing in the log since.

# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
[07-14-2016 13:06:00] NPCD: Caught Termination Signal - Hasta la vista... baby
[07-14-2016 13:06:00] NPCD: npcd Daemon (0.4.14) started with PID=29695
[07-14-2016 13:06:00] NPCD: Please have a look at 'npcd -V' to get license information
[07-14-2016 13:06:00] NPCD: HINT: load_threshold is enabled - ('10.000000')

Re: Performance data stops collecting.

Posted: Thu Jul 14, 2016 1:16 pm
by lmiltchev
lee.krause wrote: "Looks like the messages have stopped. I restarted at 13:06(3 minutes ago) and nothing in the log since."
Keep an eye on the performance data processing, and let us know if it stops again. Your load is not too high, at least not at the moment, but keep an eye on that too. If the load goes above the "load_threshold" value (as defined in "/usr/local/nagios/etc/pnp/npcd.cfg"), npcd will stop processing performance data until the load drops back below it.
We will keep the thread open for a while in case you have more questions/issues.

Re: Performance data stops collecting.

Posted: Mon Jul 18, 2016 10:33 am
by lee.krause
After looking over the log from the weekend, it looks like the max load was reached for some reason:
[07-18-2016 01:04:23] NPCD: WARN: MAX load reached: load 10.460000/10.000000 at i=7
[07-18-2016 01:04:38] NPCD: WARN: MAX load reached: load 10.010000/10.000000 at i=7

After that, there were several of these messages:
[07-18-2016 01:37:36] NPCD: ERROR: Executed command exits with return code '7'
[07-18-2016 01:37:36] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468823794.perfdata.service'

Then the load drops back to acceptable levels and the errors stop.
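
To line these events up with whatever else runs on the box at that time, the "MAX load reached" warnings can be pulled out of npcd.log with their timestamps. A rough sketch, assuming the warning format shown above; the function name list_max_load_events is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each "MAX load reached" warning on stdin,
# print "<date> <time> <load> <threshold>". Other lines are dropped.
list_max_load_events() {
    sed -n 's/^\[\([^]]*\)\] NPCD: WARN: MAX load reached: load \([0-9.]*\)\/\([0-9.]*\).*/\1 \2 \3/p'
}

# Example use (path taken from earlier in the thread):
#   list_max_load_events < /usr/local/nagios/var/npcd.log
```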

I'm not sure what happened at 1:04 system time. We are looking into it now.

Should I up the threshold?

What should the "/etc/audit/audit.rules" be set to?
Current:
# Increase kernel buffer size
-b 16384

We are getting a lot of these messages in /var/log/messages:
Jul 18 06:00:48 REDSA0ELPV016 kernel: audit: audit_backlog=16385 > audit_backlog_limit=16384
Jul 18 06:00:49 REDSA0ELPV016 kernel: audit: audit_lost=8854080 audit_rate_limit=0 audit_backlog_limit=16384
Jul 18 06:00:50 REDSA0ELPV016 kernel: audit: backlog limit exceeded

Re: Performance data stops collecting.

Posted: Mon Jul 18, 2016 10:51 am
by lmiltchev
If you have the resources, you should. What is the output of the following command?

lscpu

The "load_threshold = 10.0" is for a single CPU machine. You could double it with a dual core:

load_threshold = 20.0

With a quad core you could use:

load_threshold = 40.0

You will need to restart npcd so that the changes can take effect.

service npcd restart
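
The scaling above (10.0 per core) can be worked out for any core count. This is just a sketch of that rule of thumb, not an official formula; the function name suggest_load_threshold is made up, and when no count is given it assumes getconf _NPROCESSORS_ONLN reports the online CPUs.

```shell
#!/bin/sh
# Hypothetical helper: print a load_threshold line using the
# 10.0-per-core rule of thumb from this thread. Pass a core count as
# $1, or omit it to use the number of online CPUs.
suggest_load_threshold() {
    cores=${1:-$(getconf _NPROCESSORS_ONLN)}
    awk -v n="$cores" 'BEGIN { printf "load_threshold = %.1f\n", 10 * n }'
}

# Example use:
#   suggest_load_threshold      # based on this machine's CPU count
#   suggest_load_threshold 4    # explicit core count
```

The printed line still has to be put into /usr/local/nagios/etc/pnp/npcd.cfg by hand, followed by the npcd restart.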