Performance data stops collecting.

lee.krause · Post by **lee.krause** » Thu Jul 14, 2016 12:58 am

I've noticed that performance data collection stops intermittently for several hours at a time.
As you can see from the graph, the service stops collecting/showing data in these gaps. It doesn't happen at a specific time, and it is happening across the board, not just on one server.

[Attachment: loaddata.png]

-Linux Distribution and version? Red Hat Enterprise Linux Server release 6.8 (Santiago)
-32 or 64bit? 64bit
-VMware Image or Manual Install of XI? Manual install
-Nagios Version: XI 5.2.9

Thanks

ssax · Post by **ssax** » Thu Jul 14, 2016 9:09 am

Please follow this KB article; it should show you which issue you're hitting:

https://support.nagios.com/kb/article.php?id=9

Let us know the results.

Thank you

lee.krause · Post by **lee.krause** » Thu Jul 14, 2016 10:17 am

# ls /usr/local/nagios/var/spool/perfdata/ | wc -l
5
# ls /usr/local/nagios/var/spool/xidpe/ | wc -l
2

Looks like nothing excessive.
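
For repeated checks, the two `ls ... | wc -l` commands above could be wrapped in a small helper. This is only a sketch: the function name `count_spool_files` is made up for illustration and is not part of Nagios XI, and it assumes a `find` that supports `-maxdepth` (as on RHEL).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nagios XI).
# Prints the number of regular files directly inside the given directory.
count_spool_files() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing"
        return 1
    fi
    # -maxdepth 1 keeps the count to the spool directory itself.
    find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Usage with the two paths from this thread:
#   count_spool_files /usr/local/nagios/var/spool/perfdata
#   count_spool_files /usr/local/nagios/var/spool/xidpe
```

Unlike plain `ls | wc -l`, this counts only regular files, so subdirectories do not inflate the number.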

I did see this in perfdata.log:
# tail -f /usr/local/nagios/var/perfdata.log
2016-07-14 10:14:24 [7997] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509241.perfdata.host-PID-7997 deleted
2016-07-14 10:14:24 [7997] [0] *** Timeout while processing Host: "xxxxxxxxxxx" Service: "_HOST_"
2016-07-14 10:14:24 [7997] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:14:24 [7998] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509315.perfdata.service-PID-8852 deleted
2016-07-14 10:15:38 [8852] [0] *** Timeout while processing Host: "xxxxxxxxxxxxx" Service: "Ping"
2016-07-14 10:15:38 [8852] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509330.perfdata.service-PID-9049 deleted
2016-07-14 10:15:58 [9049] [0] *** Timeout while processing Host: "xxxxxxxxxxxxxxx" Service: "Uptime"
2016-07-14 10:15:58 [9049] [0] *** process_perfdata.pl terminated on signal ALRM
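
To see whether the timeouts cluster on particular services, the "Timeout while processing" lines in a log like the one above can be tallied. The following is a rough sketch that assumes the line format shown in this excerpt; `summarize_timeouts` is a hypothetical name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: count timeouts per service in perfdata.log.
# Assumes lines of the form:
#   ... *** Timeout while processing Host: "name" Service: "name"
summarize_timeouts() {
    # Splitting on double quotes puts the service name in field 4.
    awk -F'"' '/Timeout while processing Host:/ { count[$4]++ }
        END { for (s in count) print count[s], s }' "$1" |
        sort -k1,1nr -k2
}

# Usage:
#   summarize_timeouts /usr/local/nagios/var/perfdata.log
```

Each output line is a count followed by the service name, with the most frequent service first.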

ssax · Post by **ssax** » Thu Jul 14, 2016 11:42 am

Did you increase the timeout in /usr/local/nagios/etc/pnp/process_perfdata.cfg?

Code:

TIMEOUT = 20
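
To confirm which value is actually in effect in the file, the setting can be read back. This is a minimal sketch assuming simple "KEY = value" lines as shown above; `get_cfg_value` is a hypothetical helper and does not handle values that themselves contain "=".

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a "KEY = value" config file.
# Comment lines starting with '#' are skipped; the last matching line wins.
get_cfg_value() {
    awk -F'=' -v key="$1" '
        /^[ \t]*#/ { next }
        {
            k = $1; gsub(/[ \t]/, "", k)
            if (k == key) { v = $2; gsub(/[ \t]/, "", v); val = v }
        }
        END { if (val != "") print val }' "$2"
}

# Usage:
#   get_cfg_value TIMEOUT /usr/local/nagios/etc/pnp/process_perfdata.cfg
```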

Were you seeing any load warnings in your /usr/local/nagios/var/npcd.log?

Thank you

lee.krause · Post by **lee.krause** » Thu Jul 14, 2016 12:11 pm

I did change the timeout to 20.

From the npcd.log:
# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510638.perfdata.service'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
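
When the log is long, counting these errors by return code makes it easier to compare before and after a change. The sketch below assumes the exact message format shown in this excerpt; `count_return_codes` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: tally npcd "return code" errors by code.
# Assumes lines ending in: Executed command exits with return code 'N'
count_return_codes() {
    # Splitting on single quotes puts the code in field 2.
    awk -F"'" '/Executed command exits with return code/ { n[$2]++ }
        END { for (c in n) print "code " c ": " n[c] }' "$1" | sort
}

# Usage:
#   count_return_codes /usr/local/nagios/var/npcd.log
```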

lmiltchev · Post by **lmiltchev** » Thu Jul 14, 2016 12:21 pm

Did you restart npcd after changing the timeout?

Code:

service npcd restart

Are you still seeing the "Executed command exits with return code '7'" errors in the log AFTER restarting the npcd?

What is the load on the system?

Code:

uptime

lee.krause · Post by **lee.krause** » Thu Jul 14, 2016 1:10 pm

# uptime
13:06:55 up 7 days, 22:33, 1 user, load average: 2.06, 2.02, 1.97

Looks like the messages have stopped. I restarted at 13:06 (3 minutes ago) and there has been nothing in the log since.

# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
[07-14-2016 13:06:00] NPCD: Caught Termination Signal - Hasta la vista... baby
[07-14-2016 13:06:00] NPCD: npcd Daemon (0.4.14) started with PID=29695
[07-14-2016 13:06:00] NPCD: Please have a look at 'npcd -V' to get license information
[07-14-2016 13:06:00] NPCD: HINT: load_threshold is enabled - ('10.000000')

lmiltchev · Post by **lmiltchev** » Thu Jul 14, 2016 1:16 pm

lee.krause wrote: "Looks like the messages have stopped. I restarted at 13:06(3 minutes ago) and nothing in the log since."

Keep an eye on the performance data processing, and let us know if it stops again. Your load is not too high, at least not at the moment. Keep an eye on that too. If the load goes above the "load_threshold" value (as defined in "/usr/local/nagios/etc/pnp/npcd.cfg"), npcd will stop processing performance data.
We will keep the thread open for a while in case you have more questions/issues.
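
One way to keep an eye on this is to compare the current load against the configured threshold. The sketch below is illustrative only: `load_status` is a hypothetical helper, and it assumes npcd compares against the 1-minute load average, which this thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: compare a load value with a threshold.
# Prints "over" when load > threshold, otherwise "ok".
load_status() {
    awk -v l="$1" -v t="$2" 'BEGIN { print ((l + 0 > t + 0) ? "over" : "ok") }'
}

# Usage on Linux, reading the 1-minute load average (an assumption, see above):
#   read -r load1 _ < /proc/loadavg
#   load_status "$load1" 10.0
```

awk is used for the comparison because plain `[ ... ]` tests in sh only handle integers, while load averages are decimals.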

lee.krause · Post by **lee.krause** » Mon Jul 18, 2016 10:33 am

After looking over the log from the weekend, it looks like the MAX load was reached for some reason:
[07-18-2016 01:04:23] NPCD: WARN: MAX load reached: load 10.460000/10.000000 at i=7
[07-18-2016 01:04:38] NPCD: WARN: MAX load reached: load 10.010000/10.000000 at i=7

After that, there were several of these messages:
[07-18-2016 01:37:36] NPCD: ERROR: Executed command exits with return code '7'
[07-18-2016 01:37:36] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468823794.perfdata.service'

Then the load drops back to acceptable levels and the errors stop.

I'm not sure what happened at 1:04 system time. We are looking into it now.
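
To line the load spikes up against other events on the system, the "MAX load reached" warnings can be reduced to a timestamp, load, and threshold per line. This is a sketch that assumes the warning format shown above; `list_load_peaks` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list npcd "MAX load reached" warnings in compact form.
# Assumes lines of the form:
#   [MM-DD-YYYY HH:MM:SS] NPCD: WARN: MAX load reached: load X/Y at i=N
list_load_peaks() {
    # '|' is used as the sed delimiter because the pattern contains '/'.
    sed -n 's|^\[\([^]]*\)\] NPCD: WARN: MAX load reached: load \([0-9.]*\)/\([0-9.]*\).*|\1 load=\2 threshold=\3|p' "$1"
}

# Usage:
#   list_load_peaks /usr/local/nagios/var/npcd.log
```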

Should I up the threshold?

What should the "/etc/audit/audit.rules" be set to?
Current:
# Increase kernel buffer size
-b 16384

We are getting a lot of these messages in /var/log/messages:
Jul 18 06:00:48 REDSA0ELPV016 kernel: audit: audit_backlog=16385 > audit_backlog_limit=16384
Jul 18 06:00:49 REDSA0ELPV016 kernel: audit: audit_lost=8854080 audit_rate_limit=0 audit_backlog_limit=16384
Jul 18 06:00:50 REDSA0ELPV016 kernel: audit: backlog limit exceeded

lmiltchev · Post by **lmiltchev** » Mon Jul 18, 2016 10:51 am

If you have the resources, you should raise the threshold. What is the output of the following command?

Code:

lscpu

The "load_threshold = 10.0" setting is for a single-CPU machine. With a dual-core machine you could double it:

Code:

load_threshold = 20.0

With quad core you could use:

Code:

load_threshold = 40.0

You will need to restart npcd for the change to take effect.

Code:

service npcd restart
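
The rule of thumb above (10.0 per core) can be turned into a one-line calculation. This is a sketch of that arithmetic only; `suggest_load_threshold` is a hypothetical helper, and the right value for a given system may differ.

```shell
#!/bin/sh
# Hypothetical helper: apply the "10.0 per CPU core" rule of thumb from this thread.
# $1 = number of CPU cores; prints the suggested value with one decimal place.
suggest_load_threshold() {
    awk -v cores="$1" 'BEGIN { printf "%.1f\n", 10.0 * cores }'
}

# Usage on Linux (nproc is part of GNU coreutils):
#   echo "load_threshold = $(suggest_load_threshold "$(nproc)")"
```

For 1, 2, and 4 cores this prints 10.0, 20.0, and 40.0, matching the values given above.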
