Performance data stops collecting.
-
lee.krause
- Posts: 86
- Joined: Wed Jan 20, 2016 8:38 am
Performance data stops collecting.
I've noticed that the performance data stops intermittently for several hours.
As you can see from the graph the service stops collecting/showing the data in these gaps. It doesn't happen at a specific time and it is happening across the board not just on one server.
-Linux Distribution and version? Red Hat Enterprise Linux Server release 6.8 (Santiago)
-32 or 64bit? 64bit
-VMware Image or Manual Install of XI? Manual install
-Nagios Version: XI 5.2.9
Thanks
Re: Performance data stops collecting.
Please follow this KB article. It should show you which issue you're hitting:
https://support.nagios.com/kb/article.php?id=9
Let us know the results.
Thank you
-
lee.krause
- Posts: 86
- Joined: Wed Jan 20, 2016 8:38 am
Re: Performance data stops collecting.
# ls /usr/local/nagios/var/spool/perfdata/ | wc -l
5
# ls /usr/local/nagios/var/spool/xidpe/ | wc -l
2
Looks like nothing excessive.
I did see this in the perfdata.log
# tail -f /usr/local/nagios/var/perfdata.log
2016-07-14 10:14:24 [7997] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509241.perfdata.host-PID-7997 deleted
2016-07-14 10:14:24 [7997] [0] *** Timeout while processing Host: "xxxxxxxxxxx" Service: "_HOST_"
2016-07-14 10:14:24 [7997] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:14:24 [7998] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:38 [8852] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509315.perfdata.service-PID-8852 deleted
2016-07-14 10:15:38 [8852] [0] *** Timeout while processing Host: "xxxxxxxxxxxxx" Service: "Ping"
2016-07-14 10:15:38 [8852] [0] *** process_perfdata.pl terminated on signal ALRM
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Timeout after 5 secs. ***
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: Please check your npcd.cfg
2016-07-14 10:15:58 [9049] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1468509330.perfdata.service-PID-9049 deleted
2016-07-14 10:15:58 [9049] [0] *** Timeout while processing Host: "xxxxxxxxxxxxxxx" Service: "Uptime"
2016-07-14 10:15:58 [9049] [0] *** process_perfdata.pl terminated on signal ALRM
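To see whether the timeouts hit a few checks or everything, the "Timeout while processing" lines can be tallied per service. This is only a rough sketch that assumes the perfdata.log line format quoted above; the function name count_perfdata_timeouts is made up for illustration and is not part of PNP4Nagios.

```shell
# Tally "Timeout while processing" entries per service in a perfdata.log.
# Assumes the line format quoted in this post; the helper name is hypothetical.
count_perfdata_timeouts() {
    log_file="${1:-/usr/local/nagios/var/perfdata.log}"
    # Splitting on double quotes puts the host in field 2 and the service in field 4.
    awk -F'"' '/Timeout while processing/ { n[$4]++ }
               END { for (s in n) print n[s], s }' "$log_file" \
        | sort -k1,1nr -k2
}
```

Running it with no argument reads the default log path from this thread; pass another path to inspect a rotated or copied log.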
Re: Performance data stops collecting.
Did you increase the timeout in /usr/local/nagios/etc/pnp/process_perfdata.cfg?
TIMEOUT = 20
Were you seeing any load warnings in your /usr/local/nagios/var/npcd.log?
Thank you
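A quick way to confirm which value is actually active in the file is to print the uncommented setting. The sketch below assumes the file uses plain "KEY = value" lines with "#" comments; get_pnp_setting is a hypothetical helper name, not something shipped with Nagios XI.

```shell
# Print the value of an uncommented "KEY = value" line from a PNP config file.
# The helper name and the assumed file layout are illustrative only.
get_pnp_setting() {
    key="$1"
    cfg_file="${2:-/usr/local/nagios/etc/pnp/process_perfdata.cfg}"
    awk -F'=' -v key="$key" '
        /^[[:space:]]*#/ { next }    # ignore commented-out lines
        {
            name = $1
            gsub(/[[:space:]]/, "", name)
            if (name == key) {
                value = $2
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", value)
                print value
            }
        }' "$cfg_file"
}
```

For example, `get_pnp_setting TIMEOUT` should print 20 once the change described above has been saved.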
-
lee.krause
- Posts: 86
- Joined: Wed Jan 20, 2016 8:38 am
Re: Performance data stops collecting.
I did change the timeout to 20.
From the npcd.log:
# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510638.perfdata.service'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
Re: Performance data stops collecting.
Did you restart npcd after changing the timeout?
service npcd restart
Are you still seeing the "Executed command exits with return code '7'" errors in the log AFTER restarting the npcd?
What is the load on the system?
uptime
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
lee.krause
- Posts: 86
- Joined: Wed Jan 20, 2016 8:38 am
Re: Performance data stops collecting.
# uptime
13:06:55 up 7 days, 22:33, 1 user, load average: 2.06, 2.02, 1.97
Looks like the messages have stopped. I restarted at 13:06 (3 minutes ago) and there has been nothing in the log since.
# tail -f /usr/local/nagios/var/npcd.log
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510639.perfdata.host'
[07-14-2016 10:37:34] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:34] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510620.perfdata.service'
[07-14-2016 10:37:55] NPCD: ERROR: Executed command exits with return code '7'
[07-14-2016 10:37:55] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468510663.perfdata.service'
[07-14-2016 13:06:00] NPCD: Caught Termination Signal - Hasta la vista... baby
[07-14-2016 13:06:00] NPCD: npcd Daemon (0.4.14) started with PID=29695
[07-14-2016 13:06:00] NPCD: Please have a look at 'npcd -V' to get license information
[07-14-2016 13:06:00] NPCD: HINT: load_threshold is enabled - ('10.000000')
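Rather than eyeballing the tail output, the errors logged since the most recent daemon start can be counted. This is a minimal sketch that assumes the npcd.log wording shown above ("npcd Daemon ... started with PID=" and "Executed command exits with return code"); the function name is invented for this example.

```shell
# Count NPCD "return code" errors logged after the most recent daemon start.
# Relies on the log wording quoted in this post; the helper name is hypothetical.
npcd_errors_since_restart() {
    log_file="${1:-/usr/local/nagios/var/npcd.log}"
    awk '
        /npcd Daemon .* started with PID=/ { count = 0 }    # reset on each start
        /NPCD: ERROR: Executed command exits with return code/ { count++ }
        END { print count + 0 }
    ' "$log_file"
}
```

Against the excerpt above it would print 0, since every error line precedes the 13:06:00 start.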
Re: Performance data stops collecting.
Keep an eye on the performance data processing, and let us know if it stops again. Your load is not too high, at least not at the moment, but keep an eye on that too. If the load goes above the "load_threshold" value (as defined in "/usr/local/nagios/etc/pnp/npcd.cfg"), this will cause npcd to stop.
We will keep the thread open for a while in case you have more questions/issues.
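The load comparison described above can be checked by hand with a small helper. This is a sketch under assumptions: it reads the 1-minute load from /proc/loadavg (Linux only), takes the threshold as an argument rather than parsing npcd.cfg, and uses a made-up function name. It may not match exactly how npcd samples the load internally.

```shell
# Exit 0 when the load is above the given threshold, non-zero otherwise.
# With no second argument, the 1-minute load is read from /proc/loadavg.
# Hypothetical helper; npcd's own check may differ in detail.
load_exceeds_threshold() {
    threshold="$1"
    load="${2:-$(cut -d' ' -f1 /proc/loadavg)}"
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l > t) }'
}
```

For instance, `load_exceeds_threshold 10.0 && echo "load above threshold"` stays silent on a box with a load around 2.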
-
lee.krause
- Posts: 86
- Joined: Wed Jan 20, 2016 8:38 am
Re: Performance data stops collecting.
After looking over the log from the weekend, it looks like the MAX load was reached for some reason:
[07-18-2016 01:04:23] NPCD: WARN: MAX load reached: load 10.460000/10.000000 at i=7
[07-18-2016 01:04:38] NPCD: WARN: MAX load reached: load 10.010000/10.000000 at i=7
After that several of these messages:
[07-18-2016 01:37:36] NPCD: ERROR: Executed command exits with return code '7'
[07-18-2016 01:37:36] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1468823794.perfdata.service'
Then the load drops back to acceptable levels and the errors stop.
I'm not sure what happened at 1:04 system time. We are looking into it now.
Should I up the threshold?
What should the "/etc/audit/audit.rules" be set to?
Current:
# Increase kernel buffer size
-b 16384
We are getting a lot of these messages in /var/log/messages:
Jul 18 06:00:48 REDSA0ELPV016 kernel: audit: audit_backlog=16385 > audit_backlog_limit=16384
Jul 18 06:00:49 REDSA0ELPV016 kernel: audit: audit_lost=8854080 audit_rate_limit=0 audit_backlog_limit=16384
Jul 18 06:00:50 REDSA0ELPV016 kernel: audit: backlog limit exceeded
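To line up the load spikes with whatever else ran at that time, the "MAX load reached" warnings quoted earlier in this post can be reduced to a timestamp and load value per line. This is a rough sketch that assumes the npcd.log format shown here; list_max_load_events is a hypothetical name.

```shell
# Print "timestamp load" for each "MAX load reached" warning in npcd.log.
# Assumes the log line layout quoted in this post; the helper name is made up.
list_max_load_events() {
    log_file="${1:-/usr/local/nagios/var/npcd.log}"
    sed -n 's|^\[\([^]]*\)] NPCD: WARN: MAX load reached: load \([0-9.]*\)/.*|\1 \2|p' "$log_file"
}
```

The resulting timestamps can then be compared against cron jobs, backups, or other scheduled work on the server.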
Re: Performance data stops collecting.
If you have the resources, you should. What is the output of the following command?
lscpu
The "load_threshold = 10.0" is for a single CPU machine. You could double it with a dual core:
load_threshold = 20.0
With quad core you could use:
load_threshold = 40.0
You will need to restart npcd so that the changes can take effect.
service npcd restart
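The rule of thumb above (10.0 per core) can be turned into a one-liner that prints a candidate setting. This is just a sketch: it assumes the linear scaling suggested in this reply holds for larger core counts, reads the online CPU count via getconf, and uses an invented function name. Treat the output as a starting point, not a recommended value.

```shell
# Print a candidate load_threshold line using 10.0 per CPU core.
# The linear scaling is an extrapolation of the advice in this reply;
# the helper name is hypothetical.
suggest_load_threshold() {
    cores="${1:-$(getconf _NPROCESSORS_ONLN)}"
    awk -v c="$cores" 'BEGIN { printf "load_threshold = %.1f\n", 10.0 * c }'
}
```

Called without arguments it uses the local CPU count; `suggest_load_threshold 4` reproduces the quad core value given above.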