This is an important production issue. The performance data graphs are empty even though performance data is being collected and populated. I have tried the suggestions in every forum post I could find regarding this kind of issue and have had no luck. Please help.
[root@nagiosxi ~]# tail -25 /usr/local/nagios/var/perfdata.log
2015-09-11 21:52:03 [1789] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2015-09-11 21:52:03 [1788] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/service-perfdata.1442029862-PID-1788 deleted
2015-09-11 21:52:03 [1789] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/service-perfdata.1442029860-PID-1789 deleted
2015-09-11 21:52:03 [1788] [0] *** process_perfdata.pl terminated on signal ALRM
2015-09-11 21:52:03 [1789] [0] *** process_perfdata.pl terminated on signal ALRM
2015-09-11 21:52:03 [1785] [0] *** TIMEOUT: Timeout after 40 Sec. ****
2015-09-11 21:52:03 [1785] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-09-11 21:52:03 [1785] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2015-09-11 21:52:03 [1785] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/host-perfdata.1442029860-PID-1785 deleted
2015-09-11 21:52:03 [1785] [0] *** process_perfdata.pl terminated on signal ALRM
2015-09-11 21:52:03 [1786] [0] *** TIMEOUT: Timeout after 40 Sec. ****
2015-09-11 21:52:03 [1786] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-09-11 21:52:03 [1786] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2015-09-11 21:52:03 [1786] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/host-perfdata.1442029862-PID-1786 deleted
2015-09-11 21:52:03 [1786] [0] *** process_perfdata.pl terminated on signal ALRM
2015-09-11 21:55:18 [16841] [0] *** TIMEOUT: Timeout after 40 Sec. ****
2015-09-11 21:55:18 [16841] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-09-11 21:55:18 [16841] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2015-09-11 21:55:18 [16841] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/service-perfdata.1442030059-PID-16841 deleted
2015-09-11 21:55:18 [16841] [0] *** process_perfdata.pl terminated on signal ALRM
2015-09-11 21:58:05 [21682] [0] *** TIMEOUT: Timeout after 40 Sec. ****
2015-09-11 21:58:05 [21682] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-09-11 21:58:05 [21682] [0] *** TIMEOUT: Please check your process_perfdata.cfg
2015-09-11 21:58:05 [21682] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata/service-perfdata.1442030123-PID-21682 deleted
2015-09-11 21:58:05 [21682] [0] *** process_perfdata.pl terminated on signal ALRM
[root@nagiosxi ~]# tail -25 /usr/local/nagios/var/npcd.log
[09-11-2015 21:52:03] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata/host-perfdata.1442029862'
[09-11-2015 21:55:18] NPCD: ERROR: Executed command exits with return code '1'
[09-11-2015 21:55:18] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata/service-perfdata.1442030059'
[09-11-2015 21:58:05] NPCD: ERROR: Executed command exits with return code '1'
[09-11-2015 21:58:05] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata/service-perfdata.1442030123'
[09-11-2015 21:58:20] NPCD: WARN: MAX load reached: load 33.140000/20.000000 at i=0
[09-11-2015 21:58:35] NPCD: WARN: MAX load reached: load 27.330000/20.000000 at i=1
[09-11-2015 21:59:38] NPCD: WARN: MAX load reached: load 38.010000/20.000000 at i=1
[09-11-2015 21:59:53] NPCD: WARN: MAX load reached: load 30.580000/20.000000 at i=1
[09-11-2015 22:00:08] NPCD: WARN: MAX load reached: load 27.250000/20.000000 at i=1
[09-25-2015 03:14:32] NPCD: Caught Termination Signal - Hasta la vista... baby
[09-25-2015 04:05:42] NPCD: npcd Daemon (0.4.14) started with PID=23254
[09-25-2015 04:05:42] NPCD: Please have a look at 'npcd -V' to get license information
[09-25-2015 04:05:42] NPCD: HINT: load_threshold is enabled - ('20.000000')
[09-29-2015 04:56:47] NPCD: Caught Termination Signal - Hasta la vista... baby
[09-29-2015 04:56:47] NPCD: npcd Daemon (0.4.14) started with PID=31396
[09-29-2015 04:56:47] NPCD: Please have a look at 'npcd -V' to get license information
[09-29-2015 04:56:47] NPCD: HINT: load_threshold is enabled - ('40.000000')
[10-01-2015 00:02:00] NPCD: npcd Daemon (0.4.14) started with PID=1667
[10-01-2015 00:02:00] NPCD: Please have a look at 'npcd -V' to get license information
[10-01-2015 00:02:00] NPCD: HINT: load_threshold is enabled - ('40.000000')
[10-11-2015 22:04:47] NPCD: Caught Termination Signal - Hasta la vista... baby
[10-11-2015 22:56:05] NPCD: npcd Daemon (0.4.14) started with PID=12628
[10-11-2015 22:56:05] NPCD: Please have a look at 'npcd -V' to get license information
[10-11-2015 22:56:05] NPCD: HINT: load_threshold is enabled - ('40.000000')
Can you run the following on your Nagios system to see whether the performance data files are piling up in the spool, which could be the cause of the issue? Please post the output.
I believe I have seen similar errors in my log file in the past.
The timeout in my process_perfdata.cfg file needed to be longer. Yours does too.
The file to edit is /usr/local/nagios/etc/pnp/process_perfdata.cfg
At my site, I increased the timeout. It is now set to 60.
I see from your log that your timeout is set to 40.
The system is throwing away your data because it takes
longer than 40 seconds to process your files.
This may not be your complete answer; there could be more to it.
But your log is saying timeout, and shows the file being deleted before it is processed.
That's pretty clear.
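To check this on your own box, something like the following might help. It is only a sketch: the function names are made up for illustration, and it assumes the setting is written as "TIMEOUT = <seconds>" in process_perfdata.cfg and that the log lines look like the ones you posted.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Nagios or PNP.

# Print the TIMEOUT line from a process_perfdata.cfg-style file.
# Assumes the setting is written as "TIMEOUT = <seconds>".
show_timeout() {
    grep -E '^[[:space:]]*TIMEOUT[[:space:]]*=' "$1"
}

# Count how many runs hit the timeout, based on the
# "TIMEOUT: Timeout after ..." lines in perfdata.log.
count_timeouts() {
    grep -c 'TIMEOUT: Timeout after' "$1"
}
```

You would call them as show_timeout /usr/local/nagios/etc/pnp/process_perfdata.cfg and count_timeouts /usr/local/nagios/var/perfdata.log. If the longer timeout helps, the count should stop growing after the change.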
Next log file...
The npcd log file shows that it wants to process files,
but they were deleted before it could get to them.
I would make changes to the npcd.cfg file, which is in the same directory as the other config file.
I would increase the number of npcd_max_threads. I have mine set to 15.
I also decreased the sleep_time to 6.
I am not suggesting that you should use those numbers. I worked at this until
my settings were right for my site. Those numbers are where I ended up after trial and error.
Try making changes to those numbers slowly. Increase threads, reduce sleep.
Use "service npcd restart" after each change. Wait and see if the system starts working better.
The Timeout set to 60 should make the most difference, but npcd needs more
parallel processes so it can get the job done faster.
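A rough sketch of how the change-and-restart loop above could be scripted. The helper names are hypothetical, and it assumes npcd.cfg uses plain "key = value" lines. Back up the file first, and treat any value you pass in as an example only, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative only.

# Show the npcd.cfg settings discussed above, plus load_threshold
# (the setting named in the HINT lines of npcd.log).
show_npcd_settings() {
    grep -E '^[[:space:]]*(npcd_max_threads|sleep_time|load_threshold)[[:space:]]*=' "$1"
}

# Set "key = value" in a config file and leave every other line alone.
# $1 = file, $2 = key, $3 = new value. Commented-out keys are not touched.
set_cfg_value() {
    tmp=$(mktemp) || return 1
    sed -E "s|^([[:space:]]*$2[[:space:]]*=[[:space:]]*).*|\\1$3|" "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

For example: set_cfg_value /usr/local/nagios/etc/pnp/npcd.cfg npcd_max_threads 10, then show_npcd_settings on the same file to confirm, then "service npcd restart". The 10 here is just a placeholder for whatever step you want to try next.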
One last thought. Have you considered setting up a ram disk for these files?
You will still need the changes I suggested, ram disk or no ram disk.
It is much easier to set up than I thought it would be. Nagios has
instructions in a PDF somewhere. If you do use the ram disk, you just need
to keep an eye on the space used and make sure you know early before it fills
up if there is a problem. I have mine set to 500MB at this time. I have noticed it
filling up a couple times. One time I needed to restart npcd. Once I needed to
restart nagios. Lessons learned...
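For keeping an eye on the space, a small check along these lines could be run by hand or from cron. This is a sketch with made-up function names. It assumes the spool lives under /usr/local/nagios/var/spool/perfdata, as in the logs above, and that the ram disk is mounted somewhere df can report on.

```shell
#!/bin/sh
# Hypothetical helpers; the function names are illustrative only.

# Print the used percentage (digits only) of the filesystem holding $1.
used_percent() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Count the files waiting under a spool directory.
spool_backlog() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Print a warning when usage of $1 is at or above $2 percent.
warn_if_full() {
    pct=$(used_percent "$1")
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full"
    fi
}
```

For example: warn_if_full /usr/local/nagios/var/spool/perfdata 80, and spool_backlog on the same directory to see how many files are waiting. The 80 is an arbitrary example threshold. A backlog that keeps growing would suggest npcd is not keeping up.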
SteveBeauchemin wrote: One last thought. Have you considered setting up a ram disk for these files?
Let's increase the logging verbosity and then take a deeper look at the logs. Follow the FAQ entry below to increase the log level of process_perfdata and npcd: