Perf Data stopped working

Posted: Fri Nov 20, 2015 1:09 pm
by CFT6Server
I noticed that all performance data graphs have stopped working. Looking at the logs, we see these messages in nagios.log:

Code: Select all

[1448042617] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1448042617.perfdata.service"
[1448042617] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1448042617.perfdata.host"
I am also seeing these entries in perfdata.log:

Code: Select all

2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Timeout after 20 secs. ***
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Please check your npcd.cfg
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1448042811.perfdata.host-PID-20276 deleted
2015-11-20 10:07:25 [20276] [0] *** Timeout while processing Host: "L2E-LAN-B02" Service: "_HOST_"
2015-11-20 10:07:25 [20276] [0] *** process_perfdata.pl terminated on signal ALRM
And in npcd.log:

Code: Select all

==> /usr/local/nagios/var/npcd.log <==
[11-20-2015 10:07:25] NPCD: ERROR: Executed command exits with return code '7'
[11-20-2015 10:07:25] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1448042811.perfdata.service'
[11-20-2015 10:07:25] NPCD: No more files to process... waiting for 15 seconds
It looks like this has been broken since the 17th, but because we hadn't been looking at the graphs, we didn't notice until now.
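
To pin down when the fork() warnings began, the matching nagios.log entries can be grouped by day. The following is only a minimal sketch: it assumes every log line starts with an epoch timestamp in square brackets (as in the excerpt above) and that GNU `date` is available. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Count "fork() in my_system_r() failed" warnings per UTC day.
# Assumption: each line begins with "[<epoch seconds>] ".
fork_failures_per_day() {
    local logfile="$1" ts
    grep -F 'fork() in my_system_r() failed' "$logfile" \
        | sed -n 's/^\[\([0-9][0-9]*\)\].*/\1/p' \
        | while read -r ts; do
              # GNU date: convert epoch seconds to YYYY-MM-DD (UTC)
              date -u -d "@$ts" +%F
          done \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Running `fork_failures_per_day /usr/local/nagios/var/nagios.log` would print one "date count" pair per line, so the first affected day shows up at the top.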

Re: Perf Data stopped working

Posted: Fri Nov 20, 2015 2:04 pm
by lmiltchev
Can you increase the perf data logging verbosity by following the steps outlined on our wiki FAQ page?

https://support.nagios.com/wiki/index.p ... leshooting

After this, restart npcd:

Code: Select all

service npcd restart
Then run the following commands and show us the output in code wraps:

Code: Select all

ls /usr/local/nagios/var/spool/xidpe | wc -l
ls /usr/local/nagios/var/spool/perfdata | wc -l
ls /usr/local/nagios/var/spool/checkresults | wc -l
ps -ef | grep perf
grep rrdcached /usr/local/nagios/etc/pnp/process_perfdata.cfg
tail -100 /usr/local/nagios/var/npcd.log
tail -100 /usr/local/nagios/var/perfdata.log
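
The three spool counts above could also be collected by a small helper that flags a directory once it holds more than a chosen number of entries. This is a rough sketch, not part of Nagios XI: the function name and the threshold are assumptions, and what counts as a "backlog" depends on the size of the installation.

```shell
#!/usr/bin/env bash
# Report the entry count of each spool directory and flag any directory
# whose count exceeds LIMIT. Returns non-zero if anything is flagged.
# Usage: spool_backlog LIMIT DIR [DIR...]
spool_backlog() {
    local limit="$1" dir count status=0
    shift
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "MISSING $dir"
            status=1
            continue
        fi
        # Count direct children only; tr strips padding some wc builds add
        count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')
        if [ "$count" -gt "$limit" ]; then
            echo "BACKLOG $dir $count"
            status=1
        else
            echo "OK $dir $count"
        fi
    done
    return "$status"
}
```

For example, `spool_backlog 1000 /usr/local/nagios/var/spool/xidpe /usr/local/nagios/var/spool/perfdata /usr/local/nagios/var/spool/checkresults` uses an arbitrary limit of 1000 entries per directory.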

Re: Perf Data stopped working

Posted: Fri Nov 20, 2015 2:39 pm
by CFT6Server

Code: Select all

# ls /usr/local/nagios/var/spool/xidpe | wc -l
0
# ls /usr/local/nagios/var/spool/perfdata | wc -l
3
# ls /usr/local/nagios/var/spool/checkresults | wc -l
1098
# ps -ef | grep perf
nagios   15073 15045  0 11:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios   15090 15073  0 11:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
root     31479  9842  0 11:36 pts/0    00:00:00 grep perf
# grep rrdcached /usr/local/nagios/etc/pnp/process_perfdata.cfg
# EXPERIMENTAL rrdcached Support
# RRD_DAEMON_OPTS = unix:/tmp/rrdcached.sock
I will send the tail output in a PM.

Re: Perf Data stopped working

Posted: Fri Nov 20, 2015 2:42 pm
by CFT6Server
It looks like perf data started working again after 10:25 am. The only thing I've done is apply configuration to force the service to restart, so I don't think the debug logging is going to catch what happened.

Re: Perf Data stopped working

Posted: Fri Nov 20, 2015 3:11 pm
by rkennedy
I wonder if it crashed. Can you run the following and post the output?

Code: Select all

free -m
If / when this happens again, before you do anything can you please post the file /var/log/messages for us to review?

Update: File received and placed in team share.

Re: Perf Data stopped working

Posted: Fri Nov 20, 2015 3:24 pm
by CFT6Server
# free -m
             total       used       free     shared    buffers     cached
Mem:          9891       7978       1913         31         92       6146
-/+ buffers/cache:       1740       8151
Swap:         2015        287       1728


I wish we had detected it sooner, because we are missing 3 days of performance data, but I will go back and check. I'll PM you our messages log, which contains events from Nov 15 to now.
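
One way to catch a gap like this sooner is a freshness check on the directory where the perfdata files are written. The sketch below only tests whether any file under a directory was modified recently. It assumes GNU `find`; the function name, the time window, and the directory you point it at are all assumptions to adapt.

```shell
#!/usr/bin/env bash
# Succeed if at least one regular file under DIR was modified within
# the last MINUTES minutes; fail otherwise (including a missing DIR).
# Usage: perfdata_is_fresh DIR MINUTES
perfdata_is_fresh() {
    local dir="$1" minutes="$2"
    # -quit (GNU find) stops at the first match, so large trees stay cheap
    [ -n "$(find "$dir" -type f -mmin "-$minutes" -print -quit 2>/dev/null)" ]
}
```

A cron job or a custom check could call, say, `perfdata_is_fresh /path/to/rrd/files 30 || echo "perfdata looks stale"`, where the path is a placeholder for wherever the RRD files live on the server.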

Re: Perf Data stopped working

Posted: Mon Nov 23, 2015 10:35 am
by rkennedy
Looking through your log files, I noticed signs of memory running short:

Code: Select all

Nov 20 09:38:22 kdcnagxi01 nagios: Warning: fork() in my_system_r() failed for command
I also noticed that this issue happened back in August for you as well: https://support.nagios.com/forum/viewto ... 16&t=34162

How many hosts and service checks are you running? How many CPUs do you have allocated?

Can you post the output of the following command?

Code: Select all

top|head -5

Re: Perf Data stopped working

Posted: Mon Nov 23, 2015 12:56 pm
by CFT6Server
We currently have 10 vCPUs allocated to this XI instance. We have about 20k service checks.

Code: Select all

top - 09:49:01 up 25 days, 13:01,  1 user,  load average: 0.80, 0.69, 0.64
Tasks: 332 total,   1 running, 331 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.7%us,  8.2%sy,  0.0%ni, 66.9%id,  6.1%wa,  0.1%hi,  2.1%si,  0.0%st
Mem:  10129068k total,  9306496k used,   822572k free,   111880k buffers
Swap:  2064380k total,   350276k used,  1714104k free,  3796356k cached

Re: Perf Data stopped working

Posted: Mon Nov 23, 2015 1:12 pm
by rkennedy
It looks like CPU isn't the issue; however, your RAM seems fairly low for 20k service checks. Can you increase it to 16 GB? I believe that will fix the memory issues you are seeing now and saw in the past.

Re: Perf Data stopped working

Posted: Tue Nov 24, 2015 5:25 pm
by CFT6Server
Thanks. I will bump the RAM up and monitor the performance and memory usage.
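
For the memory monitoring mentioned here, a small helper can report how much RAM is in use once buffers and page cache are excluded, which is the same figure the "-/+ buffers/cache" line of the older `free` output shows. This is a minimal sketch that reads the standard MemTotal, MemFree, Buffers and Cached fields from /proc/meminfo. The function name is made up, and any alert threshold is left to the reader.

```shell
#!/usr/bin/env bash
# Print the percentage of RAM in use, excluding buffers and page cache.
# Reads /proc/meminfo by default; another file in the same format can
# be passed for testing.
mem_used_percent() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END {
            if (total <= 0) exit 1
            # Integer percentage, rounded down
            print int((total - free - buffers - cached) * 100 / total)
        }
    ' "$meminfo"
}
```

Logging the result of `mem_used_percent` every few minutes, alongside swap usage, would show whether the extra RAM leaves enough headroom for fork() to succeed.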