Perf Data stopped working

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Perf Data stopped working

Post by CFT6Server »

I noticed that all performance data graphs have stopped working. Looking at the logs, we are seeing these messages in the nagios.log file:

Code: Select all

[1448042617] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1448042617.perfdata.service"
[1448042617] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1448042617.perfdata.host"
I am also seeing these entries in perfdata.log:

Code: Select all

2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Timeout after 20 secs. ***
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: Please check your npcd.cfg
2015-11-20 10:07:25 [20276] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1448042811.perfdata.host-PID-20276 deleted
2015-11-20 10:07:25 [20276] [0] *** Timeout while processing Host: "L2E-LAN-B02" Service: "_HOST_"
2015-11-20 10:07:25 [20276] [0] *** process_perfdata.pl terminated on signal ALRM
And in npcd.log:

Code: Select all

==> /usr/local/nagios/var/npcd.log <==
[11-20-2015 10:07:25] NPCD: ERROR: Executed command exits with return code '7'
[11-20-2015 10:07:25] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1448042811.perfdata.service'
[11-20-2015 10:07:25] NPCD: No more files to process... waiting for 15 seconds
It looks like this has been broken since the 17th, but since we hadn't been looking at the graphs, we didn't notice until now.
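
Editor's note: the bracketed numbers in nagios.log and the numbers in the spool file names are Unix epoch seconds, while perfdata.log and npcd.log use local wall-clock time. A minimal sketch for lining the two up, assuming GNU date is available (`epoch_to_utc` is a hypothetical helper name, not part of Nagios XI):

```shell
# epoch_to_utc SECONDS: print a Unix epoch timestamp as UTC date and time.
# Hypothetical helper; relies on GNU date's "-d @SECONDS" syntax.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1448042617   # timestamp from the nagios.log fork() warning
epoch_to_utc 1448042811   # timestamp embedded in the perfdata spool file name
```

These print 2015-11-20 18:03:37 and 2015-11-20 18:06:51 (UTC). The perfdata.log entries are stamped 10:07:25, which would be consistent with a server clock eight hours behind UTC, though the thread does not state the time zone.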
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Perf Data stopped working

Post by lmiltchev »

Can you increase the perf data logging verbosity by following the steps outlined on our wiki FAQ page?

https://support.nagios.com/wiki/index.p ... leshooting

After this, restart npcd:

Code: Select all

service npcd restart
Then run the following commands and show us the output in code wraps:

Code: Select all

ls /usr/local/nagios/var/spool/xidpe | wc -l
ls /usr/local/nagios/var/spool/perfdata | wc -l
ls /usr/local/nagios/var/spool/checkresults | wc -l
ps -ef | grep perf
grep rrdcached /usr/local/nagios/etc/pnp/process_perfdata.cfg
tail -100 /usr/local/nagios/var/npcd.log
tail -100 /usr/local/nagios/var/perfdata.log
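
Editor's note: the three spool counts above can also be collected in one pass. This is only a sketch; `count_entries` is a hypothetical helper name, and it assumes a find that supports -mindepth and -maxdepth (GNU find does). Unlike `ls | wc -l`, it also counts hidden files and reports a missing directory instead of printing an error.

```shell
# count_entries DIR: print the number of entries directly inside DIR
# (including hidden files), or "missing" if DIR does not exist.
# Hypothetical helper, not part of Nagios XI.
count_entries() {
    dir=$1
    if [ -d "$dir" ]; then
        find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
    else
        echo "missing"
    fi
}

# Spool directories taken from the commands above.
for d in xidpe perfdata checkresults; do
    printf '%s: %s\n' "$d" "$(count_entries "/usr/local/nagios/var/spool/$d")"
done
```

Running it a few times a minute apart shows whether any of the counts keeps growing, which would suggest files arriving faster than they are processed.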
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Code: Select all

# ls /usr/local/nagios/var/spool/xidpe | wc -l
0
# ls /usr/local/nagios/var/spool/perfdata | wc -l
3
# ls /usr/local/nagios/var/spool/checkresults | wc -l
1098
# ps -ef | grep perf
nagios   15073 15045  0 11:36 ?        00:00:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php > /usr/local/nagiosxi/var/perfdataproc.log 2>&1
nagios   15090 15073  0 11:36 ?        00:00:00 /usr/bin/php -q /usr/local/nagiosxi/cron/perfdataproc.php
root     31479  9842  0 11:36 pts/0    00:00:00 grep perf
# grep rrdcached /usr/local/nagios/etc/pnp/process_perfdata.cfg
# EXPERIMENTAL rrdcached Support
# RRD_DAEMON_OPTS = unix:/tmp/rrdcached.sock
I will send the tail output in a PM.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Looks like the perf data started working again after 10:25am. The only thing I've done is apply configuration, which forces the service to restart, so I don't think the debug logging is going to catch what happened.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Perf Data stopped working

Post by rkennedy »

I wonder if it crashed. Can you run the following and post the output?

Code: Select all

free -m
If / when this happens again, before you do anything can you please post the file /var/log/messages for us to review?

Update: File received and placed in team share.
Former Nagios Employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

# free -m
             total       used       free     shared    buffers     cached
Mem:          9891       7978       1913         31         92       6146
-/+ buffers/cache:       1740       8151
Swap:         2015        287       1728
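
Editor's note: in this older `free` layout, the "-/+ buffers/cache" row is derived from the "Mem:" row by treating buffers and page cache as reclaimable. The arithmetic, using the values posted above (in MB), works out as follows:

```shell
# Values in MB, copied from the "Mem:" row of the free -m output above.
used=7978
free_mb=1913
buffers=92
cached=6146

# Memory held by applications: used minus buffers and page cache.
app_used=$((used - buffers - cached))

# Memory available to applications: free plus buffers and page cache.
app_free=$((free_mb + buffers + cached))

echo "used by applications: ${app_used} MB"
echo "available to applications: ${app_free} MB"
```

This prints 1740 MB and 8151 MB, matching the second row of the output. Note that this is a snapshot taken after the service had already been restarted, so it may not reflect memory use at the time of the failure.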


I wish we had detected it sooner, because we are missing 3 days of performance data. I will go back and check. I'll PM you our messages log, which contains events from Nov 15 to now.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Perf Data stopped working

Post by rkennedy »

Looking through your log files, I noticed memory running short:

Code: Select all

Nov 20 09:38:22 kdcnagxi01 nagios: Warning: fork() in my_system_r() failed for command
I also noticed that this same issue happened for you back in August: https://support.nagios.com/forum/viewto ... 16&t=34162
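
Editor's note: fork() generally fails either because a process limit has been reached or because the kernel cannot find memory for the new process. The thread does not confirm which applied here. As a rough sketch for checking the memory side on Linux, the commit accounting fields in /proc/meminfo can be read like this (`meminfo_kb` is a hypothetical helper name):

```shell
# meminfo_kb FIELD [FILE]: print the value in kB of FIELD from a
# /proc/meminfo-style file. Hypothetical helper; FILE defaults to /proc/meminfo.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Print a few fields relevant to memory pressure, if /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
    for field in MemFree SwapFree CommitLimit Committed_AS; do
        printf '%-13s %s kB\n' "$field" "$(meminfo_kb "$field")"
    done
fi

# For the process-limit side, running "ulimit -u" in bash as the nagios user
# prints the per-user process limit.
```

Committed_AS is the kernel's estimate of memory already promised to processes. CommitLimit is only enforced when strict overcommit is enabled, so a high Committed_AS is a hint rather than proof of the cause.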

How many hosts and service checks are you running? How many CPUs do you have allocated?

Can you post the output of the following command -

Code: Select all

top -b -n 1 | head -5
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

We currently have 10 vCPUs allocated to this XI instance. We have about 20k service checks.

Code: Select all

top - 09:49:01 up 25 days, 13:01,  1 user,  load average: 0.80, 0.69, 0.64
Tasks: 332 total,   1 running, 331 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.7%us,  8.2%sy,  0.0%ni, 66.9%id,  6.1%wa,  0.1%hi,  2.1%si,  0.0%st
Mem:  10129068k total,  9306496k used,   822572k free,   111880k buffers
Swap:  2064380k total,   350276k used,  1714104k free,  3796356k cached
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Perf Data stopped working

Post by rkennedy »

It looks like CPU isn't the issue; however, your RAM seems fairly low for 20k service checks. Can you increase it to 16GB? I believe that will fix the memory issues you are seeing now and saw in the past.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Thanks. I will bump the RAM up and monitor the performance and memory usage.