Perf Data stopped working
Re: Perf Data stopped working
Sounds good. Let us know the result; this should fix it, though. I'll leave this thread open while we wait for your response.
Former Nagios Employee
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Perf Data stopped working
Noticed that perf graphs have stopped working yet again. After a reboot, the graphs are slowly coming back.
Saw these...
Code: Select all
[1452108948] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108948.perfdata.host"
[1452108949] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108949.perfdata.service"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.host"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.service"
Code: Select all
# tail -25 /usr/local/nagios/var/perfdata.log
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] data2rrd called
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd 1452071983:5.383616
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd 1452071983:1898
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd 1452071983:2
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd 1452071983:619
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd 1452071983:1276
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd updated
2016-01-06 01:20:02 [28912] [1] PNP exiting (runtime 1.837839s) ...
Code: Select all
# tail -25 /usr/local/nagios/var/npcd.log
[01-06-2016 11:36:06] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:21] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:21] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:36] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:36] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:51] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:51] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:37:06] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:37:06] NPCD: No more files to process... waiting for 15 seconds
After reboot, I am seeing these....
Code: Select all
[01-06-2016 11:55:32] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:55:32] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452109983.perfdata.service'
[01-06-2016 11:56:19] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:19] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110066.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110013.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110029.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110096.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110081.perfdata.service'
[01-06-2016 11:57:24] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:57:24] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110126.perfdata.service'
Since the reboot, memory is fine... but perhaps it gets used up over time and we have to schedule reboots to clear it?
Before reboot
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      14671       1279         27        157       3555
-/+ buffers/cache:      10957       4992
Swap:         2015       2015          0
After reboot
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         15950       3717      12232         11         97       2097
-/+ buffers/cache:       1521      14428
Swap:         2015          0       2015
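One way to test the "memory gets used up over time" theory without waiting for the next outage is to sample swap usage on a schedule. This is a minimal sketch, assuming `free -m` prints a `Swap:` row laid out as in the output above; the function name `swap_used_pct` is made up for illustration and is not part of Nagios or procps.

```shell
#!/bin/sh
# Hypothetical helper: read `free -m` output on stdin and print the
# percentage of swap in use as an integer (rounded down).
swap_used_pct() {
    awk '/^Swap:/ { if ($2 > 0) print int($3 * 100 / $2); else print 0 }'
}

# Example use on the Nagios host (e.g. from cron, not run here):
#   echo "$(date '+%F %T') swap_used_pct=$(free -m | swap_used_pct)" >> /tmp/swap-usage.log
```

A steadily climbing value between reboots would support the leak theory; a flat one would point elsewhere.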
Re: Perf Data stopped working
What is the output of these commands?
Code: Select all
ls -l /usr/local/nagios/var/
lsof | grep "^nagios" | wc -l
free -m
Thank you
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Perf Data stopped working
Code: Select all
# ls -l /usr/local/nagios/var/
total 176148
drwxrwxr-x 2 nagios nagios 16384 Jan 6 23:59 archives
-rw-r--r-- 1 nagios nagios 0 Jan 7 09:33 host-perfdata
-rw-r--r-- 1 nagios nagios 939 Jan 7 09:27 nagios.configtest
-rw-r--r-- 1 nagios nagios 573493 Oct 9 09:14 nagios.debug
-rw-r--r-- 1 nagios nagios 1000050 Oct 9 09:14 nagios.debug.old
-rw-r--r-- 1 nagios nagios 6 Jan 7 09:27 nagios.lock
-rw-r--r-- 1 nagios nagios 5874608 Jan 7 09:33 nagios.log
-rw------- 1 nagios nagios 0 Oct 9 18:16 nagios.tmp6jAgXO
-rw------- 1 nagios users 24367104 Oct 7 11:09 nagios.tmpY5fx7C
-rw-r--r-- 1 nagios nagios 0 Sep 21 16:18 ndo2db.debug
-rw-r--r-- 1 nagios nagios 779612 Sep 21 16:18 ndo2db.debug.old
-rw-r--r-- 1 nagios nagios 5 Jan 6 11:41 ndo2db.lock
-rw-r--r-- 1 nagios nagios 0 Jan 7 09:27 ndomod.tmp
srwxr-xr-x 1 nagios nagios 0 Jan 6 11:41 ndo.sock
-rw-r--r-- 1 nagios nagios 5106068 Jan 7 09:33 npcd.log
-rw-r--r-- 1 nagios nagios 10485832 Jan 6 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios 23692657 Jan 7 09:27 objects.cache
-rw-r--r-- 1 nagios nagios 23692657 Jan 7 09:27 objects.precache
-rw-rw-r-- 1 nagios nagios 5036937 Jan 7 09:33 perfdata.log
-rw------- 1 nagios nagios 39927624 Jan 7 09:27 retention.dat
drwxrwsr-x 2 nagios nagcmd 4096 Jan 7 09:27 rw
-rw-r--r-- 1 nagios nagios 0 Jan 7 09:33 service-perfdata
drwxr-xr-x 5 nagios nagios 4096 Feb 24 2015 spool
drwxr-xr-x 2 nagios nagios 4096 Jan 7 09:33 stats
-rw-rw-r-- 1 nagios nagios 39646890 Jan 7 09:33 status.dat
-rw-r--r-- 1 root root 105675 Jul 16 16:58 wmitest.txt
Code: Select all
# lsof | grep "^nagios" | wc -l
196
Code: Select all
# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      13913       2036         35        160      10243
-/+ buffers/cache:       3510      12440
Swap:         2015         34       1981
Re: Perf Data stopped working
Do you happen to be using mod_gearman? We've noticed several issues lately where people running gearman seem to have a memory leak.
Former Nagios employee
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Perf Data stopped working
We are running mod_gearman. Is there any fix to this if it is a suspected memory leak?
Re: Perf Data stopped working
Not sure yet. We have seen a few people with memory issues who have all been using gearman, and we're just trying to correlate.
What version of gearman are you running?
Code: Select all
rpm -qa | grep gearman
Sorry, meant to ask for this since it looked like you were hitting resource limits (processes, files, etc):
Code: Select all
cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: Perf Data stopped working
Code: Select all
# rpm -qa | grep gearman
libgearman-1.1.8-2.el6.x86_64
gearmand-1.1.8-2.el6.x86_64
mod_gearman-1.5.0b1-1.el6.x86_64
Code: Select all
# cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4096                 4096                 processes
Max open files            200000               200000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63700                63700                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
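The fork() warnings earlier in the thread can be checked against the limits above. fork() can fail either because the user has reached its process limit or because the kernel cannot allocate memory, so it helps to compare the number of nagios-owned tasks with the soft "Max processes" value. A rough sketch follows; the function name `max_procs_soft` is made up here, and the commented usage assumes the procps `ps` and the lock file path shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a /proc/<pid>/limits file on stdin and print
# the soft "Max processes" value (third whitespace-separated field).
max_procs_soft() {
    awk '/^Max processes/ { print $3 }'
}

# Usage on the Nagios host (not run here):
#   pid=$(cat /usr/local/nagios/var/nagios.lock)
#   limit=$(max_procs_soft < "/proc/$pid/limits")
#   tasks=$(ps -L -u nagios -o lwp= | wc -l)   # counts threads as well
#   echo "nagios tasks: $tasks of $limit"
```

If the task count stays far below the limit while fork() is failing, memory exhaustion (note the fully used swap before the reboot) becomes the more likely cause.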
Re: Perf Data stopped working
Please post the sanitized output of these commands:
Code: Select all
grep -v ^# /etc/mod_gearman/mod_gearman_neb.conf
grep gearman /usr/local/nagios/etc/nagios.cfg
grep -v ^# /etc/mod_gearman/mod_gearman_worker.conf
Thank you
Re: Perf Data stopped working
In addition to my previous post, you might want to raise the max processes limit for the nagios user:
Code: Select all
echo "nagios - nproc 14865" >> /etc/security/limits.conf
Then reboot the system.
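Because `>>` appends a new line every time it is run, a guarded variant may be safer when the change is scripted. This is only a sketch: `ensure_line` is a hypothetical helper, and it uses `nproc`, which is the item name that limits.conf documents for the per-user process limit.

```shell
#!/bin/sh
# Hypothetical helper: append LINE to FILE only if FILE does not already
# contain that exact line, so repeated runs do not create duplicates.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage on the Nagios host (as root, not run here):
#   ensure_line /etc/security/limits.conf "nagios - nproc 14865"
```

After the reboot, the "Max processes" row in /proc/<pid>/limits for the running nagios process shows whether the new value was picked up.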