Re: Perf Data stopped working

Posted: Tue Nov 24, 2015 5:29 pm
by rkennedy
Sounds good. Let us know the result; this should fix it, though. I'll leave this thread open while we wait for your response.

Re: Perf Data stopped working

Posted: Wed Jan 06, 2016 3:01 pm
by CFT6Server
Noticed that the perf graphs have stopped working yet again.

Saw these warnings:

Code:

[1452108948] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108948.perfdata.host"
[1452108949] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108949.perfdata.service"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.host"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.service"
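
The warnings above carry epoch timestamps, while npcd.log and perfdata.log use calendar dates, which makes them awkward to line up. A minimal sketch for converting them, assuming GNU date is available; the function name nagios_log_utc is made up for illustration, and the usage line assumes these warnings come from nagios.log:

```shell
# Rewrite "[epoch] message" lines from stdin as "[YYYY-MM-DD HH:MM:SS] message" in UTC.
# Assumes GNU date (for -d @epoch); lines without a numeric timestamp pass through unchanged.
nagios_log_utc() {
    while IFS= read -r line; do
        ts=${line#"["}        # drop the leading "["
        ts=${ts%%"]"*}        # keep everything before the first "]"
        case $ts in
            ''|*[!0-9]*) printf '%s\n' "$line"; continue ;;
        esac
        rest=${line#*"] "}    # message text after "] "
        printf '[%s] %s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
    done
}

# Usage (log path is an assumption):
#   tail -25 /usr/local/nagios/var/nagios.log | nagios_log_utc
```

The output is in UTC, so it still needs shifting to local time before comparing against the other logs.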

Code:

# tail -25 /usr/local/nagios/var/perfdata.log
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] data2rrd called
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd 1452071983:5.383616
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd 1452071983:1898
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd 1452071983:2
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd 1452071983:619
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd 1452071983:1276
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd updated
2016-01-06 01:20:02 [28912] [1] PNP exiting (runtime 1.837839s) ...

Code:

# tail -25 /usr/local/nagios/var/npcd.log
[01-06-2016 11:36:06] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:21] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:21] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:36] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:36] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:51] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:51] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:37:06] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:37:06] NPCD: No more files to process... waiting for 15 seconds
After the reboot, I am seeing these errors:

Code:

[01-06-2016 11:55:32] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:55:32] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452109983.perfdata.service'
[01-06-2016 11:56:19] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:19] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110066.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110013.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110029.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110096.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110081.perfdata.service'
[01-06-2016 11:57:24] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:57:24] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110126.perfdata.service'
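
To see whether errors like these taper off as the backlog clears, the failures can be tallied by return code. A rough sketch, assuming the log line format shown above; npcd_error_counts is a hypothetical helper name:

```shell
# Count NPCD "return code" errors per code, reading npcd.log text on stdin.
# Splits on single quotes, so field 2 is the quoted return code.
npcd_error_counts() {
    awk -F"'" '/Executed command exits with return code/ { n[$2]++ }
               END { for (c in n) print c, n[c] }'
}

# Usage:
#   npcd_error_counts < /usr/local/nagios/var/npcd.log
```

Each output line is a return code followed by how many times it appeared.
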
Graphs are slowly coming back.
Since the reboot, memory is fine, but perhaps it gets used up over time and we have to schedule reboots to clear it?

Before reboot

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      14671       1279         27        157       3555
-/+ buffers/cache:      10957       4992
Swap:         2015       2015          0
After reboot

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950       3717      12232         11         97       2097
-/+ buffers/cache:       1521      14428
Swap:         2015          0       2015
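
Before the reboot swap was completely used (2015 of 2015 MB), which may be related to the fork() failures. A small sketch for tracking swap use over time, assuming the free -m layout shown above (a "Swap:" row with total in column 2 and used in column 3); swap_used_pct is an illustrative name:

```shell
# Print the percentage of swap in use (truncated to an integer),
# reading `free -m` style output on stdin. Prints nothing if there is no swap.
swap_used_pct() {
    awk '$1 == "Swap:" && $2 > 0 { printf "%d\n", ($3 * 100) / $2 }'
}

# Usage:
#   free -m | swap_used_pct
```

Running this from cron and logging the result would show whether usage creeps up between reboots.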

Re: Perf Data stopped working

Posted: Thu Jan 07, 2016 10:26 am
by ssax
What is the output of these commands?

Code:

ls -l /usr/local/nagios/var/
lsof | grep "^nagios" | wc -l
free -m
Thank you

Re: Perf Data stopped working

Posted: Thu Jan 07, 2016 12:34 pm
by CFT6Server

Code:

# ls -l /usr/local/nagios/var/
total 176148
drwxrwxr-x 2 nagios nagios    16384 Jan  6 23:59 archives
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:33 host-perfdata
-rw-r--r-- 1 nagios nagios      939 Jan  7 09:27 nagios.configtest
-rw-r--r-- 1 nagios nagios   573493 Oct  9 09:14 nagios.debug
-rw-r--r-- 1 nagios nagios  1000050 Oct  9 09:14 nagios.debug.old
-rw-r--r-- 1 nagios nagios        6 Jan  7 09:27 nagios.lock
-rw-r--r-- 1 nagios nagios  5874608 Jan  7 09:33 nagios.log
-rw------- 1 nagios nagios        0 Oct  9 18:16 nagios.tmp6jAgXO
-rw------- 1 nagios users  24367104 Oct  7 11:09 nagios.tmpY5fx7C
-rw-r--r-- 1 nagios nagios        0 Sep 21 16:18 ndo2db.debug
-rw-r--r-- 1 nagios nagios   779612 Sep 21 16:18 ndo2db.debug.old
-rw-r--r-- 1 nagios nagios        5 Jan  6 11:41 ndo2db.lock
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:27 ndomod.tmp
srwxr-xr-x 1 nagios nagios        0 Jan  6 11:41 ndo.sock
-rw-r--r-- 1 nagios nagios  5106068 Jan  7 09:33 npcd.log
-rw-r--r-- 1 nagios nagios 10485832 Jan  6 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios 23692657 Jan  7 09:27 objects.cache
-rw-r--r-- 1 nagios nagios 23692657 Jan  7 09:27 objects.precache
-rw-rw-r-- 1 nagios nagios  5036937 Jan  7 09:33 perfdata.log
-rw------- 1 nagios nagios 39927624 Jan  7 09:27 retention.dat
drwxrwsr-x 2 nagios nagcmd     4096 Jan  7 09:27 rw
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:33 service-perfdata
drwxr-xr-x 5 nagios nagios     4096 Feb 24  2015 spool
drwxr-xr-x 2 nagios nagios     4096 Jan  7 09:33 stats
-rw-rw-r-- 1 nagios nagios 39646890 Jan  7 09:33 status.dat
-rw-r--r-- 1 root   root     105675 Jul 16 16:58 wmitest.txt

Code:

# lsof | grep "^nagios" | wc -l
196

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      13913       2036         35        160      10243
-/+ buffers/cache:       3510      12440
Swap:         2015         34       1981

Re: Perf Data stopped working

Posted: Thu Jan 07, 2016 4:56 pm
by tmcdonald
Do you happen to be using mod_gearman? We've noticed several issues lately where people running gearman seem to have a memory leak.

Re: Perf Data stopped working

Posted: Thu Jan 07, 2016 8:11 pm
by CFT6Server
We are running mod_gearman. Is there a fix for this if it is a suspected memory leak?

Re: Perf Data stopped working

Posted: Fri Jan 08, 2016 3:16 pm
by ssax
Not sure yet. We have seen a few people with memory issues who have all been using gearman, so we're just trying to correlate.

What version of gearman are you running?

Code:

rpm -qa | grep gearman
Sorry, I also meant to ask for this, since it looked like you were hitting resource limits (processes, files, etc.):

Code:

cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits

Re: Perf Data stopped working

Posted: Fri Jan 08, 2016 3:36 pm
by CFT6Server

Code:

# rpm -qa | grep gearman
libgearman-1.1.8-2.el6.x86_64
gearmand-1.1.8-2.el6.x86_64
mod_gearman-1.5.0b1-1.el6.x86_64

Code:

# cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4096                 4096                 processes
Max open files            200000               200000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63700                63700                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
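
The "Max processes" soft limit above is 4096. On Linux that limit counts threads as well as processes for the user, so it can be compared with a live thread count to see how much headroom is left. A sketch under those assumptions; the helper names are made up, and the usage lines assume procps ps and the nagios user:

```shell
# Soft "Max processes" limit, read from /proc/<pid>/limits text on stdin.
max_procs_soft() {
    awk '$1 == "Max" && $2 == "processes" { print $3 }'
}

# Remaining headroom: $1 = soft limit, $2 = current process/thread count.
nproc_headroom() {
    if [ "$1" = "unlimited" ]; then
        echo unlimited
    else
        echo $(( $1 - $2 ))
    fi
}

# Usage (ps -L lists one line per thread):
#   limit=$(max_procs_soft < /proc/$(cat /usr/local/nagios/var/nagios.lock)/limits)
#   used=$(ps -u nagios -L -o lwp= | wc -l)
#   nproc_headroom "$limit" "$used"
```

A headroom near zero around the time of the fork() warnings would point at this limit rather than at memory.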

Re: Perf Data stopped working

Posted: Mon Jan 11, 2016 11:00 am
by ssax
Please post the sanitized output of these commands:

Code:

grep -v ^# /etc/mod_gearman/mod_gearman_neb.conf
grep gearman /usr/local/nagios/etc/nagios.cfg
grep -v ^# /etc/mod_gearman/mod_gearman_worker.conf
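
The grep commands above drop comment lines but leave blank lines and any secrets in place. One possible way to sanitize the output in the same pass, assuming the shared secret is stored in an option named key (check your own config for other sensitive options); sanitize_conf is a hypothetical helper:

```shell
# Strip comment and blank lines and mask the value of any "key=" option.
# Reads a config file on stdin. Only an option named exactly "key" is masked.
sanitize_conf() {
    sed -e '/^[[:space:]]*#/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/^\([[:space:]]*key[[:space:]]*=\).*/\1<redacted>/'
}

# Usage:
#   sanitize_conf < /etc/mod_gearman/mod_gearman_neb.conf
```

Review the result by eye before posting, since hostnames and paths are left untouched.
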
Thank you

Re: Perf Data stopped working

Posted: Mon Jan 11, 2016 11:09 am
by ssax
In addition to my previous post, you might want to raise your limit for max processes:

Code:

echo "nagios          -       nproc           14865" >> /etc/security/limits.conf
Then reboot the system.
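
Because the command above appends with >>, running it more than once leaves duplicate entries in limits.conf. A minimal sketch of an append that is safe to repeat; append_once is an illustrative name, and the usage line uses nproc, the pam_limits item name for max processes:

```shell
# Append a line to a file only if that exact line is not already present.
#   $1 = file, $2 = line
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (as root):
#   append_once /etc/security/limits.conf "nagios          -       nproc           14865"
```

After the reboot, the cat /proc/.../limits command from earlier in the thread should show the new "Max processes" value if the change took effect.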