Perf Data stopped working

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Perf Data stopped working

Post by rkennedy »

Sounds good. Let us know the result; this should fix it, though. I'll leave this thread open to wait for your response.
Former Nagios Employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Noticed that perf graphs yet again stopped working....

Saw these...

Code:

[1452108948] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108948.perfdata.host"
[1452108949] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108949.perfdata.service"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.host"
[1452108964] Warning: fork() in my_system_r() failed for command "/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1452108964.perfdata.service"

Code:

# tail -25 /usr/local/nagios/var/perfdata.log
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] No Custom Template found for pnp-runtime (/usr/local/nagios/etc/pnp/check_commands/pnp-runtime.cfg)
2016-01-06 01:20:02 [28912] [2] Template is pnp-runtime.php
2016-01-06 01:20:02 [28912] [2] data2rrd called
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd 1452071983:5.383616
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_runtime.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd 1452071983:1898
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_rows.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd 1452071983:2
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_errors.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_invalid.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd 1452071983:619
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_skipped.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd 1452071983:1276
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_update.rrd updated
2016-01-06 01:20:02 [28912] [2] RRDs::update /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd 1452071983:0
2016-01-06 01:20:02 [28912] [2] /usr/local/nagios/share/perfdata/.pnp-internal/runtime_create.rrd updated
2016-01-06 01:20:02 [28912] [1] PNP exiting (runtime 1.837839s) ...

Code:

# tail -25 /usr/local/nagios/var/npcd.log
[01-06-2016 11:36:06] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:21] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:21] NPCD: DEBUG: load 1.470000/20.000000
[01-06-2016 11:36:21] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:21] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:36] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:36] NPCD: DEBUG: load 1.280000/20.000000
[01-06-2016 11:36:36] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:36] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:36:51] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:36:51] NPCD: DEBUG: load 1.370000/20.000000
[01-06-2016 11:36:51] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:36:51] NPCD: No more files to process... waiting for 15 seconds
[01-06-2016 11:37:06] NPCD: Found 2 files in /usr/local/nagios/var/spool/perfdata/
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is .
[01-06-2016 11:37:06] NPCD: DEBUG: load 1.220000/20.000000
[01-06-2016 11:37:06] NPCD: ThreadCounter 0/5 File is ..
[01-06-2016 11:37:06] NPCD: No more files to process... waiting for 15 seconds
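The NPCD log above appears to find only the `.` and `..` entries on every pass, which would suggest that nothing is reaching the spool directory while the `/bin/mv` forks are failing. A minimal sketch for checking how many files are actually waiting in a spool directory; the function name `count_spool_files` is made up for illustration and is not part of Nagios XI or PNP:

```shell
#!/bin/bash
# Hypothetical helper: count regular files sitting directly in a spool
# directory. "." and ".." are not regular files, so they are not counted.
count_spool_files() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
}
```

Running `count_spool_files /usr/local/nagios/var/spool/xidpe` and `count_spool_files /usr/local/nagios/var/spool/perfdata` a few times, a minute apart, should show whether files are arriving and being drained.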
After reboot, I am seeing these....

Code:

[01-06-2016 11:55:32] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:55:32] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452109983.perfdata.service'
[01-06-2016 11:56:19] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:19] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110066.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110013.perfdata.service'
[01-06-2016 11:56:20] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:20] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110029.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110096.perfdata.service'
[01-06-2016 11:56:52] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:56:52] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110081.perfdata.service'
[01-06-2016 11:57:24] NPCD: ERROR: Executed command exits with return code '7'
[01-06-2016 11:57:24] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1452110126.perfdata.service'
Graphs are slowly coming back...
Since the reboot, memory is fine... but perhaps it gets used up over time and we have to schedule reboots to clear it?

Before reboot

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      14671       1279         27        157       3555
-/+ buffers/cache:      10957       4992
Swap:         2015       2015          0
After reboot

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950       3717      12232         11         97       2097
-/+ buffers/cache:       1521      14428
Swap:         2015          0       2015
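To see which processes are holding the memory before the next reboot, something along these lines might help. It is only a sketch: `rss_by_command` is a made-up helper name, it sums resident memory per command name from `ps` output, and it only looks at the first word of the command name.

```shell
#!/bin/bash
# Hypothetical helper: read "RSS_in_kB command" lines on stdin, sum the
# resident memory per command name, and print "MB command", largest first.
rss_by_command() {
    awk '{ rss[$2] += $1 }
         END { for (c in rss) printf "%d %s\n", rss[c] / 1024, c }' | sort -rn
}
```

Usage, assuming the procps `ps` that ships with the distribution: `ps -e -o rss= -o comm= | rss_by_command | head -n 5`. Logging that output from cron every hour or so would show whether one command keeps growing.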
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Perf Data stopped working

Post by ssax »

What is the output of these commands?

Code:

ls -l /usr/local/nagios/var/
lsof | grep "^nagios" | wc -l
free -m
Thank you
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Code:

# ls -l /usr/local/nagios/var/
total 176148
drwxrwxr-x 2 nagios nagios    16384 Jan  6 23:59 archives
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:33 host-perfdata
-rw-r--r-- 1 nagios nagios      939 Jan  7 09:27 nagios.configtest
-rw-r--r-- 1 nagios nagios   573493 Oct  9 09:14 nagios.debug
-rw-r--r-- 1 nagios nagios  1000050 Oct  9 09:14 nagios.debug.old
-rw-r--r-- 1 nagios nagios        6 Jan  7 09:27 nagios.lock
-rw-r--r-- 1 nagios nagios  5874608 Jan  7 09:33 nagios.log
-rw------- 1 nagios nagios        0 Oct  9 18:16 nagios.tmp6jAgXO
-rw------- 1 nagios users  24367104 Oct  7 11:09 nagios.tmpY5fx7C
-rw-r--r-- 1 nagios nagios        0 Sep 21 16:18 ndo2db.debug
-rw-r--r-- 1 nagios nagios   779612 Sep 21 16:18 ndo2db.debug.old
-rw-r--r-- 1 nagios nagios        5 Jan  6 11:41 ndo2db.lock
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:27 ndomod.tmp
srwxr-xr-x 1 nagios nagios        0 Jan  6 11:41 ndo.sock
-rw-r--r-- 1 nagios nagios  5106068 Jan  7 09:33 npcd.log
-rw-r--r-- 1 nagios nagios 10485832 Jan  6 19:37 npcd.log.old
-rw-r--r-- 1 nagios nagios 23692657 Jan  7 09:27 objects.cache
-rw-r--r-- 1 nagios nagios 23692657 Jan  7 09:27 objects.precache
-rw-rw-r-- 1 nagios nagios  5036937 Jan  7 09:33 perfdata.log
-rw------- 1 nagios nagios 39927624 Jan  7 09:27 retention.dat
drwxrwsr-x 2 nagios nagcmd     4096 Jan  7 09:27 rw
-rw-r--r-- 1 nagios nagios        0 Jan  7 09:33 service-perfdata
drwxr-xr-x 5 nagios nagios     4096 Feb 24  2015 spool
drwxr-xr-x 2 nagios nagios     4096 Jan  7 09:33 stats
-rw-rw-r-- 1 nagios nagios 39646890 Jan  7 09:33 status.dat
-rw-r--r-- 1 root   root     105675 Jul 16 16:58 wmitest.txt

Code:

# lsof | grep "^nagios" | wc -l
196

Code:

# free -m
             total       used       free     shared    buffers     cached
Mem:         15950      13913       2036         35        160      10243
-/+ buffers/cache:       3510      12440
Swap:         2015         34       1981
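Note that most of the 13913 MB shown as used here is page cache (10243 MB cached); the `-/+ buffers/cache` line puts application usage at 3510 MB. A small sketch that computes the same kind of figure from `/proc/meminfo`, so it can be logged over time; `app_used_mb` is a hypothetical name, and the formula (MemTotal - MemFree - Buffers - Cached) is the classic approximation rather than anything specific to Nagios:

```shell
#!/bin/bash
# Hypothetical helper: approximate memory used by applications, in MB,
# as MemTotal - MemFree - Buffers - Cached (values in meminfo are in kB).
# The optional argument lets you point it at a saved copy of /proc/meminfo.
app_used_mb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { t = $2 }
         /^MemFree:/  { f = $2 }
         /^Buffers:/  { b = $2 }
         /^Cached:/   { c = $2 }
         END { printf "%d\n", (t - f - b - c) / 1024 }' "$meminfo"
}
```

If that number climbs steadily between reboots while the cache figure stays flat, that would point toward a leak rather than normal caching.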
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Perf Data stopped working

Post by tmcdonald »

Do you happen to be using mod_gearman? We've noticed several issues lately where people running gearman seem to have a memory leak.
Former Nagios employee
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

We are running mod_gearman. Is there any fix to this if it is a suspected memory leak?
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Perf Data stopped working

Post by ssax »

Not sure yet. We have seen a few people with memory issues who have all been using gearman; we're just trying to correlate.

What version of gearman are you running?

Code:

rpm -qa | grep gearman
Sorry, I meant to ask for this as well, since it looked like you were hitting resource limits (processes, files, etc.):

Code:

cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Perf Data stopped working

Post by CFT6Server »

Code:

# rpm -qa | grep gearman
libgearman-1.1.8-2.el6.x86_64
gearmand-1.1.8-2.el6.x86_64
mod_gearman-1.5.0b1-1.el6.x86_64

Code:

# cat /proc/`cat /usr/local/nagios/var/nagios.lock`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4096                 4096                 processes
Max open files            200000               200000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63700                63700                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
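The `Max processes` limit above is 4096. Since `fork()` can fail when a user hits that limit, it may be worth comparing it against the number of tasks the nagios user is actually running. A rough sketch; both function names are made up for illustration, and it assumes the procps `ps` and that threads count toward the limit on this kernel:

```shell
#!/bin/bash
# Hypothetical helper: print the soft "Max processes" value from a
# /proc/<pid>/limits file (third whitespace-separated field on that row).
max_processes_soft() {
    awk '/^Max processes/ { print $3 }' "$1"
}

# Hypothetical helper: count tasks owned by a user, one line per thread (-L).
count_user_tasks() {
    ps -L -u "$1" -o pid= | wc -l | tr -d ' '
}
```

For example, `max_processes_soft /proc/$(cat /usr/local/nagios/var/nagios.lock)/limits` next to `count_user_tasks nagios`, sampled while the fork warnings are appearing, would show how close the two numbers get.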
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Perf Data stopped working

Post by ssax »

Please post the sanitized output of these commands:

Code:

grep -v ^# /etc/mod_gearman/mod_gearman_neb.conf
grep gearman /usr/local/nagios/etc/nagios.cfg
grep -v ^# /etc/mod_gearman/mod_gearman_worker.conf
Thank you
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Perf Data stopped working

Post by ssax »

In addition to my previous post, you might want to raise your limit for max processes:

Code:

echo "nagios          -       nproc           14865" >> /etc/security/limits.conf
Then reboot the system.