performance data stopped again

Post by **pnewlon** » Tue Apr 08, 2014 1:54 pm

I am really getting frustrated with these servers! My newest box (32-bit Nagios converted to 64-bit Nagios from a VM image) quit processing performance data at about 10:30 PM last night.

npcd is still running. tail npcd.log:
[04-08-2014 14:23:55] NPCD: Processing file '1396981418.perfdata.host'
[04-08-2014 14:23:55] NPCD: Processing file '1396981418.perfdata.service'
[04-08-2014 14:23:55] NPCD: Processing file '1396981432.perfdata.host'
[04-08-2014 14:23:55] NPCD: Processing file '1396981432.perfdata.service'
[04-08-2014 14:23:56] NPCD: No more files to process... waiting for 15 seconds
[04-08-2014 14:24:11] NPCD: No more files to process... waiting for 15 seconds

perfdata.log has not been updated since 10:50 PM last night. tail perfdata.log:
2014-04-07 22:29:02 [8618] [0] *** TIMEOUT: Please check your npcd.cfg
2014-04-07 22:29:02 [8618] [0] *** TIMEOUT: /var/nagiosramdisk/spool/perfdata//1396924115.perfdata.service-PID-8618 deleted
2014-04-07 22:29:02 [8618] [0] *** Timeout while processing Host: "00926_OB01-MP" Service: "memAvailMB"
2014-04-07 22:29:02 [8618] [0] *** process_perfdata.pl terminated on signal ALRM
2014-04-07 22:50:45 [25469] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-04-07 22:50:45 [25469] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-04-07 22:50:45 [25469] [0] *** TIMEOUT: Please check your npcd.cfg
2014-04-07 22:50:45 [25469] [0] *** TIMEOUT: /var/nagiosramdisk/spool/perfdata//1396925420.perfdata.service-PID-25469 deleted
2014-04-07 22:50:45 [25469] [0] *** Timeout while processing Host: "01639_HTTP" Service: "Ping"
2014-04-07 22:50:45 [25469] [0] *** process_perfdata.pl terminated on signal ALRM

I compared the npcd.cfg of my working system to the one on the system that quit last night and there is no difference, so I don't know what I am supposed to check in my npcd.cfg!
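To see which checks are actually hitting the timeout, the `Timeout while processing` lines in perfdata.log can be tallied per host and service. This is a minimal sketch that assumes the log format shown above. The function name `count_timeouts` and the log path in the usage comment are illustrative assumptions. As far as is known for PNP4Nagios, the 5-second limit in the message comes from the `TIMEOUT` setting in process_perfdata.cfg rather than from npcd.cfg, but verify that on your own install.

```shell
#!/bin/sh
# Tally "Timeout while processing" entries per host/service pair.
# Reads log lines on stdin; prints "<count> <host> / <service>",
# highest count first. The host and service names are the 2nd and 4th
# fields when the line is split on double quotes.
count_timeouts() {
    awk -F'"' '/Timeout while processing Host:/ { n[$2 " / " $4]++ }
               END { for (k in n) print n[k], k }' | sort -rn
}

# Example (log path is an assumption; adjust to your install):
# count_timeouts < /usr/local/nagios/var/perfdata.log
```

If one or two services dominate the list, the problem is more likely a slow template or RRD update for those services than npcd itself.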

I stopped and restarted npcd and ndo2db, no joy. Stopped and started all Nagios services and still no joy.

service nagiosxi stop
service npcd stop
service ndo2db stop
service nagios stop
service postgresql stop
service mysqld stop
service httpd stop

service httpd start
service mysqld start
service postgresql start
service nagios start
service ndo2db start
service npcd start
service nagiosxi start

The server has not been restarted:
[root@LPNAGV04 var]# uptime
14:31:28 up 1 day, 6:56, 1 user, load average: 12.93, 9.26, 5.53

So I shut down all Nagios services again, ran repairmysql on nagios and nagiosql, then ran dbmaint. I am STILL not getting perfdata and perfdata.log has not been updated, yet files are coming and going in the perfdata spool directory.

[root@LPNAGV04 var]# ls -lt /var/nagiosramdisk/spool/perfdata/
total 328
-rw-rw-r-- 1 nagios nagios 175933 Apr 8 14:51 1396983103.perfdata.service-PID-23515
-rw-rw-r-- 1 nagios nagios 30763 Apr 8 14:51 1396983103.perfdata.host-PID-23516
-rw-rw-r-- 1 nagios nagios 95863 Apr 8 14:51 1396983088.perfdata.service-PID-23514
-rw-rw-r-- 1 nagios nagios 27680 Apr 8 14:51 1396983088.perfdata.host-PID-23513
[root@LPNAGV04 var]# ls -lt /var/nagiosramdisk/spool/perfdata/
total 268
-rw-rw-r-- 1 nagios nagios 175933 Apr 8 14:51 1396983103.perfdata.service-PID-23515
-rw-rw-r-- 1 nagios nagios 95863 Apr 8 14:51 1396983088.perfdata.service-PID-23514
[root@LPNAGV04 var]# ls -lt /var/nagiosramdisk/spool/perfdata/

I've missed something and am clueless as to what it is....

Post by **pnewlon** » Tue Apr 08, 2014 3:07 pm

I give up. Nothing about this makes sense. I found '/usr/local/nagios/etc/pnp/process_perfdata.cfg' and changed LOG_LEVEL from 0 to 2. I waited a few minutes and my perfdata.log file started getting tons of entries. And now I have perfdata. I changed it back to 1 and it is still working OK. Not sure what gives, but I hope this thing runs through the night.

Post by **tmcdonald** » Tue Apr 08, 2014 3:19 pm

pnewlon wrote:
[root@LPNAGV04 var]# uptime
14:31:28 up 1 day, 6:56, 1 user, load average: 12.93, 9.26, 5.53

That's probably your answer. A load over 10 by default will cause perfdata to stop being processed. My guess is that you probably had a load spike starting then that did not go down until recently. You can change this in the /usr/local/nagios/etc/pnp/npcd.cfg file by changing load_threshold to 20.0 or higher.
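A quick way to check this is to compare the current 1-minute load with the configured `load_threshold`. The following is a minimal sketch, not an official tool: the function names are hypothetical, the config path is the one mentioned in the post above, and it assumes a threshold of 0 means the check is disabled.

```shell
#!/bin/sh
# Print the load_threshold value from npcd.cfg-style text on stdin.
get_load_threshold() {
    awk -F'=' '/^[ \t]*load_threshold[ \t]*=/ {
        gsub(/[ \t]/, "", $2); print $2; exit
    }'
}

# Exit 0 when load is above the threshold.
# Assumption: a threshold of 0 means the check is disabled.
load_over_threshold() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(t > 0 && l > t) }'
}

cfg=/usr/local/nagios/etc/pnp/npcd.cfg   # path taken from the post above
if [ -r "$cfg" ] && [ -r /proc/loadavg ]; then
    threshold=$(get_load_threshold < "$cfg")
    read -r load1 rest < /proc/loadavg    # first field is the 1-minute load
    if load_over_threshold "$load1" "${threshold:-0}"; then
        echo "load $load1 is above load_threshold $threshold"
    else
        echo "load $load1 is within load_threshold ${threshold:-unset}"
    fi
fi
```

With the load of 12.93 from the uptime output, a threshold of 10.0 would be exceeded, while a threshold of 30.0 would not.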

Post by **pnewlon** » Wed Apr 09, 2014 7:02 am

Thanks! I bumped that up a long time ago because I found it was annoying to have it stop and never catch back up again. load_threshold = 30.0

Post by **pnewlon** » Wed Apr 09, 2014 7:07 am

A different symptom this morning. Service/host checks stopped at 2:38 AM. Yesterday host/service checks stayed running and perfdata stopped. Today it is just the opposite. Why would there be a fork error that completely stops checks?

tail nagios.log
[1397025488] Warning: The check of service 'Ping' on host '06375_OB01-DSP' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.
[1397025488] Warning: The check of service 'Ping' on host '11278_HTTP' could not be performed due to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled.

8 GB of RAM: 6 GB in use, 2 GB free. Swap file: 2 GB, 0k used. I'm guessing that memory isn't the problem. All services are running.

tail npcd.log
[04-09-2014 08:04:22] NPCD: Processing file '1397045052.perfdata.host'
[04-09-2014 08:04:22] NPCD: Processing file '1397045052.perfdata.service'
[04-09-2014 08:04:22] NPCD: No more files to process... waiting for 15 seconds

tail perfdata.log
2014-04-09 08:05:08 [28487] [1] process_perfdata.pl-0.6.11 starting in BULK Mode called by NPCD
2014-04-09 08:05:08 [28487] [1] 0 lines processed
2014-04-09 08:05:08 [28487] [1] /var/nagiosramdisk/spool/perfdata//1397045097.perfdata.host-PID-28487 deleted
2014-04-09 08:05:08 [28488] [1] process_perfdata.pl-0.6.11 starting in BULK Mode called by NPCD
2014-04-09 08:05:08 [28488] [1] 0 lines processed
2014-04-09 08:05:08 [28488] [1] /var/nagiosramdisk/spool/perfdata//1397045097.perfdata.service-PID-28488 deleted
2014-04-09 08:05:08 [28488] [1] PNP exiting (runtime 0.000659s) ...
2014-04-09 08:05:08 [28487] [1] PNP exiting (runtime 0.00062s) ...

[root@LPNAGV04 var]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  7.5G  2.7G  4.5G  38% /
tmpfs                         3.9G     0  3.9G   0% /dev/shm
/dev/sda1                     485M   50M  410M  11% /boot
/dev/sdb1                      79G  4.2G   71G   6% /usr/local
/dev/sdc1                      79G   17G   58G  23% /store
tmpfs                         750M   35M  716M   5% /var/nagiosramdisk
[root@LPNAGV04 var]# df -i
Filesystem                     Inodes IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root   494832 86910  407922   18% /
tmpfs                         1007596     1 1007595    1% /dev/shm
/dev/sda1                      128016    44  127972    1% /boot
/dev/sdb1                     5242880 25531 5217349    1% /usr/local
/dev/sdc1                     5242880   684 5242196    1% /store
tmpfs                         1007596  4821 1002775    1% /var/nagiosramdisk

Post by **pnewlon** » Wed Apr 09, 2014 7:18 am

Stopped and restarted all services:

tail nagios.log
[1397045327] Nagios 3.5.0 starting... (PID=32137)
[1397045327] Local time is Wed Apr 09 08:08:47 EDT 2014
[1397045327] LOG VERSION: 2.0
[1397045327] ndomod: NDOMOD 1.5.2 (06-08-2012) Copyright (c) 2009 Nagios Core Development Team and Community Contributors
[1397045327] ndomod: Could not open data sink! I'll keep trying, but some output may get lost...
[1397045327] Event broker module '/usr/local/nagios/bin/ndomod.o' initialized successfully.
[1397045330] Finished daemonizing... (New PID=32138)
[1397045343] ndomod: Successfully connected to data sink. 22102 items lost, 5000 queued items to flush.
[1397045343] ndomod: Successfully flushed 5000 queued items to data sink.

tail npcd.log
[04-09-2014 08:16:19] NPCD: No more files to process... waiting for 15 seconds
[04-09-2014 08:16:34] NPCD: Processing file '1397045781.perfdata.host'
[04-09-2014 08:16:34] NPCD: Processing file '1397045781.perfdata.service'
[04-09-2014 08:16:35] NPCD: No more files to process... waiting for 15 seconds
[04-09-2014 08:16:50] NPCD: Processing file '1397045799.perfdata.host'
[04-09-2014 08:16:50] NPCD: Processing file '1397045799.perfdata.service'
[04-09-2014 08:16:51] NPCD: No more files to process... waiting for 15 seconds
[04-09-2014 08:17:06] NPCD: Processing file '1397045816.perfdata.host'
[04-09-2014 08:17:06] NPCD: Processing file '1397045816.perfdata.service'
[04-09-2014 08:17:06] NPCD: No more files to process... waiting for 15 seconds

tail perfdata.log
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00860_OB04_DSP / Ping (rta=79.949997ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00875_ENV01 / Ping (rta=90.785004ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00875_IB01-DSP / Ping (rta=85.039001ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00875_IB02-MP / Ping (rta=80.989998ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00875_IB03-AMT / Ping (rta=87.903000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00875_IB04-DSP / Ping (rta=76.797997ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0)
2014-04-09 08:17:22 [21284] [1] Found Performance Data for 00851_IB11-DSP / NECtemp2F (tempF=98.6)
2014-04-09 08:17:22 [21284] [1] 138 lines processed
2014-04-09 08:17:22 [21284] [1] /var/nagiosramdisk/spool/perfdata//1397045826.perfdata.service-PID-21284 deleted
2014-04-09 08:17:22 [21284] [1] PNP exiting (runtime 0.211911s) ...

Post by **abrist** » Wed Apr 09, 2014 12:39 pm

Fork/resource errors are usually caused by system ulimits or kernel msg limits. See the following FAQ entry:
http://support.nagios.com/wiki/index.ph ... g_Orphaned

Post by **pnewlon** » Wed Apr 09, 2014 2:40 pm

# ulimit -a

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62829
max locked memory (kbytes, -l) 128
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 20480
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
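
Note that `ulimit -a` reports the limits of the shell it is run in, which may differ from the limits of the running nagios daemon. On Linux, fork() fails with 'Resource temporarily unavailable' (EAGAIN) when the per-user process limit is reached, and that limit counts threads as well as processes. The sketch below reads the soft limit of the running daemon from `/proc/<pid>/limits` and counts the threads owned by the user. The function names are hypothetical, and the process and user name `nagios` are assumptions to adjust for your install.

```shell
#!/bin/sh
# Extract the soft "Max processes" limit from /proc/<pid>/limits text on stdin.
# Columns in that file are: name (two words here), soft limit, hard limit, units.
soft_nproc_limit() {
    awk '/^Max processes/ { print $3; exit }'
}

# Count threads owned by a user (ps -L prints one line per thread).
count_user_threads() {
    ps -L -u "$1" -o pid= 2>/dev/null | wc -l
}

# Assumption: the daemon's process name and user are both "nagios".
pid=$(pgrep -o -x nagios 2>/dev/null) || pid=
if [ -n "$pid" ] && [ -r "/proc/$pid/limits" ]; then
    echo "nagios soft nproc limit: $(soft_nproc_limit < "/proc/$pid/limits")"
    echo "threads owned by nagios: $(count_user_threads nagios)"
fi
```

If the thread count sits close to the soft limit while checks are failing, the per-user process limit is a likely cause of the fork() errors.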

Post by **lmiltchev** » Thu Apr 10, 2014 10:17 am

Do you have any backed up perfdata files? Run the following commands and show the output:

ls /usr/local/nagios/var/spool/xidpe | wc -l
ls /usr/local/nagios/var/spool/perfdata | wc -l
ls /usr/local/nagios/var/spool/checkresults | wc -l

Post by **pnewlon** » Fri Apr 11, 2014 1:01 pm

@lmiltchev - no, none were being generated...

Since changing the ulimits the system has run flawlessly. I think this can be closed now. Thanks!
