
Halp! All My Graphs have Stopped!

Posted: Mon Jun 30, 2014 9:07 am
by BenGatewood
As per the subject, all my graphs have stopped updating since 10:06 this morning. I thought it might be related to the ":" in the service name bug, but it's affecting services without colons too. I have installed the pnp.zip component in any case. My event log from the time is filled with messages like this:

wproc: 'Core Worker 18702' seems to be choked. ret = -1; bufsize = 5008: errno = 11 (Resource temporarily unavailable)

Any ideas?

Ben

Re: Halp! All My Graphs have Stopped!

Posted: Mon Jun 30, 2014 9:32 am
by slansing
Hmm, what version of XI 2014 are you running? Can you post the output of:

Code: Select all

tail -30 /usr/local/nagios/var/perfdata.log
tail -30 /usr/local/nagios/var/npcd.log
tail -30 /usr/local/nagios/var/nagios.log

Code: Select all

ll /usr/local/nagios/var/spool/xidpe/ | wc -l
ll /usr/local/nagios/var/spool/perfdata/ | wc -l

Code: Select all

df -h

df -i

Code: Select all

top | head -10

Re: Halp! All My Graphs have Stopped!

Posted: Mon Jun 30, 2014 9:36 am
by BenGatewood
Version is R1.2. Output:

tail -30 /usr/local/nagios/var/perfdata.log
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119035.perfdata.service-PID-18124 deleted
2014-06-30 14:09:49 [18124] [0] *** Timeout while processing Host: "METROE06.THW" Service: "1_9_EOG.OLB_-P_Bandwidth"
2014-06-30 14:09:49 [18124] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119066.perfdata.service-PID-18773 deleted
2014-06-30 14:09:55 [18773] [0] *** Timeout while processing Host: "SW01.COR.EOG" Service: "CPU_Total_Utilisation"
2014-06-30 14:09:55 [18773] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119051.perfdata.service-PID-18771 deleted
2014-06-30 14:09:55 [18771] [0] *** Timeout while processing Host: "SW03.LOM.EOG" Service: "Slot_1_State"
2014-06-30 14:09:55 [18771] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119064.perfdata.host-PID-18772 deleted
2014-06-30 14:09:55 [18772] [0] *** Timeout while processing Host: "as-01" Service: "_HOST_"
2014-06-30 14:09:55 [18772] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119095.perfdata.service-PID-20601 deleted
2014-06-30 14:10:00 [20601] [0] *** Timeout while processing Host: "PBDEMUX04.EQX" Service: "Port_1016_Bandwidth"
2014-06-30 14:10:00 [20601] [0] *** process_perfdata.pl terminated on signal ALRM

tail -30 /usr/local/nagios/var/npcd.log
[06-30-2014 15:26:44] NPCD: WARN: MAX load reached: load 17.940000/10.000000 at i=37
[06-30-2014 15:26:59] NPCD: WARN: MAX load reached: load 15.410000/10.000000 at i=37
[06-30-2014 15:27:14] NPCD: WARN: MAX load reached: load 15.990000/10.000000 at i=37
[06-30-2014 15:27:29] NPCD: WARN: MAX load reached: load 22.300000/10.000000 at i=37
[06-30-2014 15:27:44] NPCD: WARN: MAX load reached: load 20.970000/10.000000 at i=37
[06-30-2014 15:27:59] NPCD: WARN: MAX load reached: load 25.140000/10.000000 at i=37
[06-30-2014 15:28:14] NPCD: WARN: MAX load reached: load 26.190000/10.000000 at i=37
[06-30-2014 15:28:29] NPCD: WARN: MAX load reached: load 23.330000/10.000000 at i=37
[06-30-2014 15:28:45] NPCD: WARN: MAX load reached: load 27.470000/10.000000 at i=37
[06-30-2014 15:29:00] NPCD: WARN: MAX load reached: load 38.450000/10.000000 at i=37
[06-30-2014 15:29:15] NPCD: WARN: MAX load reached: load 45.890000/10.000000 at i=37
[06-30-2014 15:29:31] NPCD: WARN: MAX load reached: load 46.620000/10.000000 at i=37
[06-30-2014 15:29:46] NPCD: WARN: MAX load reached: load 45.610000/10.000000 at i=37
[06-30-2014 15:30:01] NPCD: WARN: MAX load reached: load 44.370000/10.000000 at i=37
[06-30-2014 15:30:16] NPCD: WARN: MAX load reached: load 52.290000/10.000000 at i=37
[06-30-2014 15:30:32] NPCD: WARN: MAX load reached: load 52.380000/10.000000 at i=37
[06-30-2014 15:30:47] NPCD: WARN: MAX load reached: load 47.350000/10.000000 at i=37
[06-30-2014 15:31:02] NPCD: WARN: MAX load reached: load 43.810000/10.000000 at i=37
[06-30-2014 15:31:17] NPCD: WARN: MAX load reached: load 37.350000/10.000000 at i=37
[06-30-2014 15:31:32] NPCD: WARN: MAX load reached: load 31.270000/10.000000 at i=37
[06-30-2014 15:31:47] NPCD: WARN: MAX load reached: load 25.940000/10.000000 at i=37
[06-30-2014 15:32:02] NPCD: WARN: MAX load reached: load 22.960000/10.000000 at i=37
[06-30-2014 15:32:17] NPCD: WARN: MAX load reached: load 23.050000/10.000000 at i=37
[06-30-2014 15:32:32] NPCD: WARN: MAX load reached: load 28.850000/10.000000 at i=37
[06-30-2014 15:32:47] NPCD: WARN: MAX load reached: load 26.180000/10.000000 at i=37
[06-30-2014 15:33:04] NPCD: WARN: MAX load reached: load 30.130000/10.000000 at i=37
[06-30-2014 15:33:19] NPCD: WARN: MAX load reached: load 27.910000/10.000000 at i=37
[06-30-2014 15:33:34] NPCD: WARN: MAX load reached: load 26.750000/10.000000 at i=37
[06-30-2014 15:33:49] NPCD: WARN: MAX load reached: load 24.160000/10.000000 at i=37
[06-30-2014 15:34:04] NPCD: WARN: MAX load reached: load 28.560000/10.000000 at i=37

tail -30 /usr/local/nagios/var/nagios.log
[1404138822] SERVICE NOTIFICATION: nagiosadmin;PEER02.EQX;Port 532 Status;CRITICAL;xi_service_notification_handler;CRITICAL: Interface ge-1/1/6 (index 532) is down.
[1404138822] SERVICE NOTIFICATION: clarkb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: gatewoodb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: perkinsb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: sadlerb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: nagiosadmin;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;xi_service_notification_handler;CRITICAL - table cache hitrate 9.79%
[1404138838] HOST NOTIFICATION: gatewoodb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: perkinsb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: sadlerb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: clarkb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: nagiosadmin;APC01.TS.ES;DOWN;xi_host_notification_handler;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138845] SERVICE NOTIFICATION: nagiosadmin;palladion.eqx;Swap Usage;CRITICAL;xi_service_notification_handler;ERROR: Description/Type table : No response from remote host '10.100.6.160'.
[1404138845] SERVICE NOTIFICATION: clarkb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: gatewoodb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: perkinsb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: sadlerb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: nagiosadmin;APC01.KNG.EOG;Output Load;WARNING;xi_service_notification_handler;SNMP WARNING - *85*
[1404138851] SERVICE ALERT: APC01.PRK.EOG;Input Line Voltage;WARNING;SOFT;1;SNMP WARNING - *239*
[1404138853] SERVICE NOTIFICATION: clarkb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: gatewoodb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: perkinsb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: sadlerb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: nagiosadmin;orkweb01;MySQL Index Usage;CRITICAL;xi_service_notification_handler;CRITICAL - index usage 59.35%
[1404138863] SERVICE NOTIFICATION: nagiosadmin;PEER02.THW;Port 528 Status;CRITICAL;xi_service_notification_handler;CRITICAL: Interface ge-1/1/2 (index 528) is down.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Slot 8 State;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Software Revision to use on Boot;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;sysUpTimeInstance;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Primary Software Revision;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] HOST FLAPPING ALERT: SW02.CEN.EOG;STOPPED; Host appears to have stopped flapping (4.7% change < 5.0% threshold)
[1404138867] HOST NOTIFICATION: nagiosadmin;SW02.CEN.EOG;FLAPPINGSTOP (UP);xi_host_notification_handler;OK - 10.33.10.12: rta 23.372ms, lost 0%


ll /usr/local/nagios/var/spool/xidpe/ | wc -l
1
ll /usr/local/nagios/var/spool/perfdata/ | wc -l
2546

df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root 50G 17G 31G 35% /
tmpfs 7.7G 0 7.7G 0% /dev/shm
/dev/xvda1 485M 34M 426M 8% /boot
/dev/mapper/VolGroup-lv_home 40G 176M 38G 1% /home

df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root 3276800 167495 3109305 6% /
tmpfs 1992341 1 1992340 1% /dev/shm
/dev/xvda1 128016 38 127978 1% /boot
/dev/mapper/VolGroup-lv_home 2613248 17 2613231 1% /home

Re: Halp! All My Graphs have Stopped!

Posted: Mon Jun 30, 2014 11:30 am
by slansing
Looks like you are running into NPCD's load threshold. You will need to either decrease your system's load or increase the NPCD setting in:

Code: Select all

/usr/local/nagios/etc/pnp/npcd.cfg
You need to increase:

Code: Select all

load_threshold = 10.0
Then save the file and:

Code: Select all

service npcd restart
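
For reference, the settings involved live in two files (paths per the standard XI layout; the values shown are the shipped defaults, which match what your logs are reporting):

Code: Select all

```
# /usr/local/nagios/etc/pnp/npcd.cfg
load_threshold = 10.0    # NPCD skips processing while system load exceeds this
npcd_max_threads = 5     # parallel process_perfdata.pl workers

# /usr/local/nagios/etc/pnp/process_perfdata.cfg
TIMEOUT = 5              # seconds before processing of one perfdata file aborts
```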

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 3:37 am
by BenGatewood
OK, I have done both of these things: I increased the thresholds for NPCD and the perfdata timeout (as I was seeing timeouts in the log), and I also lowered the number and frequency of checks on the server overall. But now I have a load of switch interfaces reporting zero bandwidth (which I know is incorrect). Do I have to wait for the server to 'catch up' on processing something? What can I check to troubleshoot?

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 4:22 am
by BenGatewood
I think I have some serious problems with this now. I have added some new switch ports which keep complaining that their RRDs are missing, even though the files are present in the directory :-/

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 9:05 am
by BenGatewood
Still having troubles - still seeing these log entries:

2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Timeout after 80 secs. ***
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Please check your npcd.cfg
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404218088.perfdata.service-PID-10072 deleted
2014-07-01 15:01:52 [10072] [0] *** Timeout while processing Host: "SW01.COR.EOG" Service: "Port_2048_Bandwidth"
2014-07-01 15:01:52 [10072] [0] *** process_perfdata.pl terminated on signal ALRM

I have wound out load thresholds and timeouts and increased the NPCD thread count from 5 to 10. What else can I do?

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 10:12 am
by slansing
Can we get another run of:

Code: Select all

ll /usr/local/nagios/var/spool/perfdata/ | wc -l
What did you bump your load threshold to? You didn't mention. The server will run at up to whatever load you prescribe there until the backlog is processed. It might be that we need to remove the old data in that perfdata directory to jump-start the processing, as it could be clogged by now.
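
If we do clear it, here is one way to do it safely (a sketch, assuming the default XI spool path; stop npcd first, and note it moves stale files aside rather than deleting them, so the data stays recoverable):

Code: Select all

```shell
#!/bin/sh
# Sketch: move stale perfdata spool files aside so NPCD can catch up.
archive_stale() {
    spool=$1; archive=$2; age_min=$3
    mkdir -p "$archive"
    # Move (not delete) anything older than $age_min minutes out of the
    # spool so it can be replayed or inspected later.
    find "$spool" -maxdepth 1 -type f -mmin +"$age_min" \
        -exec mv {} "$archive"/ \;
}

# Demo on a throwaway directory; on the real box you would run:
#   archive_stale /usr/local/nagios/var/spool/perfdata /tmp/perfdata-backlog 60
demo=$(mktemp -d)
mkdir -p "$demo/spool"
touch "$demo/spool/fresh.perfdata"
touch -t 202401010000 "$demo/spool/stale.perfdata"
archive_stale "$demo/spool" "$demo/backlog" 60
ls "$demo/spool"     # only fresh.perfdata remains
```

Then a service npcd restart so it picks up a clean spool.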

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 10:15 am
by sreinhardt
Let's look at how many files it is currently trying to process. Increasing the timeout and spacing out the reaping of perfdata is fine, but it can lead to too much perfdata being processed at one time.

Code: Select all

ls -l /usr/local/nagios/var/spool/xidpe | wc -l
ls -l /usr/local/nagios/var/spool/perfdata | wc -l
ls -l /usr/local/nagios/var/spool/checkresults | wc -l

Re: Halp! All My Graphs have Stopped!

Posted: Tue Jul 01, 2014 10:27 am
by BenGatewood
ls -l /usr/local/nagios/var/spool/xidpe | wc -l
2
ls -l /usr/local/nagios/var/spool/perfdata | wc -l
602
ls -l /usr/local/nagios/var/spool/checkresults | wc -l
1


This is basically a new install, so I'm happy to clear out the unprocessed perfdata if that gets things going again properly.

I was reading another thread (http://support.nagios.com/forum/viewtop ... 6&start=20) and think I may also have some MRTG orphans but I don't know if they are a cause or a symptom.


As part of that, running this command:

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --check

Takes a looooooooooong time. Not sure if that's relevant or not.
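
For scale, counting the Target[] lines gives a rough idea of how much --check has to chew through (a sketch; the inline sample is just for illustration, on the real box you'd grep /etc/mrtg/mrtg.cfg directly):

Code: Select all

```shell
# Each Target[] line in an MRTG config is one interface to poll, so the
# count is a rough gauge of how long --check (and each poll cycle) takes.
# Demo uses an inline sample; point the grep at /etc/mrtg/mrtg.cfg for real.
MRTG_CFG=$(mktemp)
cat > "$MRTG_CFG" <<'EOF'
Target[sw01_1]: 1:public@10.0.0.1:
Target[sw01_2]: 2:public@10.0.0.2:
EOF
grep -c '^Target\[' "$MRTG_CFG"   # prints 2
```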