Halp! All My Graphs have Stopped!

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Post by BenGatewood »

As per the subject, all my graphs stopped updating at 10:06 this morning. I thought it might be related to the bug with ":" in service names, but it's affecting services without colons too. I have installed the pnp.zip component in any case. My event log from around that time is filled with messages like this:

wproc: 'Core Worker 18702' seems to be choked. ret = -1; bufsize = 5008: errno = 11 (Resource temporarily unavailable)

Any ideas?

Ben
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Halp! All My Graphs have Stopped!

Post by slansing »

Hmm, what version of XI 2014 are you running? Can you post the output of:

tail -30 /usr/local/nagios/var/perfdata.log
tail -30 /usr/local/nagios/var/npcd.log
tail -30 /usr/local/nagios/var/nagios.log

ll /usr/local/nagios/var/spool/xidpe/ | wc -l
ll /usr/local/nagios/var/spool/perfdata/ | wc -l

df -h

df -i

top | head -10
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

Version is R1.2. Output:

tail -30 /usr/local/nagios/var/perfdata.log
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:49 [18124] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119035.perfdata.service-PID-18124 deleted
2014-06-30 14:09:49 [18124] [0] *** Timeout while processing Host: "METROE06.THW" Service: "1_9_EOG.OLB_-P_Bandwidth"
2014-06-30 14:09:49 [18124] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18773] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119066.perfdata.service-PID-18773 deleted
2014-06-30 14:09:55 [18773] [0] *** Timeout while processing Host: "SW01.COR.EOG" Service: "CPU_Total_Utilisation"
2014-06-30 14:09:55 [18773] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18771] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119051.perfdata.service-PID-18771 deleted
2014-06-30 14:09:55 [18771] [0] *** Timeout while processing Host: "SW03.LOM.EOG" Service: "Slot_1_State"
2014-06-30 14:09:55 [18771] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:09:55 [18772] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119064.perfdata.host-PID-18772 deleted
2014-06-30 14:09:55 [18772] [0] *** Timeout while processing Host: "as-01" Service: "_HOST_"
2014-06-30 14:09:55 [18772] [0] *** process_perfdata.pl terminated on signal ALRM
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Timeout after 5 secs. ***
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: Please check your npcd.cfg
2014-06-30 14:10:00 [20601] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404119095.perfdata.service-PID-20601 deleted
2014-06-30 14:10:00 [20601] [0] *** Timeout while processing Host: "PBDEMUX04.EQX" Service: "Port_1016_Bandwidth"
2014-06-30 14:10:00 [20601] [0] *** process_perfdata.pl terminated on signal ALRM

tail -30 /usr/local/nagios/var/npcd.log
[06-30-2014 15:26:44] NPCD: WARN: MAX load reached: load 17.940000/10.000000 at i=37
[06-30-2014 15:26:59] NPCD: WARN: MAX load reached: load 15.410000/10.000000 at i=37
[06-30-2014 15:27:14] NPCD: WARN: MAX load reached: load 15.990000/10.000000 at i=37
[06-30-2014 15:27:29] NPCD: WARN: MAX load reached: load 22.300000/10.000000 at i=37
[06-30-2014 15:27:44] NPCD: WARN: MAX load reached: load 20.970000/10.000000 at i=37
[06-30-2014 15:27:59] NPCD: WARN: MAX load reached: load 25.140000/10.000000 at i=37
[06-30-2014 15:28:14] NPCD: WARN: MAX load reached: load 26.190000/10.000000 at i=37
[06-30-2014 15:28:29] NPCD: WARN: MAX load reached: load 23.330000/10.000000 at i=37
[06-30-2014 15:28:45] NPCD: WARN: MAX load reached: load 27.470000/10.000000 at i=37
[06-30-2014 15:29:00] NPCD: WARN: MAX load reached: load 38.450000/10.000000 at i=37
[06-30-2014 15:29:15] NPCD: WARN: MAX load reached: load 45.890000/10.000000 at i=37
[06-30-2014 15:29:31] NPCD: WARN: MAX load reached: load 46.620000/10.000000 at i=37
[06-30-2014 15:29:46] NPCD: WARN: MAX load reached: load 45.610000/10.000000 at i=37
[06-30-2014 15:30:01] NPCD: WARN: MAX load reached: load 44.370000/10.000000 at i=37
[06-30-2014 15:30:16] NPCD: WARN: MAX load reached: load 52.290000/10.000000 at i=37
[06-30-2014 15:30:32] NPCD: WARN: MAX load reached: load 52.380000/10.000000 at i=37
[06-30-2014 15:30:47] NPCD: WARN: MAX load reached: load 47.350000/10.000000 at i=37
[06-30-2014 15:31:02] NPCD: WARN: MAX load reached: load 43.810000/10.000000 at i=37
[06-30-2014 15:31:17] NPCD: WARN: MAX load reached: load 37.350000/10.000000 at i=37
[06-30-2014 15:31:32] NPCD: WARN: MAX load reached: load 31.270000/10.000000 at i=37
[06-30-2014 15:31:47] NPCD: WARN: MAX load reached: load 25.940000/10.000000 at i=37
[06-30-2014 15:32:02] NPCD: WARN: MAX load reached: load 22.960000/10.000000 at i=37
[06-30-2014 15:32:17] NPCD: WARN: MAX load reached: load 23.050000/10.000000 at i=37
[06-30-2014 15:32:32] NPCD: WARN: MAX load reached: load 28.850000/10.000000 at i=37
[06-30-2014 15:32:47] NPCD: WARN: MAX load reached: load 26.180000/10.000000 at i=37
[06-30-2014 15:33:04] NPCD: WARN: MAX load reached: load 30.130000/10.000000 at i=37
[06-30-2014 15:33:19] NPCD: WARN: MAX load reached: load 27.910000/10.000000 at i=37
[06-30-2014 15:33:34] NPCD: WARN: MAX load reached: load 26.750000/10.000000 at i=37
[06-30-2014 15:33:49] NPCD: WARN: MAX load reached: load 24.160000/10.000000 at i=37
[06-30-2014 15:34:04] NPCD: WARN: MAX load reached: load 28.560000/10.000000 at i=37

tail -30 /usr/local/nagios/var/nagios.log
[1404138822] SERVICE NOTIFICATION: nagiosadmin;PEER02.EQX;Port 532 Status;CRITICAL;xi_service_notification_handler;CRITICAL: Interface ge-1/1/6 (index 532) is down.
[1404138822] SERVICE NOTIFICATION: clarkb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: gatewoodb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: perkinsb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: sadlerb;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;service-html-email;CRITICAL - table cache hitrate 9.79%
[1404138822] SERVICE NOTIFICATION: nagiosadmin;orkweb01;MySQL Table Cache Hit Rate;CRITICAL;xi_service_notification_handler;CRITICAL - table cache hitrate 9.79%
[1404138838] HOST NOTIFICATION: gatewoodb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: perkinsb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: sadlerb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: clarkb;APC01.TS.ES;DOWN;host-html-email;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138838] HOST NOTIFICATION: nagiosadmin;APC01.TS.ES;DOWN;xi_host_notification_handler;CRITICAL - 172.16.100.11: rta nan, lost 100%
[1404138845] SERVICE NOTIFICATION: nagiosadmin;palladion.eqx;Swap Usage;CRITICAL;xi_service_notification_handler;ERROR: Description/Type table : No response from remote host '10.100.6.160'.
[1404138845] SERVICE NOTIFICATION: clarkb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: gatewoodb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: perkinsb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: sadlerb;APC01.KNG.EOG;Output Load;WARNING;service-html-email;SNMP WARNING - *85*
[1404138845] SERVICE NOTIFICATION: nagiosadmin;APC01.KNG.EOG;Output Load;WARNING;xi_service_notification_handler;SNMP WARNING - *85*
[1404138851] SERVICE ALERT: APC01.PRK.EOG;Input Line Voltage;WARNING;SOFT;1;SNMP WARNING - *239*
[1404138853] SERVICE NOTIFICATION: clarkb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: gatewoodb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: perkinsb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: sadlerb;orkweb01;MySQL Index Usage;CRITICAL;service-html-email;CRITICAL - index usage 59.35%
[1404138853] SERVICE NOTIFICATION: nagiosadmin;orkweb01;MySQL Index Usage;CRITICAL;xi_service_notification_handler;CRITICAL - index usage 59.35%
[1404138863] SERVICE NOTIFICATION: nagiosadmin;PEER02.THW;Port 528 Status;CRITICAL;xi_service_notification_handler;CRITICAL: Interface ge-1/1/2 (index 528) is down.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Slot 8 State;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Software Revision to use on Boot;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;sysUpTimeInstance;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] SERVICE ALERT: SW02.CEN.EOG;Primary Software Revision;UNKNOWN;SOFT;1;External command error: Timeout: No Response from 10.33.10.12:161.
[1404138867] HOST FLAPPING ALERT: SW02.CEN.EOG;STOPPED; Host appears to have stopped flapping (4.7% change < 5.0% threshold)
[1404138867] HOST NOTIFICATION: nagiosadmin;SW02.CEN.EOG;FLAPPINGSTOP (UP);xi_host_notification_handler;OK - 10.33.10.12: rta 23.372ms, lost 0%


ll /usr/local/nagios/var/spool/xidpe/ | wc -l
1
ll /usr/local/nagios/var/spool/perfdata/ | wc -l
2546

df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root   50G   17G   31G  35% /
tmpfs                         7.7G     0  7.7G   0% /dev/shm
/dev/xvda1                    485M   34M  426M   8% /boot
/dev/mapper/VolGroup-lv_home   40G  176M   38G   1% /home

df -i
Filesystem                     Inodes  IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root  3276800 167495 3109305    6% /
tmpfs                         1992341      1 1992340    1% /dev/shm
/dev/xvda1                     128016     38  127978    1% /boot
/dev/mapper/VolGroup-lv_home  2613248     17 2613231    1% /home
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Halp! All My Graphs have Stopped!

Post by slansing »

It looks like you are running into NPCD's load threshold: your npcd.log shows loads between 15 and 52 against a limit of 10. You will need to either decrease your system's load or raise the limit in:

/usr/local/nagios/etc/pnp/npcd.cfg

The setting to increase is:

load_threshold = 10.0

Then save the file and restart NPCD:

service npcd restart
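
For anyone following along, here is a rough sketch of checking how far the current load sits above the configured limit before editing anything. The function names `npcd_threshold` and `load_exceeds` are made up for illustration. The assumption that NPCD holds off processing whenever the 1-minute load is above `load_threshold` is inferred from the "MAX load reached" lines in npcd.log, not taken from the NPCD source.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Nagios XI or PNP.

# Print the value of the first uncommented "load_threshold = N" line in a config file.
npcd_threshold() {
    awk -F= '/^[ \t]*load_threshold[ \t]*=/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Succeed (exit status 0) when load $1 is numerically greater than threshold $2.
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# Example use on a Linux host (the first field of /proc/loadavg is the 1-minute load):
#   cfg=/usr/local/nagios/etc/pnp/npcd.cfg
#   load_exceeds "$(cut -d' ' -f1 /proc/loadavg)" "$(npcd_threshold "$cfg")" \
#       && echo "load is above the NPCD threshold"
```

If the load stays above the limit for long stretches, raising the number only hides the symptom, so it is worth also looking at what is driving the load.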
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

OK, I have done both of these things. I have increased the thresholds for NPCD and the perfdata timeout (as I was seeing timeouts in the log), and I have also lowered the number and frequency of checks on the server overall. However, I now have a load of switch interfaces reporting zero bandwidth, which I know is incorrect. Do I have to wait for the server to 'catch up' on processing something? What can I check to troubleshoot?
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

I think I have some serious problems with this now. I have added some new switch ports which are continually complaining about their RRDs being missing despite the fact they are present in the directory :-/
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

Still having troubles - still seeing these log entries:

2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Timeout after 80 secs. ***
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Deleting current file to avoid NPCD loops
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: Please check your npcd.cfg
2014-07-01 15:01:52 [10072] [0] *** TIMEOUT: /usr/local/nagios/var/spool/perfdata//1404218088.perfdata.service-PID-10072 deleted
2014-07-01 15:01:52 [10072] [0] *** Timeout while processing Host: "SW01.COR.EOG" Service: "Port_2048_Bandwidth"
2014-07-01 15:01:52 [10072] [0] *** process_perfdata.pl terminated on signal ALRM

I have wound out load thresholds and timeouts and increased the NPCD thread count from 5 to 10. What else can I do?
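
One way to see whether the timeouts cluster on a few devices is to tally them per host. This is only a sketch: `timeouts_by_host` is a hypothetical helper, and it assumes the log lines keep the `Timeout while processing Host: "..."` format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read process_perfdata log lines on stdin and print
# "count host" pairs, most frequent first.
timeouts_by_host() {
    sed -n 's/.*Timeout while processing Host: "\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example use against the log path from this thread:
#   timeouts_by_host < /usr/local/nagios/var/perfdata.log | head
```

If one or two switches dominate the list, the slow part may be specific to their RRD files and not the server as a whole.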
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Halp! All My Graphs have Stopped!

Post by slansing »

Can we get another run of:

ll /usr/local/nagios/var/spool/perfdata/ | wc -l
What did you bump your load threshold to? You didn't mention. NPCD will keep working up to whatever maximum load you set there until the queued files are processed. We may need to remove the old data in that perfdata directory to jump-start processing, as it could be clogged by now.
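
If it does come to clearing the spool, moving the old files aside is a gentler option than deleting them outright. The following is a minimal sketch under a few assumptions: `park_stale_spool` is a hypothetical helper, the holding directory and age cut-off are whatever you choose, and `mv -t` requires GNU coreutils. Stopping npcd first (for example with `service npcd stop`) is presumably sensible so files are not moved mid-processing.

```shell
#!/bin/sh
# Hypothetical helper: move regular files older than a given age out of a spool directory.
#   $1 = spool directory, $2 = holding directory, $3 = minimum age in minutes
park_stale_spool() {
    spool=$1
    hold=$2
    age=$3
    mkdir -p "$hold" || return 1
    # -mmin +N matches files last modified more than N minutes ago.
    find "$spool" -maxdepth 1 -type f -mmin +"$age" -exec mv -t "$hold" {} +
}

# Example use (the holding path here is an arbitrary choice):
#   park_stale_spool /usr/local/nagios/var/spool/perfdata /tmp/perfdata-parked 60
```

Parked files can be deleted later once graphs are updating again.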
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Halp! All My Graphs have Stopped!

Post by sreinhardt »

Let's look at how many files it is currently trying to process. While increasing the timeout and spacing out the reaping of perfdata is great, it can lead to too much perfdata being processed at one time.

ls -l /usr/local/nagios/var/spool/xidpe | wc -l
ls -l /usr/local/nagios/var/spool/perfdata | wc -l
ls -l /usr/local/nagios/var/spool/checkresults | wc -l
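
A small caveat when reading these numbers: `ls -l` prints a leading "total" line, so `ls -l | wc -l` reports one more than the number of entries, and an empty directory shows as 1. A sketch of an exact count, using a hypothetical `count_files` helper built on `find`:

```shell
#!/bin/sh
# Hypothetical helper: count regular files directly inside a directory
# (no "total" line, no subdirectories). File names containing newlines
# would be miscounted, which should not matter for spool files.
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example use:
#   for d in xidpe perfdata checkresults; do
#       printf '%s: %s\n' "$d" "$(count_files "/usr/local/nagios/var/spool/$d")"
#   done
```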
BenGatewood
Posts: 35
Joined: Fri May 16, 2014 5:17 am

Re: Halp! All My Graphs have Stopped!

Post by BenGatewood »

ls -l /usr/local/nagios/var/spool/xidpe | wc -l
2
ls -l /usr/local/nagios/var/spool/perfdata | wc -l
602
ls -l /usr/local/nagios/var/spool/checkresults | wc -l
1


This is basically a new install so I'm happy to clear out unprocessed perfdata if it gets it going again properly.

I was reading another thread (http://support.nagios.com/forum/viewtop ... 6&start=20) and think I may also have some MRTG orphans but I don't know if they are a cause or a symptom.


As part of that, running this command:

LANG=C LC_ALL=C /usr/bin/mrtg /etc/mrtg/mrtg.cfg --check

Takes a looooooooooong time. Not sure if that's relevant or not.