Nagios stops working
Posted: Thu Feb 11, 2016 6:13 am
Hi guys,
we have a problem with one of our Nagios instances.
It's weird behavior: when Nagios is started everything works fine, but after some time no service checks are executed anymore. The Nagios process is still running and uses 100% CPU on one core.
So we enabled debugging and looked deeper into this. We can see that as soon as the behavior starts, Nagios only executes the "Event Check Loop", and it does so constantly! The debug file grows by 100 MB per minute.
In this particular example the last service check was executed on 10-02-2016 23:43:26, and I was looking at the entry from Thu Feb 11 09:44:01 2016.
So the "Next Low Priority Event Time:" is always in the past and has the date from when nagios stopped executing service checks.
[1455180237.632369] [008.1] [pid=5503] ** Event Check Loop
[1455180237.632373] [008.1] [pid=5503] Next High Priority Event Time: Thu Feb 11 09:44:01 2016
[1455180237.632377] [008.1] [pid=5503] Next Low Priority Event Time: Wed Feb 10 23:42:46 2016
[1455180237.632380] [008.1] [pid=5503] Current/Max Service Checks: 0/0
[1455180237.632382] [024.1] [pid=5503] We're not executing host checks right now, so we'll skip this event.
[1455180237.632384] [001.0] [pid=5503] remove_event()
[1455180237.632386] [064.1] [pid=5503] Making callbacks (type 8)...
[1455180237.632388] [001.0] [pid=5503] reschedule_event()
[1455180237.632390] [001.0] [pid=5503] add_event()
[1455180237.632392] [064.1] [pid=5503] Making callbacks (type 8)...
[1455180237.632400] [064.1] [pid=5503] Making callbacks (type 19)...
These entries are repeated over and over and are the only information in the debug file; there are no other messages in it.
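To put a number on how stale the low-priority timestamp is, a small Python sketch along these lines could be run over the debug file. It assumes the line layout shown in the excerpt above and that the server clock is on CET (as the "Local time" line in nagios.log suggests); the function name `low_priority_lag` is made up for this illustration and is not part of Nagios.

```python
import re
from datetime import datetime, timedelta, timezone

# Layout assumed from the debug excerpt:
# [epoch.micros] [level.verbosity] [pid=N] message
LINE_RE = re.compile(r"^\[(\d+\.\d+)\] \[\d+\.\d+\] \[pid=\d+\] (.*)$")
LOW_PRIO = "Next Low Priority Event Time:"


def low_priority_lag(line, tz):
    """Return how many seconds the low-priority event time lies behind the
    moment the debug line was written, or None for any other line."""
    m = LINE_RE.match(line.strip())
    if not m or not m.group(2).startswith(LOW_PRIO):
        return None
    written = datetime.fromtimestamp(float(m.group(1)), tz)
    stamp = m.group(2)[len(LOW_PRIO):].strip()
    # The event time is printed in ctime() style, in local time.
    event = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y").replace(tzinfo=tz)
    return (written - event).total_seconds()


CET = timezone(timedelta(hours=1))  # assumption: server local time is CET
sample = ("[1455180237.632377] [008.1] [pid=5503] "
          "Next Low Priority Event Time: Wed Feb 10 23:42:46 2016")
lag = low_priority_lag(sample, CET)
print(f"low-priority event is {lag / 3600:.1f} hours overdue")
```

A lag that keeps growing while the timestamp itself never changes would match what is described here: the low-priority event at the head of the queue is never processed.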
I searched Google but found only one message describing the same problem: http://permalink.gmane.org/gmane.networ ... user/73941
I found no answer to the problem there.
We are trying to capture a log that contains some entries from before the problem occurs (which is not easy because of the amount of data being written). Maybe we will find the cause there.
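One possible way to keep the entries leading up to the problem without storing gigabytes of debug output is a fixed-size ring buffer: only the most recent lines are held in memory and they are returned once a stall is detected. The Python sketch below is only an illustration; `capture_context`, the `is_stalled` predicate and the buffer size are hypothetical, and the lines could for example be fed from a `tail -F` of the debug file.

```python
from collections import deque


def capture_context(lines, is_stalled, keep=5000):
    """Hold only the newest `keep` lines in memory and return them
    (oldest first) as soon as `is_stalled(line)` is true for a line.
    Returns None if the input ends without a stall being detected."""
    recent = deque(maxlen=keep)  # older lines are dropped automatically
    for line in lines:
        recent.append(line)
        if is_stalled(line):
            return list(recent)
    return None


# Synthetic demo: ten normal lines, then a marker that stands for the stall.
demo = [f"line {i}" for i in range(10)] + ["STALL", "after"]
context = capture_context(demo, lambda l: l == "STALL", keep=3)
print(context)
```

In real use the predicate would have to recognise the stall from the debug lines themselves, for instance a "Next Low Priority Event Time" that is already several minutes in the past.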
Do you have an idea what to do in order to fix this problem?
Some information regarding this nagios instance:
SLES 11 SP4 (Linux xxx 3.0.101-68-default #1 SMP Tue Dec 1 16:21:37 UTC 2015 (ed01a9f) x86_64 x86_64 x86_64 GNU/Linux)
Nagios Stats 3.5.1
Copyright (c) 2003-2008 Ethan Galstad (http://www.nagios.org)
Last Modified: 08-30-2013
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /dev/shm/status.dat
Status File Age: 0d 0h 0m 4s
Status File Version: 3.5.1
Program Running Time: 0d 0h 21m 34s
Nagios PID: 56643
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 2280
Services Checked: 2237
Services Scheduled: 2191
Services Actively Checked: 2280
Services Passively Checked: 0
Total Service State Change: 0.000 / 26.640 / 0.080 %
Active Service Latency: 0.000 / 0.419 / 0.133 sec
Active Service Execution Time: 0.000 / 123.864 / 1.444 sec
Active Service State Change: 0.000 / 26.640 / 0.080 %
Active Services Last 1/5/15/60 min: 355 / 1598 / 1987 / 2140
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 2196 / 30 / 2 / 52
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 285
Hosts Checked: 285
Hosts Scheduled: 0
Hosts Actively Checked: 284
Host Passively Checked: 1
Total Host State Change: 0.000 / 10.260 / 0.036 %
Active Host Latency: 0.000 / 16.771 / 0.199 sec
Active Host Execution Time: 0.133 / 29.189 / 0.826 sec
Active Host State Change: 0.000 / 10.260 / 0.036 %
Active Hosts Last 1/5/15/60 min: 1 / 1 / 3 / 11
Passive Host Latency: 0.519 / 0.519 / 0.519 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 284 / 1 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 4 / 45 / 168
   Scheduled: 0 / 0 / 0
   On-demand: 4 / 45 / 168
   Parallel: 0 / 1 / 5
   Serial: 0 / 0 / 0
   Cached: 3 / 44 / 163
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 358 / 1668 / 5070
   Scheduled: 358 / 1668 / 5070
   On-demand: 0 / 0 / 0
   Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
[1455187515] Nagios 3.5.1 starting... (PID=56642)
[1455187515] Local time is Thu Feb 11 11:45:15 CET 2016
[1455187515] LOG VERSION: 2.0
[1455187515] livestatus: Livestatus 1.2.4p4 by Mathias Kettner. Socket: '/tmp/.watchit.livestatus'
[1455187515] livestatus: Please visit us at http://mathias-kettner.de/
[1455187515] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1455187515] livestatus: Please visit OMD at http://omdistro.org
[1455187515] livestatus: Finished initialization. Further log messages go to /opt/nagios/var/livestatus.log
[1455187515] Event broker module '/opt/nagios/lib/mk-livestatus/livestatus.o' initialized successfully.
[1455187515] Finished daemonizing... (New PID=56643)