Nagios Stops Executing Checks
Posted: Wed Jan 08, 2014 7:45 am
by Lateralus
Hi folks,
I'm an experienced Nagios Core admin and Linux engineer, but lately I've been facing an issue I can't explain with Nagios Core 3.5.1 running on CentOS 6.4, and I need your help.
I've scoured the web for answers but couldn't find any, and I've spent more than 12 hours trying to understand it, so I'm not here to waste anyone's time with basic configuration issues.
For the record, our company has 3 Nagios instances worldwide, all on the same version and the same hardware. Two of them are working fine, but the third (which has the fewest hosts and services defined) periodically stops executing host and service checks. It works well for approx. 2-3 days, and then all of a sudden stops.
I know it stops working because no data is populated in the PNP graphs, and on the Nagios UI the "Last Time Checked" value is 12 hours behind.
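For anyone wanting to catch this stall without watching the UI, here is a minimal sketch that reads the newest "last_check" timestamp out of Nagios' status file and reports how old it is. The function names are made up for illustration, and the status file path is an assumption: it is whatever status_file= points to in nagios.cfg (commonly /usr/local/nagios/var/status.dat on a source install).

```shell
# Print the most recent last_check epoch found in a Nagios status file.
# $1 = path to the status file (assumption: the status_file= value from nagios.cfg).
newest_last_check() {
    awk -F= '/^[[:space:]]*last_check=/ { if ($2 + 0 > max) max = $2 + 0 }
             END { printf "%d\n", max }' "$1"
}

# Print how many seconds have passed since that most recent check.
seconds_since_last_check() {
    now=$(date +%s)
    echo $(( now - $(newest_last_check "$1") ))
}
```

Something like `seconds_since_last_check /usr/local/nagios/var/status.dat` run from cron, with an alert when the value exceeds a few check intervals, should flag the stall early. The threshold is a judgment call for your environment.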
I've read online that this version has a log rotation bug that causes the engine to stop functioning once the log is rotated, so I set "log_rotation_method=n" in nagios.cfg.
A few technical specs:
* We're using the 'livestatus' module to integrate Nagios with Thruk (unified dashboard).
* The Nagios engine/service is up and running the whole time, even when checks are not performed.
* 'dmesg' shows only the following about the 'check_nt' plugin, and I don't know whether it's related:
check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]
* Nagios' log file contains the following lines:
[02-01-2014 06:30:08] livestatus: error: Client connection terminated while request still incomplete
[02-01-2014 06:30:08] livestatus: Timeout while reading query
[02-01-2014 06:22:51] Auto-save of retention data completed successfully.
Let me know if you need any other technical details in order to assist.
Thanks much for your time and efforts,
BR
Ido
Re: Nagios Stops Executing Checks
Posted: Wed Jan 08, 2014 3:40 pm
by abrist
Lateralus wrote: check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]
Depending on how this segfault affects the nagios fork(), you could have an ever-growing number of zombie/stuck processes. This is solved by compiling check_nt from the newer nagios-plugins sources. Let's check for zombies. You can use "top" or the following awk:
Code: Select all
ps aux | awk '{ print $8 " " $2 }' | grep -w Z
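As a possible variation on the one-liner above, this sketch also prints each zombie's parent PID, which helps confirm whether the defunct processes belong to the nagios daemon. It matches any state beginning with Z (for example "Zs"), which a whole-word grep for "Z" would miss. The function names are hypothetical, and it assumes a procps-style ps that accepts repeated -o options, as on CentOS 6.

```shell
# Keep only rows whose process state starts with Z.
# Input columns: STAT PID PPID COMMAND; output: PID PPID COMMAND.
filter_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# List zombies on this host (an empty header per column suppresses the header line).
list_zombies() {
    ps -e -o stat= -o pid= -o ppid= -o comm= | filter_zombies
}
```

Running `list_zombies` prints one line per zombie. If the PPID column is the nagios daemon's PID, the daemon is not reaping its children.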
Your issues could be related to livestatus (we have seen it in the past). How are broker directives configured?
Code: Select all
grep broker /usr/local/nagios/etc/nagios.cfg
Looking at your graphs for the problematic nagios server, do you notice any metric spikes right before failure?
Re: Nagios Stops Executing Checks
Posted: Thu Jan 09, 2014 6:24 am
by Lateralus
Thanks for the reply, abrist!
Damn, I just noticed my 'check_nt' version is 1.4.13; I thought I had already upgraded it to 1.4.16. I'll do it now.
And yes, looking at the output of 'ps', I can see Nagios zombie processes every once in a while.
I think I've now caught the error in action! Here's the output of 'ps':
Code: Select all
[root@US-Nagios-LP1 libexec]# ps aux | awk '{ print $8 " " $2 }' | grep -w Z
Z 24360
Z 24361
[root@US-Nagios-LP1 libexec]# ps -ef | grep nagios
root 1511 3787 0 03:18 pts/1 00:00:00 grep nagios
nagios 9607 1 0 Jan08 ? 00:01:56 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 24360 9607 0 Jan08 ? 00:00:00 [nagios] <defunct>
nagios 24361 9607 0 Jan08 ? 00:00:00 [nagios] <defunct>
nagios 24375 9607 0 Jan08 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
As for the broker module, it's configured identically on all Nagios servers we have, and no, there are no spikes on the machine prior to the crash.
Code: Select all
[root@US-Nagios-LP1 libexec]# grep broker ../etc/nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live
event_broker_options=-1
Right now the Nagios engine is not performing active checks.
I'll recompile the Nagios plugins and hope that helps.
Thanks
Ido
Re: Nagios Stops Executing Checks
Posted: Thu Jan 09, 2014 10:48 am
by slansing
Make sure that the plugins you recompile are compatible with 3.5.1. Some big changes were made to Core in the 4.0 release, and the plugins had to be altered to match. We have seen some backwards compatibility issues, but check_nt should be fine.
Re: Nagios Stops Executing Checks
Posted: Thu Jan 09, 2014 2:55 pm
by Lateralus
Thanks for the heads up mate.
I've compiled version 1.4.16 on 2 of our 3 Nagios instances, and they've been working great.
However, I saw there's a new version, 1.5, but I didn't see any warnings or references regarding the Nagios versions it supports.
Do you know if this version should work with my 3.5.1 Nagios?
Thanks
Ido
Re: Nagios Stops Executing Checks
Posted: Thu Jan 09, 2014 3:23 pm
by abrist
Lateralus wrote: Do you know if this version should work with my 3.5.1 Nagios?
All of the above, including 1.5.x.
Re: Nagios Stops Executing Checks
Posted: Mon Feb 03, 2014 11:56 am
by ddoshy
We've been observing a similar issue with 2 of our OMD slave/poller servers, although with the latest version of OMD (1.10), which includes livestatus, pnp4nagios, and mod_gearman.
The Nagios version is 3.5.0.
The Nagios process simply goes away, but the lock file remains. Nagios gives no indication (to our limited understanding) as to what triggered the process termination. We've enabled debugging in nagios, snippets below.
Would appreciate a pointer in the right direction to troubleshoot further or even better to resolve this.
Log snippets
var/log/nagios.log
Code: Select all
[1391434926] Auto-save of retention data completed successfully.
[1391438526] Auto-save of retention data completed successfully.
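When lining these entries up against other logs, it can help to turn the bracketed epoch timestamps into readable dates. A minimal sketch follows, assuming GNU date (the `-d @epoch` form is not portable to every platform). The function names are hypothetical.

```shell
# Convert an epoch value to a UTC timestamp (assumes GNU date).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Rewrite lines of the form "[epoch] message" read from stdin.
# Lines without a plain integer epoch in brackets are passed through unchanged.
humanize_log() {
    while IFS= read -r line; do
        ts=$(printf '%s\n' "$line" | sed -n 's/^\[\([0-9][0-9]*\)\].*/\1/p')
        if [ -n "$ts" ]; then
            printf '[%s] %s\n' "$(epoch_to_utc "$ts")" "${line#*\] }"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

For example, `humanize_log < var/log/nagios.log` would show the second entry above as 2014-02-03 14:42:06 UTC, one hour after the first.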
var/nagios/debug.log
Code: Select all
[1391438526.700324] [008.1] [pid=20221] ** Event Check Loop
[1391438526.700413] [008.1] [pid=20221] Next High Priority Event Time: Mon Feb 3 20:12:07 2014
[1391438526.700421] [008.1] [pid=20221] No low priority events are scheduled...
[1391438526.700427] [008.1] [pid=20221] Current/Max Service Checks: 0/0
[1391438526.700440] [008.2] [pid=20221] No events to execute at the moment. Idling for a bit...
[1391438526.700446] [001.0] [pid=20221] check_for_external_commands()
[1391438526.700453] [064.1] [pid=20221] Making callbacks (type 8)...
[1391438526.700459] [064.2] [pid=20221] Callback #1 (type 8) return code = 0
[1391438526.700468] [064.2] [pid=20221] Callback #2 (type 8) return code = 0
grep -r broker etc/nagios/nagios.d/
Code: Select all
etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata /omd/sites/poller01/tmp/run/live debug=1
etc/nagios/nagios.d/mk-livestatus.cfg:event_broker_options=-1
etc/nagios/nagios.d/mod-gearman.cfg:event_broker_options=-1
etc/nagios/nagios.d/mod-gearman.cfg:broker_module=/omd/sites/poller01/lib/mod_gearman/mod_gearman.o config=/omd/sites/poller01/etc/mod-gearman/neb.cfg
etc/nagios/nagios.d/pnp4nagios.cfg:broker_module=/omd/sites/poller01/lib/npcdmod.o config_file=/omd/sites/poller01/etc/pnp4nagios/npcd.cfg
We don't have any zombie processes though.
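Since the process disappears while the lock file stays behind, one way to notice the moment it happens is to compare the PID stored in the lock file against the running processes. This is only a sketch: the function name is made up, the lock file path is whatever lock_file= is set to in your Nagios configuration, and `kill -0` only gives a reliable answer when run as root or as the user that owns the nagios process.

```shell
# Report whether the PID recorded in a lock file is still alive.
# $1 = path to the lock file (assumption: the lock_file= value from nagios.cfg).
# Prints "running PID", "stale PID", "empty" or "missing".
check_lock() {
    lock_file=$1
    if [ ! -r "$lock_file" ]; then
        echo "missing"
        return 2
    fi
    # Keep digits only; this drops the trailing newline.
    pid=$(tr -cd '0-9' < "$lock_file")
    if [ -z "$pid" ]; then
        echo "empty"
        return 2
    fi
    # Signal 0 tests for existence without sending anything to the process.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running $pid"
        return 0
    fi
    echo "stale $pid"
    return 1
}
```

Run every minute from cron, a "stale" result narrows down the time of death, which can then be matched against dmesg and /var/log/messages for a segfault or OOM kill.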
Re: Nagios Stops Executing Checks
Posted: Mon Feb 03, 2014 1:06 pm
by abrist
As this is really an OMD question (concerning their configuration of nagios, the mk suite, and their distribution packages), you really should take this question to their support:
http://omdistro.org/contact
Re: Nagios Stops Executing Checks
Posted: Mon Feb 03, 2014 1:16 pm
by ddoshy
OMD is no more than livestatus + pnp + mod_gearman along with Nagios: standard addons which a huge number of Nagios users seem to be using.
I would disagree that this is a configuration question isolated to OMD; rather, we are seeing Nagios die without logging clearly. Would love some insight and help, but it's your call if you wish to be in denial.
Re: Nagios Stops Executing Checks
Posted: Mon Feb 03, 2014 1:20 pm
by abrist
ddoshy wrote: etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata
What version of livestatus are you running? There were a number of issues with older versions that would cause nagios to die. Try commenting out the livestatus broker_module line in your Nagios configuration (etc/nagios/nagios.d/mk-livestatus.cfg in your output), restart the nagios process, and see whether nagios still dies.
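A rough sketch of the commenting-out step, in case it helps. The function name is hypothetical, the file to pass in is whichever config file holds the livestatus broker_module line, and the sketch deliberately does not restart nagios. It keeps a .bak copy so the change is easy to revert.

```shell
# Comment out any broker_module line that loads livestatus.
# $1 = config file to edit; a backup is written to "$1.bak".
disable_livestatus() {
    cfg=$1
    sed -i.bak '/^broker_module=.*livestatus/ s/^/#/' "$cfg"
}
```

After running it against the right file, restart nagios the usual way for your install and watch whether the process still goes away. Note that anything depending on the livestatus socket (Thruk, for instance) will stop getting data while the module is disabled.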