Nagios Support Forum

Posted: **Wed Jan 08, 2014 7:45 am**

Hi folks,

I'm an experienced Nagios Core admin and Linux engineer, but lately I've been facing an inexplicable issue with Nagios Core 3.5.1 running on CentOS 6.4, and I need your help.
I've scoured the web for answers but couldn't find any, and I've spent more than 12 hours trying to understand the problem, so I'm not here to waste anyone's time with basic configuration issues.

For the record, our company has 3 Nagios instances worldwide, all on the same version and the same hardware. 2 of them are working fine, but the third (which has the fewest hosts and services defined) periodically stops executing host and service checks. It works well for approx. 2-3 days, and then all of a sudden stops.

I know it stops working because no data is populated in the PNP graphs, and in the Nagios UI the "Last Time Checked" value is 12 hours behind.
I've read online that this version has a log rotation bug that causes the engine to stop functioning once the log is rotated, so I set "log_rotation_method=n" in nagios.cfg.
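
A stale "Last Time Checked" value can also be spotted from the command line, without waiting for the graphs to go flat. The following is only a minimal sketch: it assumes the status file records each result as a `last_check=<epoch>` line and that it lives at /usr/local/nagios/var/status.dat (both are assumptions about this install), and `newest_check_age` is a helper name made up for illustration.

```shell
# Print how many seconds ago the most recent check was recorded.
# Assumption: the status file holds lines of the form "last_check=<epoch>".
newest_check_age() {
    status_file=$1
    now=${2:-$(date +%s)}          # optional second argument overrides "now"
    awk -F= -v now="$now" '
        $1 ~ /^[ \t]*last_check$/ && $2 + 0 > newest { newest = $2 + 0 }
        END {
            if (newest > 0) print now - newest
            else print "no checks recorded"
        }
    ' "$status_file"
}

# Hypothetical default path for a source install; adjust to your status_file setting.
status=/usr/local/nagios/var/status.dat
if [ -r "$status" ]; then
    newest_check_age "$status"
fi
```

If the printed age keeps growing well past the longest check interval, the scheduler has probably stopped running checks even though the daemon is still up.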

A few technical specs:
* We're using the 'livestatus' module to integrate Nagios with Thruk (unified dashboard).

* The Nagios engine/service is up and running the whole time, even when checks are not performed.

* The 'dmesg' command reveals only a little, regarding the 'check_nt' plugin, but I don't know if that's related:
check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]

* Nagios' log file contains the following lines:
[02-01-2014 06:30:08] livestatus: error: Client connection terminated while request still incomplete
[02-01-2014 06:30:08] livestatus: Timeout while reading query
[02-01-2014 06:22:51] Auto-save of retention data completed successfully.
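
Since livestatus is already loaded, it can be asked directly which services have not been checked recently. This is a rough sketch, assuming the `unixcat` helper that ships with mk-livestatus is installed and that the socket is at /usr/local/nagios/var/rw/live; `stale_services_query` is a made-up helper name. Note that a livestatus query ends with an empty line.

```shell
# Build a livestatus query for services whose last check is older than a cutoff.
# The trailing empty line marks the end of the query.
stale_services_query() {
    cutoff=$1    # epoch seconds; services last checked before this are listed
    printf 'GET services\nColumns: host_name description last_check\nFilter: last_check < %s\n\n' "$cutoff"
}

# Assumed socket path; take the real one from your broker_module line.
socket=/usr/local/nagios/var/rw/live
if command -v unixcat >/dev/null 2>&1 && [ -S "$socket" ]; then
    # List everything that has not been checked in the last hour.
    stale_services_query "$(( $(date +%s) - 3600 ))" | unixcat "$socket"
fi
```

When checks are running normally, the list should be empty or limited to services with very long check intervals.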

Let me know if you need any other technical details in order to assist.
Thanks much for your time and efforts,

BR
Ido

Posted: **Wed Jan 08, 2014 3:40 pm**

Lateralus wrote:check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]

Depending on how this segfault affects the nagios fork(), you could have an ever-growing number of zombie/stuck processes. This is solved by compiling check_nt from the newer nagios-plugins sources. Let's check for zombies - you can use "top" or the following awk:

Code: Select all

ps aux | awk '{ print $8 " " $2 }' | grep -w Z
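
A variant that also prints the parent PID can help show which process is failing to reap its children. This is only a sketch, assuming a procps-style `ps` that accepts `-eo stat,pid,ppid,comm`; `list_zombies` is a helper name invented here. It matches any state that starts with Z, so variants such as "Zs" are caught as well.

```shell
# Print "PID PPID COMMAND" for every process whose state starts with Z.
# Reads "STAT PID PPID COMMAND" rows on stdin; the header row never matches.
list_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

ps -eo stat,pid,ppid,comm | list_zombies
```

If every zombie shares the same PPID and that PID is the main nagios daemon, the daemon is the one not collecting its finished children.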

Your issue could also be related to livestatus (we have seen that in the past). How are the broker directives configured?

Code: Select all

grep broker  /usr/local/nagios/etc/nagios.cfg

Looking at your graphs for the problematic nagios server, do you notice any metric spikes right before failure?

Posted: **Thu Jan 09, 2014 6:24 am**

Thanks for the reply, abrist!

Damn, I just noticed my 'check_nt' version is 1.4.13; I thought I had already upgraded it to 1.4.16. I'll do it now.
And yes, looking at the output of 'ps', I can see Nagios zombie processes every once in a while.

I think I've now caught the error in action! Here's the output of 'ps':

Code: Select all

[root@US-Nagios-LP1 libexec]# ps aux | awk '{ print $8 " " $2 }' | grep -w Z
Z 24360
Z 24361
[root@US-Nagios-LP1 libexec]# ps -ef | grep nagios
root      1511  3787  0 03:18 pts/1    00:00:00 grep nagios
nagios    9607     1  0 Jan08 ?        00:01:56 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   24360  9607  0 Jan08 ?        00:00:00 [nagios] <defunct>
nagios   24361  9607  0 Jan08 ?        00:00:00 [nagios] <defunct>
nagios   24375  9607  0 Jan08 ?        00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

As for the broker module, it's configured identically on all Nagios servers we have, and no, there are no spikes on the machine prior to the crash.

Code: Select all

[root@US-Nagios-LP1 libexec]# grep broker ../etc/nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live
event_broker_options=-1

Right now the Nagios engine is not performing active checks.
I'll recompile the Nagios plugins and let's hope that helps.

Thanks
Ido

Posted: **Thu Jan 09, 2014 10:48 am**

Make sure that you compile plugins that are compatible with 3.5.1; some big changes have been made to Core since the 4.0 release, and the plugins had to be altered to match. We have seen some backwards compatibility issues, but check_nt should be fine.

Posted: **Thu Jan 09, 2014 2:55 pm**

slansing wrote:Make sure that you compile plugins that are compatible with 3.5.1; some big changes have been made to Core since the 4.0 release, and the plugins had to be altered to match. We have seen some backwards compatibility issues, but check_nt should be fine.

Thanks for the heads up mate.
I've compiled the 1.4.16 version on 2 out of the 3 Nagios instances we have, and they've been working great.

However, I saw there's a new version, 1.5, but I didn't see any warnings or references regarding the Nagios versions it supports.
Do you know if this version should work with my 3.5.1 Nagios?

Thanks
Ido

Posted: **Thu Jan 09, 2014 3:23 pm**

Lateralus wrote:Do you know if this version should work with my 3.5.1 Nagios?

All of the above should work with 3.5.1, including 1.5.x.

Posted: **Mon Feb 03, 2014 11:56 am**

We've been observing a similar issue with 2 of our OMD slave/poller servers, although with the latest version of OMD (1.10), which includes livestatus, pnp4nagios, and mod_gearman.

Nagios version is 3.5.0

The Nagios process simply goes away, but the lock file remains. Nagios gives no indication (to our limited understanding) as to what triggered the process termination. We've enabled debugging in Nagios; snippets are below.
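
A leftover lock file can be detected automatically by comparing the PID stored in it with the running processes. The sketch below assumes the lock file contains just the daemon's PID; `lock_is_stale` and the path in the example call are hypothetical, so substitute the value of your own lock_file setting. It should run as root or as the Nagios user, because `kill -0` also fails when the process exists but belongs to someone else.

```shell
# Return 0 if the lock file names a PID that is no longer running,
# 1 if that process is alive, 2 if the file is missing or has no usable PID.
lock_is_stale() {
    lock_file=$1
    [ -r "$lock_file" ] || return 2
    pid=$(tr -d '[:space:]' < "$lock_file")
    case $pid in
        ''|*[!0-9]*) return 2 ;;      # empty or not a plain number
    esac
    if kill -0 "$pid" 2>/dev/null; then
        return 1                      # signal 0 delivered: process exists
    fi
    return 0
}

# Hypothetical path; use the lock_file value from your own configuration.
if lock_is_stale /path/to/nagios.lock; then
    echo "Nagios is gone but its lock file is still there"
fi
```

Running this from cron with a timestamped message would at least narrow down when the process disappears.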

Would appreciate a pointer in the right direction to troubleshoot further or even better to resolve this.

Log snippets

var/log/nagios.log

Code: Select all

[1391434926] Auto-save of retention data completed successfully.
[1391438526] Auto-save of retention data completed successfully.
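
The bracketed values are Unix epoch seconds, which makes it hard to line them up with other logs by eye. Here is a small sketch for converting them, assuming GNU date (for the `-d @<epoch>` form); `humanize_nagios_log` is a name made up for this example, and lines whose bracket does not hold a plain integer are passed through unchanged.

```shell
# Rewrite "[<epoch>] message" lines with a readable UTC timestamp.
# Assumes GNU date; anything without a plain integer in the first
# bracket pair is printed as-is.
humanize_nagios_log() {
    while IFS= read -r line; do
        case $line in
            \[*\]*)
                stamp=${line#\[}        # drop the leading "["
                stamp=${stamp%%\]*}     # keep the text before the first "]"
                ;;
            *)
                stamp=
                ;;
        esac
        case $stamp in
            ''|*[!0-9]*)
                printf '%s\n' "$line"
                ;;
            *)
                printf '[%s]%s\n' "$(date -u -d "@$stamp" '+%Y-%m-%d %H:%M:%S UTC')" "${line#*\]}"
                ;;
        esac
    done
}

printf '%s\n' '[1391438526] Auto-save of retention data completed successfully.' | humanize_nagios_log
# prints: [2014-02-03 14:42:06 UTC] Auto-save of retention data completed successfully.
```

To convert a whole file, redirect it into the function, for example `humanize_nagios_log < var/log/nagios.log`.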

var/nagios/debug.log

Code: Select all

[1391438526.700324] [008.1] [pid=20221] ** Event Check Loop
[1391438526.700413] [008.1] [pid=20221] Next High Priority Event Time: Mon Feb  3 20:12:07 2014
[1391438526.700421] [008.1] [pid=20221] No low priority events are scheduled...
[1391438526.700427] [008.1] [pid=20221] Current/Max Service Checks: 0/0
[1391438526.700440] [008.2] [pid=20221] No events to execute at the moment.  Idling for a bit...
[1391438526.700446] [001.0] [pid=20221] check_for_external_commands()
[1391438526.700453] [064.1] [pid=20221] Making callbacks (type 8)...
[1391438526.700459] [064.2] [pid=20221] Callback #1 (type 8) return code = 0
[1391438526.700468] [064.2] [pid=20221] Callback #2 (type 8) return code = 0

grep -r broker etc/nagios/nagios.d/

Code: Select all

etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata /omd/sites/poller01/tmp/run/live debug=1
etc/nagios/nagios.d/mk-livestatus.cfg:event_broker_options=-1

etc/nagios/nagios.d/mod-gearman.cfg:event_broker_options=-1
etc/nagios/nagios.d/mod-gearman.cfg:broker_module=/omd/sites/poller01/lib/mod_gearman/mod_gearman.o config=/omd/sites/poller01/etc/mod-gearman/neb.cfg

etc/nagios/nagios.d/pnp4nagios.cfg:broker_module=/omd/sites/poller01/lib/npcdmod.o config_file=/omd/sites/poller01/etc/pnp4nagios/npcd.cfg

We don't have any zombie processes though.

Posted: **Mon Feb 03, 2014 1:06 pm**

As this is really an OMD question (concerning their configuration of Nagios, the mk suite, and their distribution packages), you really should take this question to their support:
http://omdistro.org/contact

Posted: **Mon Feb 03, 2014 1:16 pm**

OMD is no more than livestatus + pnp + mod_gearman along with Nagios: standard addons that a huge number of Nagios users seem to be using.

I would disagree that this is a configuration question isolated to OMD; rather, we are seeing Nagios die without logging clearly. Would love some insight and help, but it's your call if you wish to be in denial.

Posted: **Mon Feb 03, 2014 1:20 pm**

ddoshy wrote:etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata

What version of livestatus are you running? There were a number of issues with older versions that would cause Nagios to die. Try commenting out the livestatus lines in your Nagios configuration, restart the Nagios process, and see whether you can still get Nagios to die.
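
Commenting out the livestatus lines could be scripted roughly as follows. This is a sketch only: `disable_livestatus` is a made-up helper, and it assumes the module is loaded through a `broker_module=` line whose path contains `livestatus.o`, as in the configurations posted in this thread.

```shell
# Comment out every broker_module line that loads livestatus.o.
# Reads the named file, or stdin when no file is given, and writes to stdout.
disable_livestatus() {
    sed -E 's/^([[:space:]]*broker_module=.*livestatus\.o.*)$/#\1/' "$@"
}

# Demonstration on two lines taken from this thread.
printf '%s\n' \
    'broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live' \
    'event_broker_options=-1' | disable_livestatus
```

Write the result to a new file rather than editing in place, and verify the configuration (for example with `nagios -v` on the main config file) before restarting.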

Nagios Stops Executing Checks
