Nagios Stops Executing Checks
Hi folks,
I'm an experienced Nagios Core admin and Linux engineer, but lately I've been facing an issue I can't explain with Nagios Core version 3.5.1 running on CentOS 6.4, and I need your help.
I've scoured the web for answers but couldn't find any, and I've also spent more than 12 hours trying to understand it, so I'm not here to waste anyone's time with basic configuration issues.
For the record, our company has 3 Nagios instances worldwide, all on the same version and the same hardware. 2 of them are working fine, but the third (which has the fewest hosts and services defined) periodically stops executing host and service checks. It works well for approx. 2-3 days, and then all of a sudden stops.
I know it stops working because no data is populated in the PNP graphs, and in the Nagios UI the "Last Time Checked" value is 12 hours behind.
I've read online that this version has a log rotation bug that causes the engine to stop functioning once the log is rotated, so I set "log_rotation_method=n" in nagios.cfg.
A few technical specs:
* We're using the 'livestatus' module to integrate Nagios with Thruk (unified dashboard).
* The Nagios engine/service is up and running the whole time, even when checks are not performed.
* The 'dmesg' command reveals only a little, concerning the 'check_nt' plugin, and I don't know if it's related:
check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]
* Nagios' log file contains the following lines:
[02-01-2014 06:30:08] livestatus: error: Client connection terminated while request still incomplete
[02-01-2014 06:30:08] livestatus: Timeout while reading query
[02-01-2014 06:22:51] Auto-save of retention data completed successfully.
Let me know if you need any other technical details in order to assist.
Thanks much for your time and efforts,
BR
Ido
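The "Last Time Checked" symptom described above can also be spotted from the command line. The following is only a minimal sketch: it assumes a Nagios 3.x status file in which each hoststatus/servicestatus block carries a "last_check=<epoch>" line, and the path /usr/local/nagios/var/status.dat is an assumption based on a default source install (see status_file in nagios.cfg). The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: report how long ago the most recent check was recorded.
# Assumption: status.dat contains lines like "<tab>last_check=1391438526".

# newest_check FILE: print the largest last_check timestamp found in FILE.
newest_check() {
    awk -F= '
        # Match only the last_check field, not next_check or similar names.
        $1 ~ /^[ \t]*last_check$/ && $2 + 0 > max { max = $2 + 0 }
        END { printf "%d\n", max }
    ' "$1"
}

# check_age FILE [NOW]: print seconds elapsed since the newest check.
# NOW defaults to the current epoch time; passing it explicitly helps testing.
check_age() {
    now=${2:-$(date +%s)}
    echo $(( now - $(newest_check "$1") ))
}
```

For example, `[ "$(check_age /usr/local/nagios/var/status.dat)" -gt 3600 ] && echo "no checks for over an hour"` could be run from cron on a separate watchdog host, since the Nagios daemon itself stays up while checks are stalled.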
Re: Nagios Stops Executing Checks
Lateralus wrote: check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]
Depending on how this segfault affects the nagios fork(), you could have an ever-growing number of zombie/stuck processes. This is solved by compiling check_nt from the newer nagios-plugins sources. Let's check for zombies - you can use "top" or the following awk:
Code:
ps aux | awk '{ print $8 " " $2 }' | grep -w Z
Your issues could be related to livestatus (we have seen it in the past). How are broker directives configured?
Code:
grep broker /usr/local/nagios/etc/nagios.cfg
Looking at your graphs for the problematic nagios server, do you notice any metric spikes right before failure?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Nagios Stops Executing Checks
abrist wrote:
  Lateralus wrote: check_nt[13010]: segfault at 0 ip 00000036f243a734 sp 00007fffd62aafa0 error 4 in libc-2.12.so[36f2400000+189000]
  Depending on how this segfault affects the nagios fork(), you could have an ever-growing number of zombie/stuck processes. This is solved by compiling check_nt from the newer nagios-plugins sources. Let's check for zombies - you can use "top" or the following awk:
  Code:
  ps aux | awk '{ print $8 " " $2 }' | grep -w Z
  Your issues could be related to livestatus (we have seen it in the past). How are broker directives configured?
  Code:
  grep broker /usr/local/nagios/etc/nagios.cfg
  Looking at your graphs for the problematic nagios server, do you notice any metric spikes right before failure?
Thanks for the reply, abrist!
Damn, I just noticed my 'check_nt' version is 1.4.13; I thought I had already upgraded it to 1.4.16. I'll do it now.
And yes, looking at the output of 'ps' I can see Nagios zombie processes every once in a while.
I think I've now caught the error in action! Here's the output of 'ps':
Code:
[root@US-Nagios-LP1 libexec]# ps aux | awk '{ print $8 " " $2 }' | grep -w Z
Z 24360
Z 24361
[root@US-Nagios-LP1 libexec]# ps -ef | grep nagios
root 1511 3787 0 03:18 pts/1 00:00:00 grep nagios
nagios 9607 1 0 Jan08 ? 00:01:56 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 24360 9607 0 Jan08 ? 00:00:00 [nagios] <defunct>
nagios 24361 9607 0 Jan08 ? 00:00:00 [nagios] <defunct>
nagios 24375 9607 0 Jan08 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Code:
[root@US-Nagios-LP1 libexec]# grep broker ../etc/nagios.cfg
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live
event_broker_options=-1
I'll recompile the Nagios plugins and let's hope that helps.
Thanks
Ido
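In the 'ps' output above, both defunct processes share the parent PID 9607, the Nagios daemon. A small variation on the zombie check groups zombies by parent so a growing pile under one process stands out. This is a rough sketch, not part of the original advice; the function name zombies_by_parent is hypothetical.

```shell
#!/bin/sh
# Sketch: count zombie processes per parent PID.

# zombies_by_parent: read "STAT PPID" pairs on stdin, print "PPID COUNT"
# for every parent that owns at least one process in state Z.
zombies_by_parent() {
    awk '$1 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Typical use (separate -o options keep the header line suppressed):
#   ps -e -o stat= -o ppid= | zombies_by_parent
```

Running it every few minutes and comparing the counts would show whether the number of defunct children keeps climbing before the checks stop.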
slansing
Re: Nagios Stops Executing Checks
Make sure that the plugins you recompile are compatible with 3.5.1. There have been some big changes to Core since the 4.0 release, and the plugins had to be altered to match them. We have seen some backwards-compatibility issues, but check_nt should be fine.
Re: Nagios Stops Executing Checks
slansing wrote: Make sure that the plugins you recompile are compatible with 3.5.1. There have been some big changes to Core since the 4.0 release, and the plugins had to be altered to match them. We have seen some backwards-compatibility issues, but check_nt should be fine.
Thanks for the heads up mate.
I've compiled the 1.4.16 version on 2 out of the 3 Nagios instances we have, and they've been working great.
However, I saw there's a new version, 1.5, but I didn't see any warnings or references regarding the versions of Nagios it supports.
Do you know if this version should work with my 3.5.1 Nagios?
Thanks
Ido
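Since an outdated check_nt went unnoticed on one instance, it may help to print the installed plugin versions on each server. The sketch below assumes the usual Nagios plugin convention that -V prints a version banner; the function name plugin_versions is hypothetical, and the libexec path in the usage line is inferred from the shell prompts earlier in the thread.

```shell
#!/bin/sh
# Sketch: show the version banner of each plugin passed as an argument.
# Assumption: the plugins follow the standard convention of accepting -V.

# plugin_versions PLUGIN...: print "<name>: <first line of -V output>".
plugin_versions() {
    for p in "$@"; do
        [ -x "$p" ] || continue          # skip anything that is not executable
        printf '%s: %s\n' "${p##*/}" "$("$p" -V 2>&1 | head -n 1)"
    done
}

# Typical use:
#   plugin_versions /usr/local/nagios/libexec/check_nt
```

Comparing the output across the three instances would confirm whether they really run the same plugin build.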
Re: Nagios Stops Executing Checks
Lateralus wrote: Do you know if this version should work with my 3.5.1 Nagios?
All of the above, including 1.5.x.
Re: Nagios Stops Executing Checks
We've been observing a similar issue with 2 of our OMD slave/poller servers, although with the latest version of OMD (1.10), which includes livestatus, pnp4nagios, mod_gearman.
Nagios version is 3.5.0
The Nagios process simply goes away, but the lock file remains. Nagios gives no indication (to our limited understanding) as to what triggered the process termination. We've enabled debugging in nagios, snippets below.
Would appreciate a pointer in the right direction to troubleshoot further or even better to resolve this.
Log snippets
var/log/nagios.log
Code:
[1391434926] Auto-save of retention data completed successfully.
[1391438526] Auto-save of retention data completed successfully.
var/nagios/debug.log
Code:
[1391438526.700324] [008.1] [pid=20221] ** Event Check Loop
[1391438526.700413] [008.1] [pid=20221] Next High Priority Event Time: Mon Feb 3 20:12:07 2014
[1391438526.700421] [008.1] [pid=20221] No low priority events are scheduled...
[1391438526.700427] [008.1] [pid=20221] Current/Max Service Checks: 0/0
[1391438526.700440] [008.2] [pid=20221] No events to execute at the moment. Idling for a bit...
[1391438526.700446] [001.0] [pid=20221] check_for_external_commands()
[1391438526.700453] [064.1] [pid=20221] Making callbacks (type 8)...
[1391438526.700459] [064.2] [pid=20221] Callback #1 (type 8) return code = 0
[1391438526.700468] [064.2] [pid=20221] Callback #2 (type 8) return code = 0
grep broker etc/nagios/nagios.d/
Code:
etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata /omd/sites/poller01/tmp/run/live debug=1
etc/nagios/nagios.d/mk-livestatus.cfg:event_broker_options=-1
etc/nagios/nagios.d/mod-gearman.cfg:event_broker_options=-1
etc/nagios/nagios.d/mod-gearman.cfg:broker_module=/omd/sites/poller01/lib/mod_gearman/mod_gearman.o config=/omd/sites/poller01/etc/mod-gearman/neb.cfg
etc/nagios/nagios.d/pnp4nagios.cfg:broker_module=/omd/sites/poller01/lib/npcdmod.o config_file=/omd/sites/poller01/etc/pnp4nagios/npcd.cfg
We don't have any zombie processes though.
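A process that disappears while its lock file stays behind can be detected by comparing the PID stored in the lock file with the running processes. The following is only a sketch under stated assumptions: the lock file is expected to contain nothing but the daemon's PID (as set by the lock_file directive in nagios.cfg), and the function name lock_state is hypothetical.

```shell
#!/bin/sh
# Sketch: classify a PID lock file as running, stale or missing.
# Assumption: the lock file holds just the PID of the daemon.

# lock_state FILE: print "running", "stale" or "missing".
lock_state() {
    [ -r "$1" ] || { echo missing; return; }
    pid=$(tr -cd '0-9' < "$1")           # keep digits only
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # The /proc fallback covers Linux processes owned by another user.
    if [ -n "$pid" ] && { kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; }; then
        echo running
    else
        echo stale
    fi
}
```

Called from cron with the site's lock file path, a "stale" result together with a timestamp would at least narrow down when the process terminated, which can then be matched against the debug log.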
Re: Nagios Stops Executing Checks
As this is really an OMD question (concerning their configuration of nagios, the mk suite, and their distribution packages), you really should take it to their support:
http://omdistro.org/contact
Re: Nagios Stops Executing Checks
OMD is no more than livestatus + pnp + mod_gearman along with Nagios: standard addons which a huge number of Nagios users seem to be using.
I would disagree that this is a configuration question isolated to OMD; rather, we are seeing Nagios die without logging clearly. Would love some insight and help, but it's your call if you wish to be in denial.
Re: Nagios Stops Executing Checks
ddoshy wrote: etc/nagios/nagios.d/mk-livestatus.cfg:broker_module=/omd/sites/poller01/lib/mk-livestatus/livestatus.o num_client_threads=20 pnp_path=/omd/sites/poller01/var/pnp4nagios/perfdata
What version of livestatus are you running? There were a number of issues with older versions that would cause nagios to die out. Try commenting out the livestatus lines in your nagios configuration, restart the nagios process, and see whether you can still get nagios to die out.
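Commenting out the livestatus module for such a test can be done by hand or with a one-line sed. The sketch below is an illustration only: the function name disable_livestatus is hypothetical, and it assumes the module is loaded through a broker_module line that names livestatus.o, as in the configurations posted in this thread.

```shell
#!/bin/sh
# Sketch: comment out broker_module lines that load livestatus.o.
# A backup of the original file is kept next to it as FILE.bak.

# disable_livestatus FILE: prefix matching broker_module lines with '#'.
disable_livestatus() {
    sed -i.bak '/^[[:space:]]*broker_module=.*livestatus\.o/ s/^/#/' "$1"
}

# Typical use, followed by a restart of the nagios process:
#   disable_livestatus etc/nagios/nagios.d/mk-livestatus.cfg
```

Other broker modules and the event_broker_options line are left untouched, so tools that depend on livestatus (Thruk, for instance) will lose their data source for the duration of the test.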