Nagios XI host check orphaned and duplicate nagios process

emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Post by emartine »

I have a Nagios XI 2014R2.6 setup with a central Nagios XI server and 4 servers acting as gearman workers. At around 2015-05-19 11:58:35 I added 3 contacts and 3 contact groups, and added one contact to each contact group. At 2015-05-19 11:59:10 I used the Bulk Modification tool to add one of the new contact groups to a set of 3 hosts and to one service on each of those hosts. At 2015-05-19 11:59:51 I used the Bulk Modification tool to remove a contact from the same hosts and services. At this point I left for lunch, and when I got back there were many hosts showing as down with the message "host check orphaned, is the mod-gearman worker on queue 'host' running?"

I looked at the host checks for each host and noticed that some were missing settings: freshness threshold, status info, flap detection, active checks and passive checks were set to the skip option.

We noticed this on the main server which seemed to be the culprit:
[root@ ~]# ps -ef | grep nagios.cfg
nagios 32974 1 1 11:59 ? 00:01:49 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33077 32974 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33336 1 2 11:59 ? 00:02:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33438 33336 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
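
The listing above shows two parent daemons (PPID 1), each with one child. As a minimal sketch for spotting this condition automatically, the helper below counts parent daemons in `ps -ef` output. The function name `count_nagios_parents` is hypothetical and not part of Nagios XI, and the column positions assume the `ps -ef` layout shown in this post.

```shell
# Hypothetical helper (not shipped with Nagios XI).
# Reads `ps -ef` style lines on stdin: UID PID PPID C STIME TTY TIME CMD ...
# Prints how many nagios daemons have init (PPID 1) as their parent.
# More than one suggests an old daemon was never stopped.
count_nagios_parents() {
    awk '$3 == 1 && $8 ~ /\/nagios$/ && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Usage on a live system:
#   ps -ef | count_nagios_parents
```

A result greater than 1 matches the situation in this thread; a healthy system should report exactly 1.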

We killed nagios:
[root@ ~]# killall -9 nagios

We restarted nagios:
[root@ ~]# service nagios restart

All host checks then started to come up OK so the issue is resolved. I then had to explain all the false alerts that went out...
My question is why did a second process spawn? Shouldn't it have been killed?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by abrist »

emartine wrote:My question is why did a second process spawn?
The bulk mod tool has a feature to apply configuration so that the changes are deployed. This is probably when the second process was spawned. But:
emartine wrote:Shouldn't it have been killed?
Yes. The old nagios process should have been stopped. Could you go ahead and run apply configuration, and then check to see if a second parent process is spawned again?
emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by emartine »

Apply created a new parent process and the old one is gone.

[root@ ~]# ps -ef | grep nagios.cfg
nagios 33351 1 8 15:15 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33455 33351 0 15:15 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Well... I just noticed that the monitoring engine died, so I had to use the Monitoring Engine Status page to start it back up. Now it seems to be OK.

I'm keeping an eye on this..
emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by emartine »

So I still don't know what caused a second process to spawn and the old one not die. I want to prevent this incident from happening again. Any ideas?
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by tmcdonald »

Let's get some baseline information:

OS and version
Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)
Any security settings in place?

Honestly it might have been a weird race condition, especially if it can't be reproduced.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by lmiltchev »

It's hard to say. Perhaps nagios didn't exit in a timely manner (issues with the gearman workers?). You can try increasing the "for loop" count in "/etc/init.d/nagios" to a value longer than your longest plugin timeout. See our FAQ wiki article on this topic here:

http://support.nagios.com/wiki/index.ph ... ely_manner
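
To illustrate the idea behind that loop, here is a minimal sketch of waiting a bounded number of seconds for a process to exit. It is not the contents of /etc/init.d/nagios; the function name `wait_for_exit` and its arguments are assumptions made for illustration.

```shell
# Hypothetical helper, NOT the actual code in /etc/init.d/nagios.
# wait_for_exit PID SECONDS
# Returns 0 if the process exited within SECONDS, 1 if it is still running.
# Assumes the caller may signal the PID (for example, running as root).
wait_for_exit() {
    pid=$1
    timeout=$2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 delivers no signal; it only tests whether the PID exists
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    # One last check after the final sleep
    ! kill -0 "$pid" 2>/dev/null
}
```

If the timeout is shorter than the time nagios needs to shut down, a stop-then-start sequence can launch the new daemon while the old one is still alive, which is consistent with the duplicate parents seen in this thread.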
emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by emartine »

tmcdonald wrote:Let's get some baseline information:

OS and version

Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)

Any security settings in place?

Honestly it might have been a weird race condition, especially if it can't be reproduced.
RHEL 6 x64
Nagios Core 4.0.8
SELinux is disabled
Race condition? I have 130GB of RAM, Nagios XI sits on a 1TB fusion io card with a 48 proc Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz. The main server along with the gearman worker servers have this spec. They were built purposely as overkill so as to not experience any slowness so it should never be slow.
emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by emartine »

If I make this change to the nagios init script, will it persist across updates?
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by tgriep »

That file will be overwritten on the next update, so the change is not persistent.
emartine
Posts: 660
Joined: Thu Dec 29, 2011 10:47 am

Re: Nagios XI host check orphaned and duplicate nagios proce

Post by emartine »

Ok. I added two different servers back to back using the windows wizard. The first server hadn't had any checks and was pending before I completed the second one.

Here is my theory -
I am assuming that the first nagios process was spawned for the first server I added and the second nagios process for the second server, and that the first process stuck around because it still had pending checks for the first server.
Killing the nagios process with a normal kill <parent ID> doesn't work, and I am thinking that this is what nagios itself does when it starts a new process. Maybe it needs to do killall -9 nagios? Clicking apply won't kill the first process; it still sticks around. So is this a bug?

ps -ef | grep nagios.cfg
nagios 29934 1 5 10:46 ? 00:00:59 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 30039 29934 0 10:46 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36803 1 6 11:03 ? 00:00:07 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36949 36803 0 11:03 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 37790 37521 0 11:05 pts/0 00:00:00 grep nagios.cfg
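
One middle ground between a plain kill and killall -9 is to send SIGTERM first and escalate to SIGKILL only if the process is still alive after a grace period. The sketch below shows that pattern; `stop_pid` is a hypothetical helper and does not describe what Nagios XI does when applying configuration.

```shell
# Hypothetical helper, not part of Nagios XI.
# stop_pid PID [GRACE_SECONDS]
# Sends SIGTERM, waits up to GRACE_SECONDS (default 10), then sends SIGKILL.
stop_pid() {
    pid=$1
    grace=${2:-10}
    # If the signal cannot be delivered, treat the process as already gone
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 only tests for existence
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Last resort: SIGKILL cannot be caught, so the process gets no chance to clean up
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Usage, with the stale parent PID from the listing above:
#   stop_pid 29934 30
```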