Nagios XI host check orphaned and duplicate nagios process

emartine · Post by **emartine** » Tue May 19, 2015 2:47 pm

I have a Nagios XI 2014R2.6 setup with a central Nagios XI server and 4 servers as gearman workers. At around 2015-05-19 11:58:35 I added 3 contacts and 3 contact groups, and added one contact to each contact group. At 2015-05-19 11:59:10 I used the Bulk Modification tool to add a contact group I created to a set of 3 hosts and one service from each of the 3 hosts. At 2015-05-19 11:59:51 I used the Bulk Modification tool to remove a contact from the same hosts and services that I had added the contact group to. At this point I left for lunch, and when I got back there were tons of hosts showing up as down with the message "host check orphaned, is the mod-gearman worker on queue 'host' running?"

I looked into the host checks for each host and noticed that some were missing freshness threshold, status info, flap detection, active checks and passive checks... they were set to the skip option.

We noticed this on the main server which seemed to be the culprit:
[root@ ~]# ps -ef | grep nagios.cfg
nagios 32974 1 1 11:59 ? 00:01:49 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33077 32974 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33336 1 2 11:59 ? 00:02:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33438 33336 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

We killed nagios:
[root@ ~]# killall -9 nagios

We restarted nagios:
[root@ ~]# service nagios restart

All host checks then started to come up OK so the issue is resolved. I then had to explain all the false alerts that went out...
My question is why did a second process spawn? Shouldn't it have been killed?

abrist · Post by **abrist** » Tue May 19, 2015 3:05 pm

emartine wrote:My question is why did a second process spawn?

The bulk mod tool has a feature to apply configuration so that the changes are deployed. This is probably when the second process was spawned. But:

emartine wrote:Shouldn't it have been killed?

Yes. The old nagios process should have been stopped. Could you go ahead and run apply configuration, and then check to see if a second parent process is spawned again?
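To make that check repeatable, here is a minimal sketch that pulls the parent daemons out of `ps -ef` output. The function name `nagios_parent_pids` is made up for illustration. It assumes the column layout shown in the listing above (PID in column 2, PPID in column 3) and that a daemonized parent has PPID 1, which may not hold on every system.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios XI): reads `ps -ef` style lines on
# stdin and prints the PID of every nagios daemon whose parent PID is 1.
# Assumption: columns are UID PID PPID C STIME TTY TIME CMD.
nagios_parent_pids() {
    awk '$3 == 1 && /bin\/nagios -d/ { print $2 }'
}
```

Usage would be something like `ps -ef | nagios_parent_pids`. More than one PID printed suggests that an old parent process is still running alongside the new one.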

emartine · Post by **emartine** » Tue May 19, 2015 3:18 pm

Apply created a new parent process and the old one is gone.

[root@ ~]# ps -ef | grep nagios.cfg
nagios 33351 1 8 15:15 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33455 33351 0 15:15 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Well... I just noticed that the monitoring engine died. So I had to use the monitoring engine status to start it back up. Now it seems to be ok.

I'm keeping an eye on this..

emartine · Post by **emartine** » Tue May 19, 2015 4:14 pm

So I still don't know what caused a second process to spawn and the old one not to die. I want to prevent this incident from happening again. Any ideas?

tmcdonald · Post by **tmcdonald** » Wed May 20, 2015 11:42 am

Let's get some baseline information:

OS and version
Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)
Any security settings in place?

Honestly it might have been a weird race condition, especially if it can't be reproduced.

Post by **lmiltchev** » Wed May 20, 2015 11:46 am

It's hard to say. Perhaps nagios didn't exit in a timely manner (issues with gearman workers?). You can try to increase the "for loop" in "/etc/init.d/nagios" to a value longer than your longest plugin's timeout. See our FAQ wiki article on this topic here:

http://support.nagios.com/wiki/index.ph ... ely_manner
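The wiki article is the reference for the actual init script. Purely to illustrate the idea of a bounded wait, here is a minimal sketch of a loop that polls until a process exits or a timeout is reached. This is not the code shipped in /etc/init.d/nagios; the function name `wait_for_exit` and its arguments are assumptions for the example.

```shell
#!/bin/sh
# Hypothetical helper, NOT the loop from /etc/init.d/nagios.
# Polls once per second until PID exits or TIMEOUT seconds have elapsed.
# Returns 0 if the process is gone, 1 if it is still running.
wait_for_exit() {
    pid=$1
    timeout=$2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID exists.
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    # Final check after the last sleep.
    ! kill -0 "$pid" 2>/dev/null
}
```

If the timeout is shorter than the time the daemon needs to shut down (for example because it is waiting on a long-running plugin), the loop gives up while the old process is still alive, which is why a larger value can help.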

emartine · Post by **emartine** » Wed May 20, 2015 6:29 pm

tmcdonald wrote:Let's get some baseline information:

OS and version

Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)

Any security settings in place?

Honestly it might have been a weird race condition, especially if it can't be reproduced.

RHEL 6 x64
Nagios Core 4.0.8
SELinux is disabled
Race condition? I have 130GB of RAM, Nagios XI sits on a 1TB fusion io card with a 48 proc Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz. The main server along with the gearman worker servers have this spec. They were built purposely as overkill so as to not experience any slowness so it should never be slow.

emartine · Post by **emartine** » Wed May 20, 2015 6:34 pm

If I make this change to the nagios run file, will it persist across updates?

Post by **tgriep** » Thu May 21, 2015 8:56 am

That file will be overwritten on the next update, so the change is not persistent.

emartine · Post by **emartine** » Mon Jun 22, 2015 11:27 am

Ok. I added two different servers back to back using the windows wizard. The first server hadn't had any checks and was pending before I completed the second one.

Here is my theory -
I am assuming that the first nagios process spawned was for the first server I added and the second nagios process spawned for the second server I added... and the first one stuck around because it was pending checks for the first server.
Killing the nagios process through the normal kill <parent ID> method doesn't work, and I am thinking that this is what nagios does when it attempts to run a new process again? Maybe it needs to do killall -9 nagios? Clicking apply won't kill the first process. It still sticks around. So is this a bug?

ps -ef | grep nagios.cfg
nagios 29934 1 5 10:46 ? 00:00:59 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 30039 29934 0 10:46 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36803 1 6 11:03 ? 00:00:07 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36949 36803 0 11:03 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 37790 37521 0 11:05 pts/0 00:00:00 grep nagios.cfg
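For comparing a plain kill with killall -9 on a single PID, a minimal sketch of "ask first, force later" could look like the following. The function name `stop_with_escalation` and the messages are invented for the example, and this is not how Nagios XI itself restarts the engine. SIGKILL gives the daemon no chance to clean up, so it is meant only as a last resort.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM to PID, poll for up to TIMEOUT seconds,
# then fall back to SIGKILL. Prints which step was needed.
stop_with_escalation() {
    pid=$1
    timeout=$2
    # If SIGTERM cannot be delivered, the process is already gone
    # (or belongs to another user).
    kill -TERM "$pid" 2>/dev/null || { echo "not running"; return 0; }
    i=0
    while [ "$i" -lt "$timeout" ]; do
        sleep 1
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "exited after SIGTERM"
            return 0
        fi
        i=$((i + 1))
    done
    # Last resort: SIGKILL cannot be caught, so no cleanup runs.
    kill -KILL "$pid" 2>/dev/null
    echo "needed SIGKILL"
}
```

Running it against the older parent PID from the listing (for example `stop_with_escalation 29934 60`) would show whether the process honors SIGTERM at all within the chosen timeout.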
