I have a Nagios XI 2014R2.6 setup with a central Nagios XI server and 4 servers acting as gearman workers. At around 2015-05-19 11:58:35 I added 3 contacts and 3 contact groups, and added one contact to each contact group. At 2015-05-19 11:59:10 I used the Bulk Modification tool to add a contact group I had created to a set of 3 hosts and one service from each of the 3 hosts. At 2015-05-19 11:59:51 I used the Bulk Modification tool to remove a contact from the same hosts and services that I had added the contact group to. At this point I left for lunch, and when I got back there were tons of hosts showing as down with the message "host check orphaned, is the mod-gearman worker on queue 'host' running?"
I looked into the host checks for each host and noticed that on some of them freshness threshold, status info, flap detection, active checks and passive checks were missing or set to the skip option.
We noticed this on the main server which seemed to be the culprit:
[root@ ~]# ps -ef | grep nagios.cfg
nagios 32974 1 1 11:59 ? 00:01:49 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33077 32974 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33336 1 2 11:59 ? 00:02:54 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33438 33336 0 11:59 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
We killed nagios:
[root@ ~]#killall -9 nagios
We restarted nagios:
[root@ ~]#service nagios restart
All host checks then started to come up OK so the issue is resolved. I then had to explain all the false alerts that went out...
My question is why did a second process spawn? Shouldn't it have been killed?
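The duplicate is visible in the ps output above as two nagios processes whose parent is PID 1. A minimal sketch for spotting that condition, assuming the daemon command line shown in this post; the function name count_nagios_parents is made up for illustration and is not part of Nagios XI:

```shell
#!/bin/sh
# Hypothetical helper: count top-level nagios daemons.
# Reads "PID PPID ARGS..." lines on stdin (the format of `ps -eo pid,ppid,args`)
# and counts lines where the parent is PID 1, the binary path ends in /nagios,
# and the command line mentions nagios.cfg.
count_nagios_parents() {
    awk '$2 == 1 && $3 ~ /\/nagios$/ && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Usage on a live system (anything other than 1 deserves a closer look):
#   ps -eo pid,ppid,args | count_nagios_parents
```

Each healthy daemon also has one child with the same command line, which is why the sketch filters on the parent PID rather than counting every match.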
Nagios XI host check orphaned and duplicate nagios process
Re: Nagios XI host check orphaned and duplicate nagios proce
emartine wrote: My question is why did a second process spawn?
The bulk mod tool has a feature to apply configuration so that the changes are deployed. This is probably when the second process was spawned.
emartine wrote: Shouldn't it have been killed?
Yes. The old nagios process should have been stopped. Could you go ahead and run apply configuration, and then check to see if a second parent process is spawned again?
Former Nagios employee
Re: Nagios XI host check orphaned and duplicate nagios proce
Apply created a new parent process and the old one is gone.
[root@ ~]# ps -ef | grep nagios.cfg
nagios 33351 1 8 15:15 ? 00:00:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 33455 33351 0 15:15 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Well... I just noticed that the monitoring engine died, so I had to use the monitoring engine status to start it back up. Now it seems to be OK.
I'm keeping an eye on this..
Re: Nagios XI host check orphaned and duplicate nagios proce
So I still don't know what caused a second process to spawn and the old one not die. I want to prevent this incident from happening again. Any ideas?
Re: Nagios XI host check orphaned and duplicate nagios proce
Let's get some baseline information:
OS and version
Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)
Any security settings in place?
Honestly it might have been a weird race condition, especially if it can't be reproduced.
Former Nagios employee
Re: Nagios XI host check orphaned and duplicate nagios proce
It's hard to say. Perhaps nagios didn't exit in a timely manner (issues with gearman workers?). You can try increasing the "for loop" in "/etc/init.d/nagios" to a value longer than your longest plugin's timeout. See our FAQ wiki article on this topic here:
http://support.nagios.com/wiki/index.ph ... ely_manner
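The suggestion above amounts to giving the daemon more time to exit before the stop logic gives up. The loop shipped in /etc/init.d/nagios varies between versions and is not reproduced here; the following is only a sketch of the idea, with a made-up function name (wait_for_exit) and a timeout you would choose yourself:

```shell
#!/bin/sh
# Hypothetical helper, not the code from /etc/init.d/nagios.
# wait_for_exit PID TIMEOUT_SECONDS
# Polls once per second until PID is gone or the timeout elapses.
# Returns 0 if the process has exited, 1 if it is still running.
wait_for_exit() {
    pid=$1
    timeout=$2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID exists
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    # One last check after the final sleep
    ! kill -0 "$pid" 2>/dev/null
}
```

With a timeout longer than the longest plugin timeout, a daemon that is still waiting on a running check gets the chance to finish before a new one is started.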
Re: Nagios XI host check orphaned and duplicate nagios proce
tmcdonald wrote: Let's get some baseline information:
OS and version
Core version (should be 4.0.8 but check with /usr/local/nagios/bin/nagios --version)
Any security settings in place?
Honestly it might have been a weird race condition, especially if it can't be reproduced.
RHEL 6 x64
Nagios Core 4.0.8
SELinux is disabled
Race condition? I have 130GB of RAM, and Nagios XI sits on a 1TB Fusion-io card with a 48 proc Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz. The main server and the gearman worker servers all have this spec. They were purposely built as overkill so that we would never experience any slowness.
Re: Nagios XI host check orphaned and duplicate nagios proce
If I make this change to the nagios run file, will it persist across updates?
Re: Nagios XI host check orphaned and duplicate nagios proce
That file will be overwritten on the next update, so the change is not persistent.
Re: Nagios XI host check orphaned and duplicate nagios proce
Ok. I added two different servers back to back using the windows wizard. The first server hadn't had any checks and was pending before I completed the second one.
Here is my theory -
I am assuming that the first nagios process spawned was for the first server I added and the second nagios process spawned for the second server I added... and the first one stuck around because the first one was pending checks for the first server.
Killing the nagios process with the normal kill <parent ID> method doesn't work, and I am thinking that this is what nagios does when it attempts to start a new process? Maybe it needs to do killall -9 nagios? Clicking apply won't kill the first process; it still sticks around. So is this a bug?
ps -ef | grep nagios.cfg
nagios 29934 1 5 10:46 ? 00:00:59 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 30039 29934 0 10:46 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36803 1 6 11:03 ? 00:00:07 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36949 36803 0 11:03 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 37790 37521 0 11:05 pts/0 00:00:00 grep nagios.cfg
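The question above is whether a plain kill should be followed by a forced kill when the old parent does not go away. A generic sketch of that escalation, assuming nothing about what Nagios XI's apply configuration actually does; stop_then_kill and the grace period are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios XI.
# stop_then_kill PID GRACE_SECONDS
# Sends SIGTERM, waits up to GRACE_SECONDS, then falls back to SIGKILL.
# Returns 0 if the process was gone without needing SIGKILL, 1 if SIGKILL was sent.
stop_then_kill() {
    pid=$1
    grace=$2
    # If SIGTERM cannot be delivered, the process is already gone
    kill -TERM "$pid" 2>/dev/null || { echo "not running"; return 0; }
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || { echo "exited after TERM"; return 0; }
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        # SIGKILL cannot be caught, so the process gets no chance to clean up
        kill -KILL "$pid" 2>/dev/null
        echo "needed KILL"
        return 1
    fi
    echo "exited after TERM"
    return 0
}
```

For the output above this would be called with the older parent PID, for example stop_then_kill 29934 30. A result of "needed KILL" would confirm that the old daemon ignores or never finishes handling a normal termination request.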