Re: Parent/Child Blocking issues
Posted: Thu Mar 16, 2017 1:51 am
by Fred Kroeger
Thanks all - let me know if there is any more info I can feed you. We got a couple of hundred notifications yesterday for the Child hosts after the Parent went down, so in my case the notifications aren't getting blocked, nor are the Child hosts seen as unreachable. The email notifications showed them as down. This is similar to the Notifications screenshot I sent previously, where there was one child behind the parent.
Fred
Re: Parent/Child Blocking issues
Posted: Thu Mar 16, 2017 12:23 pm
by bheden
I guess I do have a few questions:
Has this ever happened before?
Have you ever had an outage happen previously where you noticed that the states of child hosts were actually being set to UNREACHABLE?
Re: Parent/Child Blocking issues
Posted: Thu Mar 16, 2017 4:59 pm
by bheden
Also, looking through the source in Core - if you were to enable debugging and set verbosity to 2 - we'd probably have some useful debugging output if you were able to simulate another outage perhaps. Is this a possibility?
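For anyone following along, here is a minimal sketch of what that could look like in nagios.cfg. Only the verbosity of 2 comes from the post above; the debug_level value and the file path are assumptions (the path is the usual source-install default), so adjust them to your own installation.

```
# nagios.cfg (sketch only; paths and debug_level are assumptions)
# debug_level is a bitmask: 16 = host/service checks, 32 = notifications,
# so 48 logs both. -1 would log everything.
debug_level=48
debug_verbosity=2
debug_file=/usr/local/nagios/var/nagios.debug
max_debug_file_size=1000000
```

Core has to be restarted to pick up changes to the main config file, and it is probably worth reverting these settings after the test, since the debug file grows quickly at this verbosity.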
Re: Parent/Child Blocking issues
Posted: Fri Mar 17, 2017 2:28 am
by Fred Kroeger
I'm pretty sure that I tested this in an older version of Nagios some time ago - which is why I went down this path for this particular installation. Since all the checks are run by a Mod Gearman worker at the client's site, it made sense to make all the hosts a child of the Worker, so if we lost connection to the worker then we wouldn't get hit with an alert for every host and service.
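As a rough illustration of that layout, the object definitions would look something like the sketch below. The host names, addresses and the generic-host template are hypothetical, not taken from the actual configuration.

```
# Hypothetical names and addresses, for illustration only
define host {
    use         generic-host      ; assumes a template by this name exists
    host_name   site-a-worker     ; the Mod Gearman worker at the client's site
    address     192.0.2.10
}

define host {
    use         generic-host
    host_name   site-a-switch01
    address     192.0.2.20
    parents     site-a-worker     ; with the worker DOWN, this host is expected to go UNREACHABLE
}
```

Whether an UNREACHABLE state then suppresses the emails also depends on the notification options (the u flag) set on the hosts and contacts.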
I could schedule this test after Wednesday next week. Let me know what needs to be set and what files you want me to send.
Fred
Re: Parent/Child Blocking issues
Posted: Fri Mar 17, 2017 10:21 am
by bheden
"Since all the checks are run by a Mod Gearman worker at the client's site, it made sense to make all the hosts a child of the Worker, so if we lost connection to the worker then we wouldn't get hit with an alert for every host and service."
This still makes sense.
In regards to your initial post here, the picture with
Code:
                 /---- FIREWALL-1A ------\
                /                         \
REMOTE-SITE ---<                           >--- 2x Devices at Remote Site
                \                         /
                 \---- FIREWALL-1B ------/
This doesn't look like a ModGearman parent relationship to me. Perhaps I'm mistaken? You mention this, and then the ModGearman worker as a parent also. Did ALL of the parent/child relationships fail in such a way that ALL children of ALL parents were DOWN instead of UNREACHABLE? Or maybe some of them worked and some of them didn't? If some did work, which ones? By "which ones" - I literally mean the host names so I can match up the relationships based on the profile you've submitted.
Can you point out the host names of some of the ModGearman worker parent/child relationships? I only see one obvious one with it set as the parent for only 2 hosts.
Please provide this information so that I can review your object definitions, and then I can give you a detailed list of instructions. Which of the parent/child relationships are you going to simulate a failure for?
Also, if you're not comfortable listing those host names publicly I can accept them in a PM.
Thanks.
Re: Parent/Child Blocking issues
Posted: Mon Apr 03, 2017 10:23 pm
by Fred Kroeger
Yes - the original diagram was basically everything after the Worker.
I have PM'd you the full topology together with the hostnames/IPs so that you can follow the paths.
Re: Parent/Child Blocking issues
Posted: Tue Apr 04, 2017 9:34 am
by bheden
Just to get this off of the support team's dashboard, I'm replying. Fred, I'll respond here or reply to your PM directly when I have some meaningful information.