Re: Parent/Child Blocking issues

Posted: Wed Mar 08, 2017 6:31 pm
by Fred Kroeger
I use the Parent/Child relationship in an attempt to stop getting multiple alerts and tickets for the same event. If the link to a remote site goes down, I don't want to get an alert for each of the 100 hosts at that site. We should just get the one alert showing that there is no connectivity to the edge router and that the 100 hosts behind it are unreachable.

The use of Host dependency was brought up by Ludmil, to which I was responding.

I understand that everyone uses Nagios differently - perhaps consideration could be given to providing another directive, as you have with "host_down_disable_service_checks=1"?
So that if a Parent Host goes down, all checks on its Child hosts are disabled?

Anyway, the current setup means that when I lose connectivity to a site I get multiple alerts for that event - 1 from each Host. How do I solve this?

regards.... Fred

Re: Parent/Child Blocking issues

Posted: Thu Mar 09, 2017 1:25 pm
by avandemore
I believe the cause of your situation is the use of multiple parents. In such a configuration, you'll still get down notifications when one of the hosts goes down. This is what I see from your logs.
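One way to read the multiple-parents point, going by how Nagios Core documents host reachability: a host whose check fails is treated as UNREACHABLE only when none of its parents is up, and as DOWN as soon as at least one parent is still up. The sketch below is a simplified model of that rule, not code from Nagios itself; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state type for this sketch (not the Nagios Core definitions). */
typedef enum { HOST_UP, HOST_DOWN, HOST_UNREACHABLE } host_state;

/* Classify a host whose own check has just failed, given the current
 * states of all of its configured parents. */
host_state classify_failed_host(const host_state *parent_states, size_t parent_count)
{
    assert(parent_states != NULL || parent_count == 0);

    for (size_t i = 0; i < parent_count; i++) {
        /* One reachable parent is enough: the path to the host is intact,
         * so the host itself is considered DOWN. */
        if (parent_states[i] == HOST_UP)
            return HOST_DOWN;
    }

    /* No parents at all: the host hangs directly off the monitoring
     * server, so a failure is DOWN. Otherwise every parent is down or
     * unreachable and the host is blocked behind them. */
    return parent_count > 0 ? HOST_UNREACHABLE : HOST_DOWN;
}
```

With two parents of which only one has failed, this returns HOST_DOWN, so the child would still produce a DOWN notification; it returns HOST_UNREACHABLE only when both parents have failed.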

I'm a bit unclear as to your network setup, but often a master/slave router pair will have a single floating IP. If possible, I would define this as the Parent; it would then be a more accurate representation of your logical network, at least from a Parent/Child relationship standpoint. There are other ways to configure your setup so that it presents an accurate logical network topology to Nagios, but this is the most straightforward. If you do not wish to use host dependencies, you'll need to find a way to present the logical layout to Nagios.

The current relationship as Nagios sees it is detailed in the Network Status Map. Making a screenshot of this during good and problematic times would be a good reference as well.

Re: Parent/Child Blocking issues

Posted: Sun Mar 12, 2017 11:26 pm
by Fred Kroeger
I'm not sure you've grasped the issue; I think you have confused it with the multiple parents I have configured. If you go back to the original diagram I attached at the start of this thread, the REMOTE-SITE host is a single monitor. It uses check_muti_addr - so if both IPs don't respond, then it shows down. This is the Parent of all the devices at the site. So if this Host goes down - why am I getting notifications for the child devices?
As I said - it showed it as a blocking outage - so that part works. It's just the notifications for the child devices that are not being stopped.

Re: Parent/Child Blocking issues

Posted: Mon Mar 13, 2017 11:20 am
by avandemore
I'm not seeing all the child device notifications in the log. I've PM'd you what I'm looking at. If you can clarify what notifications are in question and provide a screenshot of your Network Status Map that would be great.

Re: Parent/Child Blocking issues

Posted: Mon Mar 13, 2017 7:54 pm
by Fred Kroeger
Done. With the number of hosts, the Network Status Map is not much help as it is so cluttered, especially as each device is shown twice due to it having two parents. I'll see if I can strip out the irrelevant bits.

Re: Parent/Child Blocking issues

Posted: Tue Mar 14, 2017 9:45 am
by avandemore
I received your information. I just want to confirm my understanding of this. You are wondering why you got a notification for the host FIREWALL-1A as it pertains to the logs you sent, correct?

Re: Parent/Child Blocking issues

Posted: Tue Mar 14, 2017 9:55 pm
by Fred Kroeger
No... look at the Notifications Page snapshot. It was a device behind the firewall. So from the original Host Down, it is two steps behind it.

Re: Parent/Child Blocking issues

Posted: Tue Mar 14, 2017 11:08 pm
by Fred Kroeger
I need to escalate this issue. There is something seriously wrong with these Parent/Child relationships.
I just lost network connection to a Mod Gearman worker that connects to 233 hosts and 1070 services.
The Mod Gearman worker is set up as the parent for *all* devices at the site. So if we can't ping the worker, then I expect that will block monitoring of all the child devices. It didn't, and I got inundated with alerts and tickets.
The Network Outages screen shows that the worker is down but then displays that there are 1695 Hosts affected. Where did the extra 1462 hosts come from? What does the Severity value indicate?
Capture.PNG
I need a resolution as to why, when a Parent Device goes down, we continue to receive alerts and notifications for all the Child devices. Timing is not a factor, as I have set up the monitoring such that the Parent will go Hard before any of the Child devices.

Re: Parent/Child Blocking issues

Posted: Wed Mar 15, 2017 12:48 pm
by lmiltchev
The "severity" is roughly calculated as follows:

severity = affected hosts + (affected services / 4)

The affected services are divided by 4 as they are considered to be 1/4 as important as hosts. From the "outages.cgi" source:

Code:

int service_severity_divisor = 4;          /* default = services are 1/4 as important as hosts */

So, in your case, you have:

severity = 1695 + (9537 / 4) = 1695 + 2384 = 4079
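The arithmetic above can be sketched in C as follows. This is only an illustration of the formula as described in this post: the helper name outage_severity is hypothetical and is not a function in outages.cgi. Both counters are assumed to be integers, which is consistent with 9537 / 4 coming out as 2384 rather than 2384.25.

```c
#include <assert.h>

/* Same default as the line quoted from outages.cgi above. */
static const int service_severity_divisor = 4;

/* Hypothetical helper (not part of outages.cgi): combine the host and
 * service counters the way the formula in this post describes.
 * Integer division drops the fractional part of the service term. */
int outage_severity(int affected_hosts, int affected_services)
{
    assert(affected_hosts >= 0 && affected_services >= 0);
    return affected_hosts + (affected_services / service_severity_divisor);
}
```

Calling outage_severity(1695, 9537) gives 4079, matching the figure worked out above.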

Having said that, I was not able to recreate the issue. Here's what I tried.

1. I set up a parent and 2 children.

2. I brought the parent down, then I brought the children down.

3. On the "Host Status" page, the parent is shown as "Down", but the children are shown as "Unreachable". I did get a notification about the parent, but not about the children or their services.

4. So, basically I have 3 hosts affected (parent & 2 children) and 21 services (4 on the parent + 3 on the 1st child + 14 on the 2nd child):

severity = 3 + (21 / 4) = 3 + 5 = 8

See images below:

Status Map
example01.PNG
Network Outages
example02.PNG
Can you open an email support ticket in our email ticketing system and send your profile?

Re: Parent/Child Blocking issues

Posted: Wed Mar 15, 2017 1:43 pm
by bheden
Fred,

This has been escalated at least informally and I'll be trying to replicate in-house before I can offer you a solution. Having said that, we've already tried a few things with some basic config duplication and are unable thus far to see the same thing you're seeing.

We did figure out why the numbers are skewed, though. If you have a host (Host E) that has 3 parents defined (Host D, C, B), and each of those parents has the same parent defined (Host A) - then for whatever reason Host E is counted 3 times. So an outage in that case would count for 7 hosts, even though only 5 were truly affected. This same weird math occurs for services as well. We'll be looking into this further and possibly filing a bug.
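The skewed count can be reproduced with a toy model of that A/B/C/D/E layout. The sketch below is not the code from outages.cgi; it only assumes that the affected hosts are totalled by walking every parent-to-child link without remembering which hosts were already visited, and compares that with a walk that counts each host once. All names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { HOST_A, HOST_B, HOST_C, HOST_D, HOST_E, HOST_COUNT };

/* children[p][c] is true when host c lists host p as a parent. */
static const bool children[HOST_COUNT][HOST_COUNT] = {
    [HOST_A] = { [HOST_B] = true, [HOST_C] = true, [HOST_D] = true },
    [HOST_B] = { [HOST_E] = true },
    [HOST_C] = { [HOST_E] = true },
    [HOST_D] = { [HOST_E] = true },
};

/* Naive walk: a host is counted once per path that reaches it, so
 * Host E is counted three times (via B, C and D). */
int count_affected_naive(int host)
{
    assert(host >= 0 && host < HOST_COUNT);
    int total = 1;
    for (int c = 0; c < HOST_COUNT; c++)
        if (children[host][c])
            total += count_affected_naive(c);
    return total;
}

/* De-duplicated walk: the seen[] array makes sure each host is
 * counted only the first time it is reached. */
static int count_unique_from(int host, bool seen[HOST_COUNT])
{
    if (seen[host])
        return 0;
    seen[host] = true;
    int total = 1;
    for (int c = 0; c < HOST_COUNT; c++)
        if (children[host][c])
            total += count_unique_from(c, seen);
    return total;
}

int count_affected_unique(int host)
{
    assert(host >= 0 && host < HOST_COUNT);
    bool seen[HOST_COUNT] = { false };
    return count_unique_from(host, seen);
}
```

For an outage at Host A, the naive walk returns 7 and the de-duplicated walk returns 5, which are the two numbers mentioned above.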

Anyway, I'll keep you posted.