Parent/Child Blocking issues

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Re: Parent/Child Blocking issues

Post by Fred Kroeger »

I use the Parent/Child relationship in an attempt to stop getting multiple alerts and tickets for the same event. If the link to a remote site goes down, I don't want to get an alert for each of the 100 hosts at that site. We should get just the one alert, showing that there is no connectivity to the edge router and that the 100 hosts behind it are unreachable.

The use of host dependencies was brought up by Ludmil, to whom I was responding.

I understand that everyone uses Nagios differently. Perhaps consideration could be given to providing another directive, as you have with "host_down_disable_service_checks=1", so that if a Parent host goes down, all checks on its Child hosts are disabled?

Anyway, the current setup means that when I lose connectivity to a site I get multiple alerts for that one event, one from each host. How do I solve this?

regards.... Fred
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Post by avandemore »

I believe the cause of your situation is the use of multiple parents. In such a configuration, you'll still get down notifications when one of the hosts goes down. This is what I see from your logs.
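
As I understand the reachability logic described in the Nagios Core documentation, a host that fails its check is only marked UNREACHABLE when all of its parents are down or unreachable; if any parent is still up, it is marked DOWN. A rough Python sketch of that rule follows. The function name is made up for illustration, and this is not Nagios code:

```python
def classify_failed_host(parent_states):
    """Return the state for a host whose own check has failed.

    parent_states: states ("UP", "DOWN" or "UNREACHABLE") of the host's
    immediate parents. An empty list means the host has no parents.
    """
    # With no parents, or with at least one parent still UP, the path to
    # the host is considered intact, so the failure is its own: DOWN.
    if not parent_states or any(state == "UP" for state in parent_states):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE, so the host is blocked.
    return "UNREACHABLE"


# Two parents, only one of them down: the child is still reported as DOWN.
print(classify_failed_host(["DOWN", "UP"]))
# Both parents down: the child is reported as UNREACHABLE.
print(classify_failed_host(["DOWN", "DOWN"]))
```

With two parents defined, both have to be down before a failed child is treated as blocked rather than down.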

I'm a bit unclear as to your network setup, but often a master/slave router pair will have a single floating IP. If possible, I would define this as the Parent; it would then be a more accurate representation of your logical network, at least from a Parent/Child relationship standpoint. There are other ways to configure your setup so that it presents an accurate logical network topology to Nagios, but this is the most straightforward. If you do not wish to use host dependencies, you'll need to find a way to present the logical layout to Nagios.

The current relationship as Nagios sees it is detailed in the Network Status Map. Making a screenshot of this during good and problematic times would be a good reference as well.
Previous Nagios employee

Post by Fred Kroeger »

I'm not sure you've grasped the issue; I think you have confused it with the multiple parents I have configured. If you go back to the original diagram I attached at the start of this thread, the REMOTE-SITE host is a single monitor. It uses check_muti_addr, so if both IPs don't respond, it shows down. This is the Parent of all the devices at the site. So if this host goes down, why am I getting notifications for the child devices?
As I said, it showed as a blocking outage, so that part works. It's just the notifications for the child devices that are not being stopped.

Post by avandemore »

I'm not seeing all the child device notifications in the log. I've PM'd you what I'm looking at. If you can clarify what notifications are in question and provide a screenshot of your Network Status Map that would be great.

Post by Fred Kroeger »

Done. With the number of hosts, the Network Status Map is not much help as it is so cluttered, especially as each device is shown twice due to it having two parents. I'll see if I can strip out the irrelevant bits.

Post by avandemore »

I received your information. I just want to confirm my understanding of this: you are wondering why you got a notification for the host FIREWALL-1A, as it pertains to the logs you sent, correct?

Post by Fred Kroeger »

No, look at the Notifications page snapshot. It was a device behind the firewall, so it is two steps behind the original Host Down.

Post by Fred Kroeger »

I need to escalate this issue. There is something seriously wrong with these Parent/Child relationships.
I just lost network connection to a Mod Gearman worker that connects to 233 hosts and 1070 services.
The Mod Gearman worker is set up as the parent for *all* devices at the site. So if we can't ping the worker, I expect that to block monitoring of all the child devices. It didn't, and I got inundated with alerts and tickets.
The Network Outages screen shows that the worker is down but then displays that there are 1695 hosts affected. Where did the extra 1462 hosts come from? What does the Severity value indicate?
Capture.PNG
I need a resolution as to why, when a Parent device goes down, we continue to receive alerts and notifications for all the Child devices. Timing is not a factor, as I have set up the monitoring such that the Parent will go Hard before any of the Child devices.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Post by lmiltchev »

The "severity" is roughly calculated as follows:

severity = affected hosts + (affected services / 4)

The affected services are divided by 4 as they are considered to be 1/4 as important as hosts. From the "outages.cgi":

int service_severity_divisor = 4;          /* default = services are 1/4 as important as hosts */

So, in your case, you have:

severity = 1695 + (9537 / 4) = 1695 + 2384 = 4079
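
For anyone wanting to check the arithmetic, here is a minimal Python sketch of the formula above. It is not the actual CGI code: the function name `outage_severity` is made up for illustration, and it assumes integer division, which is what the rounded-down figures in this post imply.

```python
SERVICE_SEVERITY_DIVISOR = 4  # mirrors the default quoted from the CGI source


def outage_severity(affected_hosts: int, affected_services: int,
                    divisor: int = SERVICE_SEVERITY_DIVISOR) -> int:
    """Hypothetical helper: hosts count fully, services count 1/divisor each."""
    # Integer division drops the remainder (an assumption based on the
    # rounded-down figures in this post).
    return affected_hosts + affected_services // divisor


# Figures reported on the Network Outages screen in this thread.
print(outage_severity(1695, 9537))  # prints 4079
```

The same function gives 8 for the small test case in this post (3 hosts and 21 services).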

Having said that, I was not able to recreate the issue. Here's what I tried.

1. I set up a parent and 2 children.

2. I brought the parent down, then I brought the children down.

3. Under the "Host Status" page, the parent is shown as "Down", but the children are shown as "Unreachable". I did get a notification about the parent, but not about the children or their services.

4. So, basically I have 3 hosts affected (parent & 2 children) and 21 services (4 on the parent + 3 on the 1st child + 14 on the 2nd child):

severity = 3 + (21 / 4) = 3 + 5 = 8

See images below:

Status Map: example01.PNG
Network Outages: example02.PNG

Can you open an email support ticket in our email ticketing system and send your profile?
bheden
Product Development Manager
Posts: 179
Joined: Thu Feb 13, 2014 9:50 am
Location: Nagios Enterprises

Post by bheden »

Fred,

This has been escalated at least informally and I'll be trying to replicate in-house before I can offer you a solution. Having said that, we've already tried a few things with some basic config duplication and are unable thus far to see the same thing you're seeing.

We did figure out why the numbers are skewed, though. If you have a host (Host E) that has 3 parents defined (Hosts D, C, and B), and each of those parents has the same parent defined (Host A), then for whatever reason Host E is counted 3 times. So an outage of Host A would count for 7 hosts (A, B, C, D, plus E three times), even though only 5 were truly affected. This same weird math occurs for services as well. We'll be looking into this further and possibly filing a bug.
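
To illustrate the kind of counting that could produce those numbers, here is a small Python sketch. It is only a guess at the mechanism, not the actual Nagios code: the `children` map and both function names are made up for the example. A naive recursive walk reaches Host E once per parent, while tracking the hosts already visited counts it once.

```python
# Hypothetical topology from the example: A is the parent of B, C and D,
# and each of those is a parent of E.
children = {
    "A": ["B", "C", "D"],
    "B": ["E"],
    "C": ["E"],
    "D": ["E"],
    "E": [],
}


def count_naive(host):
    """Count the host plus everything below it, once per path to it."""
    return 1 + sum(count_naive(child) for child in children[host])


def count_unique(host, seen=None):
    """Count each affected host once, however many parents lead to it."""
    if seen is None:
        seen = set()
    if host in seen:
        return 0  # already counted via another parent
    seen.add(host)
    return 1 + sum(count_unique(child, seen) for child in children[host])


print(count_naive("A"), count_unique("A"))  # prints: 7 5
```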

Anyway, I'll keep you posted.
Nagios Enterprises
Senior Developer