Parent Relationship Not Suppressing Downstream Notifications

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Smark
Posts: 32
Joined: Tue Jan 08, 2013 6:12 pm

Parent Relationship Not Suppressing Downstream Notifications

Post by Smark »

Hello,

We have our configuration structured like this:

[Host: Nagios Server]
--> XXX.XXX.217.1 [Host: Nagios' Subnet Uplink]
----> XXX.XXX.218.1 [Host: Remote Host Subnet Uplink]
------> XXX.XXX.218.20 [Host: Remote Host]
--------> [Service: Remote Host Service]

As I understand it, if 218.1 is unreachable I should not get notifications for 218.20 or the services running on it. If 218.20 appears as down, Nagios will immediately check its parent (218.1) to see if it is responding. If the parent responds, Nagios sends an alert for 218.20; if it does not, Nagios checks 217.1, and so on.

Last night we had an outage somewhere above the 218.20 level (either 218.1 or 217.1) and we still got notifications for all of our hosts on the 218 subnet. Is there something I'm missing? At the moment we have 218.20's parent set to the 218.1 host, and 218.1's parent set to the 217.1 host. Is this correct? Where am I going wrong?

Thanks!
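
The reachability logic described in this post can be sketched in Python. This is only a simplified model, not how Nagios is implemented: it assumes a failing host counts as DOWN when it has no parents or at least one parent is UP, and as UNREACHABLE when every parent is itself DOWN or UNREACHABLE. The `PARENTS` map and the `classify` function are hypothetical names, and the three-host chain is the simplified one from the first post.

```python
from typing import Dict, List, Set

# Hypothetical, simplified chain based on the first post.
# Each host maps to its parents; an empty list means "no parent defined".
PARENTS: Dict[str, List[str]] = {
    "217.1": [],
    "218.1": ["217.1"],
    "218.20": ["218.1"],
}


def classify(host: str, failing: Set[str], parents: Dict[str, List[str]]) -> str:
    """Return "UP", "DOWN", or "UNREACHABLE" for one host.

    `failing` is the set of hosts whose own check currently fails.
    Assumes the parent graph has no cycles.
    """
    if host not in failing:
        return "UP"
    host_parents = parents.get(host, [])
    # A failing host with no parents is simply DOWN.
    if not host_parents:
        return "DOWN"
    # A failing host with at least one UP parent is DOWN.
    if any(classify(p, failing, parents) == "UP" for p in host_parents):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE, so the fault lies upstream.
    return "UNREACHABLE"


# Outage at 218.1: the router is DOWN, the host behind it is UNREACHABLE.
outage_218 = {h: classify(h, {"218.1", "218.20"}, PARENTS) for h in PARENTS}

# Outage at 217.1: only the topmost failing host is DOWN.
outage_217 = {h: classify(h, {"217.1", "218.1", "218.20"}, PARENTS) for h in PARENTS}
```

In this model, 218.20 is only classified as DOWN if 218.1 is still considered UP at the moment 218.20's state is decided.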
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by tmcdonald »

Can you post the configs for those relationships? If you need to censor anything out please do.
Former Nagios employee
Smark
Posts: 32
Joined: Tue Jan 08, 2013 6:12 pm

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by Smark »

In reality the path from Nagios (217.140) to XXXDC01 (218.49) goes like this:
217.150 (Nagios - OfficeThree)
217.1 (Router - OfficeThree)
179.10 (ISP Uplink - OfficeThree)
179.9 (ISP Uplink - OfficeTwo)
179.17 / 179.18 (Redundant ISP Uplinks - OfficeOne)
218.1 (Router - OfficeOne)
218.49 (XXXDC01 - OfficeOne)

The path physically goes from Nagios (OfficeThree) to OfficeTwo to OfficeOne to the local router in OfficeOne to the end host XXXDC01.

Sorry if this appears confusing, but these were all created by NagiosQL:

Code:

define host {
        host_name                       XXXDC01
        use                             Domain Controller
        alias                           XXXDC01
        display_name                    XXXDC01
        address                         XXX.XXX.218.49
        parents                         XXX.XXX.218.1 Switch OfficeOne
        hostgroups                      Domain Controllers Prod
        check_command                   check-host-alive!!!!!!!!
        max_check_attempts              5
        register                        1
        }

define host {
        host_name                       XXX.XXX.218.1 Switch OfficeOne
        use                             XXX Switch
        alias                           XXX.XXX.218.1 Switch OfficeOne
        display_name                    XXX.XXX.218.1 Switch OfficeOne
        address                         XXX.XXX.218.1
        parents                         XXX.XXX.179.17 GlobalNet OfficeOne ISPName Backup,XXX.XXX.179.18 GlobalNet OfficeOne ISPName Primary
        hostgroups                      XXX Switches
        check_command                   check-host-alive!!!!!!!!
        max_check_attempts              5
        register                        1
        }

define host {
        host_name                       XXX.XXX.179.17 GlobalNet OfficeOne ISPName Backup
        use                             XXX Switch
        alias                           XXX.XXX.179.17 GlobalNet OfficeOne ISPName Backup
        display_name                    XXX.XXX.179.17 GlobalNet OfficeOne ISPName Backup
        address                         XXX.XXX.179.17
        parents                         XXX.XXX.179.9 GlobalNet OfficeTwo ISPName
        hostgroups                      XXX Switches
        check_command                   check-host-alive!!!!!!!!
        max_check_attempts              5
        register                        1
        }

define host {
        host_name                       XXX.XXX.179.18 GlobalNet OfficeOne ISPName Primary
        use                             XXX Switch
        alias                           XXX.XXX.179.18 GlobalNet OfficeOne ISPName Primary
        display_name                    XXX.XXX.179.18 GlobalNet OfficeOne ISPName Primary
        address                         XXX.XXX.179.18
        parents                         XXX.XXX.179.9 GlobalNet OfficeTwo ISPName
        hostgroups                      XXX Switches
        check_command                   check-host-alive!!!!!!!!
        max_check_attempts              5
        register                        1
        }

define host {
        host_name                       XXX.XXX.179.9 GlobalNet OfficeTwo ISPName
        use                             YYY Switch
        alias                           XXX.XXX.179.9 GlobalNet OfficeTwo ISPName
        display_name                    XXX.XXX.179.9 GlobalNet OfficeTwo ISPName
        address                         XXX.XXX.179.9
        parents                         XXX.XXX.179.10 GlobalNet OfficeTwo Company
        hostgroups                      YYY Switches
        check_command                   check-host-alive!!!!!!!!
        register                        1
        }

define host {
        host_name                       XXX.XXX.179.10 GlobalNet OfficeTwo Company
        use                             YYY Switch
        alias                           XXX.XXX.179.10 GlobalNet OfficeTwo Company
        display_name                    XXX.XXX.179.10 GlobalNet OfficeTwo Company
        address                         XXX.XXX.179.10
        parents                         XXX.XXX.217 OfficeThree Core
        hostgroups                      YYY Switches
        check_command                   check-host-alive!!!!!!!!
        register                        1
        }

define host {
        host_name                       XXX.XXX.217 OfficeThree Core
        use                             ZZZ Switch
        alias                           XXX.XXX.217 OfficeThree Core
        display_name                    XXX.XXX.217 OfficeThree Core
        address                         XXX.XXX.217.1
        hostgroups                      ZZZ Switches
        check_command                   check-host-alive!!!!!!!!
        register                        1
        }
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by slansing »

I just want to clear up the first post. When you assign a parent host to another host and that parent host goes down, the host you assigned it to will appear unknown, as will its services. As I am sure you are aware, this is done to simulate real network reachability, since the child host cannot get through to Nagios if its parent is offline.
Smark
Posts: 32
Joined: Tue Jan 08, 2013 6:12 pm

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by Smark »

slansing wrote: I just want to clear up the first post. When you assign a parent host to another host and that parent host goes down, the host you assigned it to will appear unknown, as will its services. As I am sure you are aware, this is done to simulate real network reachability, since the child host cannot get through to Nagios if its parent is offline.
Hi,

Yes. My initial post may have been worded oddly. We use it purely for network-reachability resolutions. The only hosts that are parents are switching equipment. For hosts, their parent is their local switch. For switches, it's their local router.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by scottwilkerson »

Smark wrote: Last night we had an outage somewhere above the 218.20 level (either 218.1 or 217.1) and we still got notifications for all of our hosts on the 218 subnet. Is there something I'm missing? At the moment we have 218.20's parent set to the 218.1 host, and 218.1's parent set to the 217.1 host. Is this correct? Where am I going wrong?

Thanks!
Did the notification say the hosts were DOWN or UNREACHABLE?

Many people disable UNREACHABLE notifications to eliminate this situation.
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
Smark
Posts: 32
Joined: Tue Jan 08, 2013 6:12 pm

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by Smark »

scottwilkerson wrote:
Smark wrote: Last night we had an outage somewhere above the 218.20 level (either 218.1 or 217.1) and we still got notifications for all of our hosts on the 218 subnet. Is there something I'm missing? At the moment we have 218.20's parent set to the 218.1 host, and 218.1's parent set to the 217.1 host. Is this correct? Where am I going wrong?

Thanks!
Did the notification say the hosts were DOWN or UNREACHABLE?

Many people disable UNREACHABLE notifications to eliminate this situation.
The notification said "DOWN". Like you said, we also have UNREACHABLE notifications disabled.
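
The DOWN versus UNREACHABLE distinction matters because of how the host notification filter works. The following is a minimal sketch, assuming the usual host `notification_options` letters (`d` for DOWN, `u` for UNREACHABLE, `r` for recovery); `passes_filter` is a hypothetical helper, not a Nagios function.

```python
# Map each host state to the notification_options letter that enables it.
STATE_TO_OPTION = {"DOWN": "d", "UNREACHABLE": "u", "UP": "r"}


def passes_filter(state: str, notification_options: str) -> bool:
    """Return True if a notification for `state` is allowed by the options string."""
    enabled = {opt.strip() for opt in notification_options.split(",")}
    return STATE_TO_OPTION[state] in enabled


# With UNREACHABLE notifications disabled (options "d,r"):
down_allowed = passes_filter("DOWN", "d,r")                 # True
unreachable_allowed = passes_filter("UNREACHABLE", "d,r")   # False
```

So with `u` left out, a notification only goes out for a child host if that host was classified as DOWN rather than UNREACHABLE.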
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by scottwilkerson »

I did notice that some of these have the following specified, and others do not:

Code:

max_check_attempts              5
It is possible that the hosts without this setting would reach a HARD DOWN state before the parent had used up its max_check_attempts and been marked down.
Smark
Posts: 32
Joined: Tue Jan 08, 2013 6:12 pm

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by Smark »

scottwilkerson wrote: I did notice that some of these have the following specified, and others do not:

Code:

max_check_attempts              5
It is possible that the hosts without this setting would reach a HARD DOWN state before the parent had used up its max_check_attempts and been marked down.
That part of our network has an inherently unreliable connection, so its max_check_attempts is 5; all other objects in Nagios use 2. Everything in that portion of the network (meaning the switches and the hosts) has the max attempts set to 5.

If I'm understanding correctly, when an end host gets a hard down, Nagios goes to do a full check of its parent. At what point does it say "this parent is up/down"? Is it after max_check_attempts, or some other setting?
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Parent Relationship Not Suppressing Downstream Notificat

Post by scottwilkerson »

A host is considered DOWN once it goes HARD down (that is, it has been down for max_check_attempts * retry_interval).
It is considered UP as soon as a check attempt comes back UP.
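
As a rough arithmetic sketch of that rule of thumb, the Python below compares how long two hosts take to reach a HARD state. The `retry_interval` of 1 minute is an assumption (the thread does not give one), the attempt counts 2 and 5 are the two values mentioned earlier in the thread, and `approx_minutes_to_hard` is a hypothetical helper. Real scheduling in Nagios will differ somewhat from this estimate.

```python
def approx_minutes_to_hard(max_check_attempts: int, retry_interval_min: float) -> float:
    """Rule of thumb from the thread: max_check_attempts * retry_interval."""
    return max_check_attempts * retry_interval_min


# Assumed retry_interval of 1 minute for both hosts (not stated in the thread).
child_minutes = approx_minutes_to_hard(2, 1.0)   # host using 2 attempts
parent_minutes = approx_minutes_to_hard(5, 1.0)  # parent using 5 attempts

# If the child reaches HARD first, its parent has not yet been marked DOWN.
child_hard_first = child_minutes < parent_minutes
```

Under these assumptions the child reaches a HARD state minutes before its parent does, which is the window in which a DOWN notification for the child could be sent.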