Parent/Child Relationships head scratcher

mcampbell · Post by **mcampbell** » Wed May 29, 2013 4:06 pm

So I have a question. I have two Nagios setups that are up and configured, and working well to my liking, save for one point: parent/child relationships. I went ahead and added the parents field to all of my hosts, disabled unknown notifications, and now the map reflects our network fairly accurately, as Nagios sees it. I also set up some monitoring of some resources external to our office, and set our router as the parent to these resources. When our router died recently (something Nagios acknowledged), I was deluged with dozens upon dozens of emails about each and every host we monitor outside of our network being down. I've been baffled by this, as the map seems to show the correct layout, which suggests to me that it has the right information on the parent/child relationships. I've read through a bunch of how-tos on the web, and they all say that what I've done should be enough: you simply put in the parents field with the appropriate hostname, and Nagios is smart enough to take care of the rest. Is there something that I'm missing here?

sreinhardt · Post by **sreinhardt** » Thu May 30, 2013 10:46 am

Could your checks on those hosts/services have started before the parent went down? It sounds like you do have things set up correctly. However, keep in mind that if Nagios detects a problem with a host before it detects that the parent is down, it may alert for that problem as well.

mcampbell · Post by **mcampbell** » Thu May 30, 2013 10:56 am

That does make sense to me. However, once it determined that the router was in fact down, wouldn't it change the child hosts to unreachable? In my recent experience with the downed router, it did determine that the router was down, but never changed the others to unreachable, and continued sending me emails for each child once every 10 minutes for the few hours it was down.

abrist · Post by **abrist** » Thu May 30, 2013 3:00 pm

mcampbell wrote: That does make sense to me. However, once it determined that the router was in fact down, wouldn't it change the child hosts to unreachable?

It absolutely should. Could you post a few examples, specifically, the parent config, one of the children's configs, and one of the service configs that were alerting after the outage?

mcampbell · Post by **mcampbell** » Thu May 30, 2013 4:08 pm

Sure thing.

Here's the router config:

define host{
        name                            generic-host            ; The name of this host template
        check_period                    24x7                    ; By default, hosts are checked round the clock
        check_command                   check-host-alive        ; Default command to check if servers are "alive"
        check_interval                  1                       ; Actively check the host every minute
        retry_interval                  1                       ; Schedule host check retries at 1 minute intervals
        notifications_enabled           1                       ; Host notifications are enabled
        event_handler_enabled           1                       ; Host event handler is enabled
        max_check_attempts              5                       ; Check each host 5 times (max)
        flap_detection_enabled          1                       ; Flap detection is enabled
        failure_prediction_enabled      1                       ; Failure prediction is enabled
        process_perf_data               1                       ; Process performance data
        retain_status_information       1                       ; Retain status information across program restarts
        retain_nonstatus_information    1                       ; Retain non-status information across program restarts
        notification_period             24x7                    ; Send host notifications at any time
        notification_options            d,r                     ; Only send notifications for down & recovered states
        contact_groups                  admins,afterhours       ; Notifications get sent to the admins and afterhours groups
        register                        0                       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
        }

define host{
        name                    generic-switch          ; The name of this host template
        use                     generic-host            ; Inherit default values from the generic-host template
        check_period            24x7                    ; By default, switches are monitored round the clock
        check_interval          2                       ; Switches are checked every 2 minutes
        retry_interval          1                       ; Schedule host check retries at 1 minute intervals
        max_check_attempts      5                       ; Check each switch 5 times (max)
        check_command           check-host-alive        ; Default command to check if switches are "alive"
        notification_period     24x7                    ; Send notifications at any time
        notification_interval   10                      ; Resend notifications every 10 minutes
        notification_options    d,r                     ; Only send notifications for down & recovered states
        statusmap_image         switch.png
        register                0                       ; DON'T REGISTER THIS - IT'S JUST A TEMPLATE
        }

define host{
        name                    generic-router          ; The name of this host template
        use                     generic-switch          ; Inherit default values from the generic-switch template
        statusmap_image         firewall.png
        register                0                       ; DON'T REGISTER THIS - IT'S JUST A TEMPLATE
        }

define host{
        use				generic-router
        host_name		pfsense
        address			10.0.0.1
        parents			bsh_office
        host_groups		Routers and Switches
        }

And here's a couple of the external ones that complained:

define host{
        name                    24x7-server             ; The name of this host template
        use                     generic-host            ; Inherit default values from the generic-host template
        notification_period     24x7                    ; Send notifications out at any time - day or night
        notification_interval   10                      ; Resend notifications every 10 minutes
        register                0                       ; DON'T REGISTER THIS - IT'S JUST A TEMPLATE
        }

define host{
        use				24x7-server
        host_name		098
        address			<redacted>.98
        parents			pfsense
        statusmap_image	remoteserver.png
        host_groups		External Servers
        }

define host{
        use				24x7-server
        host_name		099
        address			<redacted>.99
        parents			pfsense
        statusmap_image	remoteserver.png
        host_groups		External Servers
        }

After looking through my email log for the hosts & services that alerted me, I discovered that Nagios didn't actually alert me about any services. They still didn't show up as unreachable I'm pretty sure, but I didn't receive an email for it.

abrist · Post by **abrist** » Fri May 31, 2013 11:08 am

Your configs look good. When a host is down or unreachable, Nagios stops running active service checks against that host in order to reduce unnecessary checking and alerts. The hosts should not have notified either, though, if their parent was down. I presume that all the notifications you received were concerning unreachable hosts behind a down parent?
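
For reference, the reachability rule being discussed can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Nagios's actual code: a host that fails its check is treated as DOWN if it has no parents or if at least one parent is UP, and as UNREACHABLE only when every parent is DOWN or UNREACHABLE. The function name and the dictionaries are made up for the example; the host names come from the configs posted above.

```python
def classify_failed_host(host, parents, states):
    """Return the state assigned to a host whose check just failed.

    parents: dict mapping host name -> list of parent host names
    states:  dict mapping host name -> current state ("UP", "DOWN", "UNREACHABLE")
    """
    host_parents = parents.get(host, [])
    # A host with no parents is on the monitoring server's own segment: it can only be DOWN.
    if not host_parents:
        return "DOWN"
    # If any parent is still UP, the path is fine and the host itself is at fault.
    if any(states.get(p) == "UP" for p in host_parents):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE, so the host cannot be reached at all.
    return "UNREACHABLE"


parents = {"pfsense": ["bsh_office"], "098": ["pfsense"], "099": ["pfsense"]}

# Router known to be down: the child is classified as UNREACHABLE.
print(classify_failed_host("098", parents, {"bsh_office": "UP", "pfsense": "DOWN"}))

# Router still believed to be UP: the child is classified as DOWN.
print(classify_failed_host("098", parents, {"bsh_office": "UP", "pfsense": "UP"}))
```

With `notification_options d,r` as in the templates above, only the DOWN result would generate host notification email; an UNREACHABLE result would not.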

mcampbell · Post by **mcampbell** » Fri May 31, 2013 1:16 pm

For the purposes of this discussion, yes, it's just the ones behind the router. I did find some custom checks monitoring our ESXi server complaining, but it turned out that when I created the commands for those checks, I accidentally told them to use $HOSTNAME$ instead of $HOSTADDRESS$. Since our router serves up DNS via DHCP, those checks failed when it died. I've fixed them so they point to the IP address instead.
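
The difference between the two macros is easy to see with a toy expansion. The sketch below is not Nagios code: `expand_macros`, the host values, and the exact command line are hypothetical. It only shows that `$HOSTNAME$` expands to the `host_name` field (which the plugin then has to resolve through DNS), while `$HOSTADDRESS$` expands to the `address` field.

```python
def expand_macros(command_line, host):
    """Replace the two host macros discussed here with values from a host definition."""
    return (command_line
            .replace("$HOSTNAME$", host["host_name"])
            .replace("$HOSTADDRESS$", host["address"]))


# Hypothetical host definition, reduced to the two relevant fields.
esxi = {"host_name": "esxi01", "address": "10.0.0.20"}

by_name = expand_macros("check_ping -H $HOSTNAME$ -w 100,20% -c 500,60%", esxi)
by_addr = expand_macros("check_ping -H $HOSTADDRESS$ -w 100,20% -c 500,60%", esxi)

print(by_name)  # the plugin must look up "esxi01" in DNS before it can ping
print(by_addr)  # the plugin pings 10.0.0.20 directly, with no DNS lookup
```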

Now while preparing additional data for this post, I just discovered something. I was looking at the log file navigation in Nagios, specifically pertaining to the pfsense router, and I discovered that while pfsense has 5 tries before it sends an email (per its max_check_attempts flag), the logs only show it going to the second try. Could it be that the others are not getting converted to unreachable because the pfsense router checks don't make it to 5, thus not converting it to a hard down?

[05-23-2013 14:33:07] HOST ALERT: pfsense;DOWN;SOFT;2;CRITICAL - Host Unreachable (10.0.0.1)
[05-23-2013 14:29:21] HOST ALERT: pfsense;DOWN;SOFT;1;CRITICAL - Host Unreachable (10.0.0.1)
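
To make the SOFT/HARD question concrete, here is a rough Python model of how a host's attempt counter behaves. It is a simplification under stated assumptions, not Nagios's scheduler: each failed check increments the attempt number, the state stays SOFT until the attempt number reaches `max_check_attempts`, and a successful check resets the counter. `simulate` and its input format are invented for the example, and the output strings only imitate the `DOWN;SOFT;2` part of the log lines above.

```python
def simulate(results, max_check_attempts=5):
    """Return a timeline of states for a sequence of host check results.

    results: iterable of booleans, True meaning the host check succeeded.
    """
    timeline = []
    attempt = 0
    for ok in results:
        if ok:
            attempt = 0  # a successful check resets the retry counter
            timeline.append("UP")
            continue
        attempt = min(attempt + 1, max_check_attempts)
        state_type = "HARD" if attempt == max_check_attempts else "SOFT"
        timeline.append(f"DOWN;{state_type};{attempt}")
    return timeline


# Five consecutive failures reach a HARD state on the fifth attempt.
print(simulate([False] * 5))

# A successful check after the second attempt starts the count over at 1.
print(simulate([False, False, True, False, False]))
```

In this model a host only reaches HARD DOWN after five consecutive failed checks. It does not by itself explain why the log above stops at attempt 2.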

abrist · Post by **abrist** » Fri May 31, 2013 2:26 pm

mcampbell wrote: Could it be that the others are not getting converted to unreachable because the pfsense router checks don't make it to 5, thus not converting it to a hard down?

This could most definitely be it. Did the pfsense router ever go fully down, or did it recover?

mcampbell · Post by **mcampbell** » Fri May 31, 2013 3:02 pm

Over the course of that day, the router had gone down several times. There were some recoveries, but at one point it was pretty hard down for some time. Each time, the pfsense OS kernel would panic due to some flaky hardware (long story), but when it started panicking only minutes after it rebooted, it required bringing up an entirely new one, during which it was down for a good half hour or more.

mcampbell · Post by **mcampbell** » Mon Jun 03, 2013 8:08 am

So I guess the next question really is, why is Nagios only checking it twice, instead of all 5 times that it's supposed to?

Nagios Support Forum
