Parent/Child Relationships head scratcher
So I have a question. I have two Nagios setups that are up and configured to my liking, save for one point: parent/child relationships. I went ahead and added the parents field to all of my hosts, disabled unknown notifications, and now the map reflects our network fairly accurately, as Nagios sees it. I also set up monitoring of some resources external to our office, and set our router as the parent of these resources. When our router died recently (something Nagios acknowledged), I was deluged with dozens upon dozens of emails about each and every host we monitor outside of our network being down. I've been baffled by this, as the map seems to show the correct layout, which suggests to me that Nagios has the right information on the parent/child relationships. I've read through a bunch of how-tos on the web, and they all say that what I've done should be enough: you simply put in the parents field with the appropriate hostname, and Nagios is smart enough to take care of the rest. Is there something that I'm missing here?
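For readers following along, a minimal sketch of the kind of parent/child definition described above. The host names and addresses are hypothetical placeholders (documentation address ranges), and the sketch assumes a `generic-host` template that supplies the remaining required directives; it is not the poster's actual configuration.

```
# Hypothetical example: a router and one host that sits behind it.
define host{
use generic-host ; assumes a template with this name exists
host_name example-router
address 192.0.2.1
}
define host{
use generic-host
host_name example-remote-server
address 198.51.100.10
parents example-router ; marks example-router as this host's parent
}
```

With a layout like this, the expectation discussed in this thread is that the child is reported as UNREACHABLE rather than DOWN once the parent is known to be down.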
- -fno-stack-protector
Re: Parent/Child Relationships head scratcher
Could your checks on those hosts/services have started before the parent check failed? It sounds like you do have things set up correctly. However, if Nagios detects a problem with a child before it detects that the parent is down, it may alert on those problems as well.
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
Re: Parent/Child Relationships head scratcher
That does make sense to me. However, once it determined that the router was in fact down, wouldn't it change the child hosts to unreachable? In my recent experience with the downed router, it did determine that the router was down, but it never changed the others to unreachable, and it kept sending me emails for each child every 10 minutes for the few hours it was down.
Re: Parent/Child Relationships head scratcher
mcampbell wrote: That does make sense to me, however, once it determined that the router was in fact down, wouldn't it change the child hosts to unreachable?
It absolutely should. Could you post a few examples, specifically the parent config, one of the children's configs, and one of the service configs that were alerting after the outage?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Parent/Child Relationships head scratcher
Sure thing.
After looking through my email log for the hosts and services that alerted me, I discovered that Nagios didn't actually alert me about any services. I'm pretty sure they still didn't show up as unreachable, but I didn't receive an email about them.
Here's the router config, followed by a couple of the external ones that complained:
Code: Select all
define host{
name generic-host ; The name of this host template
check_period 24x7 ; By default, hosts are checked round the clock
check_command check-host-alive ; Default command to check if servers are "alive"
check_interval 1 ; Actively check the host every minute (assuming the default interval_length of 60 seconds)
retry_interval 1 ; Schedule host check retries at 1 minute intervals
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
max_check_attempts 5 ; Check each host 5 times (max)
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restart
notification_period 24x7 ; Send host notifications at any time
notification_options d,r ; Only send notifications for down & recovered states
contact_groups admins,afterhours ; Notifications get sent to the admins and afterhours groups by default
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
define host{
name generic-switch ; The name of this host template
use generic-host ; Inherit default values from the generic-host template
check_period 24x7 ; By default, switches are monitored round the clock
check_interval 2 ; Switches are checked every 2 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 5 ; Check each switch 5 times (max)
check_command check-host-alive ; Default command to check if switches are "alive"
notification_period 24x7 ; Send notifications at any time
notification_interval 10 ; Resend notifications every 10 minutes
notification_options d,r ; Only send notifications for specific host states
statusmap_image switch.png
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
define host{
name generic-router ; The name of this host template
use generic-switch ; Inherit default values from the generic-switch template
statusmap_image firewall.png
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
define host{
use generic-router
host_name pfsense
address 10.0.0.1
parents bsh_office
host_groups Routers and Switches
}
Code: Select all
define host{
name 24x7-server ; The name of this host template
use generic-host ; Inherit default values from the generic-host template
notification_period 24x7 ; Send notification out at any time - day or night
notification_interval 10 ; Resend notifications every 10 minutes
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
define host{
use 24x7-server
host_name 098
address <redacted>.98
parents pfsense
statusmap_image remoteserver.png
host_groups External Servers
}
define host{
use 24x7-server
host_name 099
address <redacted>.99
parents pfsense
statusmap_image remoteserver.png
host_groups External Servers
}
Re: Parent/Child Relationships head scratcher
Your configs look good. When a host is down or unreachable, Nagios stops doing active service checks against that host in order to reduce unnecessary checking and alerts. The hosts should not have notified either, though, if their parent was down. I presume that all the notifications you received concerned unreachable hosts behind a down parent?
Re: Parent/Child Relationships head scratcher
For the purposes of this discussion, yes, it's just the ones behind the router. I did find some custom entries monitoring our ESXi server complaining, but it turned out that when I created the commands for those checks, I accidentally told them to use $HOSTNAME$ instead of $HOSTADDRESS$. Since our router serves up DNS via DHCP, name resolution failed when it died. I've fixed those so they point to the IP address instead.
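A minimal sketch of that kind of fix, for anyone hitting the same DNS dependency. The command name `check_esxi_ping` and the thresholds are hypothetical and not taken from the actual setup; `$USER1$`, `$HOSTNAME$`, and `$HOSTADDRESS$` are standard Nagios macros, and `check_ping` is the stock plugin.

```
# Hypothetical command definition, for illustration only.
# Passing $HOSTNAME$ makes the plugin resolve the name through DNS on every check:
#   command_line $USER1$/check_ping -H $HOSTNAME$ -w 100.0,20% -c 500.0,60% -p 5
# Passing $HOSTADDRESS$ uses the value of the host's "address" directive instead,
# so the check does not depend on the DNS server being reachable.
define command{
command_name check_esxi_ping
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 100.0,20% -c 500.0,60% -p 5
}
```

This only helps if the host's `address` directive holds an IP address rather than a name.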
Now while preparing additional data for this post, I just discovered something. I was looking at the log file navigation in Nagios, specifically pertaining to the pfsense router, and I discovered that while pfsense has 5 tries before it sends an email (per its max_check_attempts flag), the logs only show it going to the second try. Could it be that the others are not getting converted to unreachable because the pfsense router checks don't make it to 5, thus not converting it to a hard down?
[05-23-2013 14:33:07] HOST ALERT: pfsense;DOWN;SOFT;2;CRITICAL - Host Unreachable (10.0.0.1)
[05-23-2013 14:29:21] HOST ALERT: pfsense;DOWN;SOFT;1;CRITICAL - Host Unreachable (10.0.0.1)
Re: Parent/Child Relationships head scratcher
mcampbell wrote: Could it be that the others are not getting converted to unreachable because the pfsense router checks don't make it to 5, thus not converting it to a hard down?
This could most definitely be it. Did the pfsense router ever go fully down, or did it recover?
Re: Parent/Child Relationships head scratcher
Over the course of that day, the router went down several times. There were some recoveries, but at one point it was hard down for some time. Each time, the pfsense OS kernel would panic due to some flaky hardware (long story), but when it started panicking only minutes after it rebooted, we had to bring up an entirely new one, during which it was down for a good half hour or more.
Re: Parent/Child Relationships head scratcher
So I guess the next question really is, why is Nagios only checking it twice, instead of all 5 times that it's supposed to?