
reduced (single) alert for one location with many devices

Posted: Wed Jul 20, 2011 2:34 pm
by pnewlon
*** How do I know whether to post questions in this forum or in the general forum? As a customer, do I by default use this one?


I have 34 locations (remote, connected via broadband and VPN tunnels). At each location there are approximately 18 devices (+/- a few) with 2-3 services defined for each device. If the broadband connection goes down for a given remote location, I get 'critical' messages for ALL the services Nagios cannot get to. I turned off all host notifications, which reduced the resulting email deluge a bit, and turned re-notification down to once every 8 hours, but it is still a pretty big load of email notices to read on a BlackBerry when 2-3 locations are down for a day or so.

Is there a way to consolidate the devices into a single alert? I thought about using check_cluster but I would get a cluster alert for any single device/service down at a location. When the broadband connection is up, it is nice to only get a specific alert about a single device/service. I have the ability to monitor the router connecting the remote location, though I am not now doing it. If I defined the router (host / service) is there a way to say 'if you can't get to this device, refrain from alerting about all the devices behind it'?

Thanks! Phil

Re: reduced (single) alert for one location with many device

Posted: Thu Jul 21, 2011 12:28 pm
by mguthrie
Moved to Nagios XI Customer Forum....

Re: reduced (single) alert for one location with many device

Posted: Thu Jul 21, 2011 12:33 pm
by mguthrie
What you'll probably want to do is set up host/service dependencies. The following docs are from the Core manual, but they outline the logic fairly well.

http://nagios.sourceforge.net/docs/3_0/ ... ility.html
http://nagios.sourceforge.net/docs/3_0/ ... ncies.html

This might also be a handy tool for you:
http://exchange.nagios.org/directory/Ad ... nt/details

Re: reduced (single) alert for one location with many device

Posted: Fri Jul 22, 2011 6:15 am
by pnewlon
Thank you. I will give it a go on Monday. Phil

Re: reduced (single) alert for one location with many device

Posted: Wed Aug 24, 2011 12:53 pm
by pnewlon
Mark - I've finally gotten back to working on this issue this week.

- I created hosts for the router at each location
- I made the router the parent of each host behind it
- I created host dependencies for each host so that it depended on the router and checks/alerts would be disabled if the host is down
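
The three steps above can be sketched as Nagios object definitions. This is only an illustration, not the poster's actual configuration: the host names are the ones mentioned later in this thread, while the addresses, the `generic-host` template, and the `check-host-alive` command are assumptions borrowed from the stock sample config.

```cfg
# Router for one site (address is a placeholder from the documentation range)
define host {
    use             generic-host        ; assumed template name
    host_name       02622_RTR
    address         192.0.2.1           ; hypothetical
    check_command   check-host-alive    ; assumed to exist in commands.cfg
}

# A device behind the router: the router is declared as its parent
define host {
    use             generic-host        ; assumed template name
    host_name       02622_OB04-DSP
    address         192.0.2.10          ; hypothetical
    parents         02622_RTR
}

# Host dependency: the device depends on the router
define hostdependency {
    host_name                       02622_RTR
    dependent_host_name             02622_OB04-DSP
    execution_failure_criteria      d,u     ; skip dependent host checks while router is DOWN/UNREACHABLE
    notification_failure_criteria   d,u     ; suppress dependent host notifications in the same states
}
```

Note that both the `parents` directive and `hostdependency` act on host checks and host notifications; neither one is a service-level setting.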

Today the router connecting site 2622 went down (critical) and I got alerts from all the devices (services) behind it as they went 'critical' instead of 'unreachable' :( The hosts show green because I am using 'check_dummy' for all hosts but the routers and set 'Ping' up as a service on each host.

Image

Take 02622_OB04-DSP for example, the parent is 02622_RTR

Image

You can also see that it is a 'dependent host'

Image

Depending on host '02622_RTR'

Image

Re: reduced (single) alert for one location with many device

Posted: Thu Aug 25, 2011 10:23 am
by nscott
pnewlon,

I see they say critical. Are you getting notifications for all of them even though the parent is down? The children will show critical/unreachable status, but should not send notifications. Are you receiving notifications for them?

Re: reduced (single) alert for one location with many device

Posted: Thu Aug 25, 2011 10:59 am
by pnewlon
"Today the router connecting site 2622 went down (critical) and I got alerts from all the devices (services) behind it as they went 'critical' instead of 'unreachable' The hosts show green because I am using 'check_dummy' for all hosts but the routers and set 'Ping' up as a service on each host."

Yes, unfortunately I am - not HOST alerts but SERVICE alerts. I thought I had my belt and suspenders on by creating the parent/child relationship as well as the host dependencies. From how I read the docs, the parent/child relationship stops the HOST alerts and the host dependency configuration stops the active service checks on the dependent hosts which should stop the SERVICE alerts.
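
For comparison, a dependency declared at the service level would look roughly like the sketch below. It is a hedged illustration only: it assumes the router also has a 'Ping' service defined, and the dependent service name `HTTP` is hypothetical.

```cfg
# Service dependency: a service on the device depends on the router's Ping service
define servicedependency {
    host_name                       02622_RTR
    service_description             Ping        ; assumes a Ping service exists on the router
    dependent_host_name             02622_OB04-DSP
    dependent_service_description   HTTP        ; hypothetical service name
    execution_failure_criteria      c,u         ; skip dependent checks while Ping is CRITICAL/UNKNOWN
    notification_failure_criteria   c,u         ; suppress dependent notifications in the same states
}
```

One definition like this is needed per dependent service (or per group of services, if lists or hostgroups are used), so with ~18 devices per site it adds up quickly.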

Re: reduced (single) alert for one location with many device

Posted: Thu Aug 25, 2011 11:18 am
by nscott
I think the issue here is that Nagios thinks the hosts are actually still up, so it goes through and tests the services, finds them down, and, since the host is up, sends the service alerts. I would suggest changing the check_command on one host so that it reflects the host's real down status. If that resolves your issue, then that is the cause.
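
A minimal sketch of that suggestion, assuming the stock `check-host-alive` command from the sample commands.cfg is available (the template name and address are placeholders, not taken from the poster's setup):

```cfg
# Replace check_dummy with a real reachability check on one test host
define host {
    use                 generic-host        ; assumed template name
    host_name           02622_OB04-DSP
    address             192.0.2.10          ; hypothetical
    parents             02622_RTR
    check_command       check-host-alive    ; was check_dummy, which always reports UP
    max_check_attempts  3                   ; illustrative value
}
```

With a real host check, the host can actually leave the UP state when the site is cut off, which is what lets Nagios hold back the service notifications for it.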

Re: reduced (single) alert for one location with many device

Posted: Thu Aug 25, 2011 3:51 pm
by pnewlon
It doesn't change the 'status' - it remains OK. However, it does stop checking and does not send alerts, so that is good. I wonder why, since the services are no longer actively checked, the status doesn't go to 'unknown' or some other relevant status. OK is not relevant when the status of the service is really unknown due to not being checked.

Re: reduced (single) alert for one location with many device

Posted: Fri Aug 26, 2011 9:35 am
by nscott
That's a good point, I'll look into that. But are your problems resolved?