Network broke and killed my nagios
Posted: Tue Jan 27, 2015 2:39 pm
by BanditBBS
Hey all,
I just had a network issue stopping my nagios from contacting my 2nd datacenter. This included about 450 hosts and 5000 services. Process count on nagios jumped to 8000+ as it was rescheduling checks of the hosts and all the services. Is there a setting somewhere to tell it not to check services if a host is down?
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 2:56 pm
by snapon_admin
As someone else who has also experienced this issue I would also like something like this. I have never found any such setting, but it's also the reason for my "disable checks during scheduled downtime" feature request. When we have large downtimes and checks are being rescheduled at an alarming rate it has a tendency to seize my Nagios server.
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 2:58 pm
by BanditBBS
Well, I now feel your pain. Basically my monitoring of the DC the nagios server is in was useless as the server was crippled. Hopefully they have some workaround they can suggest... but I fear they don't.

Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:25 pm
by tmcdonald
Isn't this just a dependency? Or am I missing something?
http://nagios.sourceforge.net/docs/3_0/ ... ncies.html
Service dependencies can be used to cause service check execution and service notifications to be suppressed under different circumstances (OK, WARNING, UNKNOWN, and/or CRITICAL states)
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:27 pm
by snapon_admin
That suppresses notifications but all of the checks still run. That many checks changing state all at once is what causes the Nagios server to have a heart attack.
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:32 pm
by tmcdonald
It should be suppressing checks as well. From the documentation:
execution_failure_criteria: This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked.
http://nagios.sourceforge.net/docs/3_0/ ... dependency
If it is not actually suppressing the checks then this is a bug we need to file.
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:34 pm
by BanditBBS
What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way nagios works anyway, with services dependent on the host they are on? Then we could just say suppress execution if host is down... or whatever.
But to use service and host dependencies, that means for every host I have, I have to create a separate config of dependencies to stop these from happening, right? That sure isn't going to be fun.
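The per-host dependency config described above does not have to be written by hand. A minimal sketch in Python of generating it, assuming every host has a service that can act as the master (here a hypothetical service named "PING") and that the host-to-services inventory is available as a dict. The function name, the inventory, and the default criteria are illustrative assumptions; only the `servicedependency` directive names come from the Nagios object documentation.

```python
from typing import Dict, List

# One servicedependency object per (host, dependent service) pair.
TEMPLATE = """define servicedependency {{
    host_name                       {host}
    service_description             {master}
    dependent_host_name             {host}
    dependent_service_description   {dependent}
    execution_failure_criteria      {criteria}
}}
"""


def render_dependencies(
    services_by_host: Dict[str, List[str]],
    master: str = "PING",      # assumed name of the master service on each host
    criteria: str = "c,u",     # assumed: stop checks when master is CRITICAL/UNKNOWN
) -> str:
    """Return servicedependency definitions making every other service
    on a host depend on that host's master service."""
    blocks = []
    for host in sorted(services_by_host):
        for service in services_by_host[host]:
            if service == master:
                continue  # the master does not depend on itself
            blocks.append(
                TEMPLATE.format(
                    host=host, master=master, dependent=service, criteria=criteria
                )
            )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from your own source of truth.
    inventory = {"web01": ["PING", "HTTP", "SSH"], "db01": ["PING", "MySQL"]}
    print(render_dependencies(inventory))
```

The output could be redirected into a file under the Nagios object config directory and regenerated whenever the inventory changes. It still ties services to another service rather than to the host itself, so it only approximates what is being asked for here.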
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:48 pm
by tmcdonald
BanditBBS wrote:What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way nagios works anyway, services dependent on the host they are on.
Not necessarily. Not every host check is going to be a ping. It is entirely possible depending on the host check (though maybe not likely) that a host is down but the services are up. Or a firewall might start blocking pings (host is down) but allow HTTP (Web service is up).
BanditBBS wrote:That sure isn't going to be fun.
Sure won't!
Unfortunately, I can't really think of a better way to do this. It might be a cool idea to add a feature in Core that lets you specify a bool like "depends_on_host" or something, but I'd imagine that would be a bit of an overhaul.
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:51 pm
by lmiltchev
What Trevor pointed out can stop execution as well.
That is correct:
Servicedependency - execution failure criteria
This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked. Valid options are a combination of one or more of the following (multiple options are separated with commas):
o = fail on an OK state,
w = fail on a WARNING state,
u = fail on an UNKNOWN state,
c = fail on a CRITICAL state, and
p = fail on a pending state (e.g. the service has not yet been checked).
If you specify n (none) as an option, the execution dependency will never fail and checks of the dependent service will always be actively checked (if other conditions allow for it to be).
Example: If you specify o,c,u in this field, the dependent service will not be actively checked if the master service is in either an OK, a CRITICAL, or an UNKNOWN state.
Parameter name: execution_failure_criteria
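To make the option letters concrete, here is a small Python model of the rule quoted above. It is only an illustration of the documented behavior, not Nagios code; the function name and the state table are made up for the example.

```python
# Map each master-service state to its option letter from the documentation.
STATE_FLAGS = {
    "OK": "o",
    "WARNING": "w",
    "UNKNOWN": "u",
    "CRITICAL": "c",
    "PENDING": "p",
}


def dependency_fails(criteria: str, master_state: str) -> bool:
    """Return True when the dependent service should NOT be actively checked."""
    options = {opt.strip().lower() for opt in criteria.split(",") if opt.strip()}
    if "n" in options:
        # "n" (none): the execution dependency never fails.
        return False
    return STATE_FLAGS[master_state.upper()] in options
```

With the documentation's own example, `dependency_fails("o,c,u", "CRITICAL")` is true (the check is suppressed), while `dependency_fails("o,c,u", "WARNING")` is false.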
But as BanditBBS said, it is dependent on other services, not on the host. It's not a perfect solution. Some people have tried to use event handlers instead: if the host is DOWN or UNREACHABLE, the handler sends Nagios an "external command" to disable all active service checks for that host. If the status of the host is UP, it sends the external command to enable all service checks for that particular host. This creates some "latency" and other issues, so I guess this is not a great solution either.
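A rough sketch of what such a host event handler might look like in Python, for anyone who wants to experiment. The assumptions: the handler's command definition passes `$HOSTSTATE$`, `$HOSTSTATETYPE$` and `$HOSTNAME$` as arguments, the command file lives at the path below (it varies by install), and only HARD states are acted on to avoid flapping on soft retries (a design choice of this sketch). `DISABLE_HOST_SVC_CHECKS` and `ENABLE_HOST_SVC_CHECKS` are the standard external commands; everything else is illustrative.

```python
import sys
import time
from typing import Optional

# Assumed location of the external command pipe; check command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host: str, state: str, state_type: str,
                  now: Optional[float] = None) -> Optional[str]:
    """Return the external command line for this host state, or None."""
    if state_type != "HARD":
        return None  # ignore soft states while Nagios is still retrying
    if state in ("DOWN", "UNREACHABLE"):
        name = "DISABLE_HOST_SVC_CHECKS"
    elif state == "UP":
        name = "ENABLE_HOST_SVC_CHECKS"
    else:
        return None
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] {name};{host}\n"


def main(state: str, state_type: str, host: str) -> int:
    line = build_command(host, state, state_type)
    if line is not None:
        # The command file is a named pipe that Nagios reads from.
        with open(COMMAND_FILE, "w") as pipe:
            pipe.write(line)
    return 0


# Expected invocation: handler.py $HOSTSTATE$ $HOSTSTATETYPE$ $HOSTNAME$
# Any other argument count is ignored.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(*sys.argv[1:]))
```

Note that the handler only fires after the host check itself has reached a hard state, so service checks already queued before then will still run. That is presumably part of the "latency" mentioned above.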
There is a bug report on our bug tracker from 2012:
http://tracker.nagios.com/view.php?id=297
I am not sure if it's been moved to the Core tracker.
Re: Network broke and killed my nagios
Posted: Tue Jan 27, 2015 3:57 pm
by BanditBBS
http://tracker.nagios.org/view.php?id=666
God I really hope that is something that could be easily added. It really is detrimental in large installs for big downtimes and/or network outages.
snapon, go +1 my request
Ludmil - you mean this one?
http://old.nagios.org/developerinfo/ext ... mand_id=34
And can you go into more detail on:
This creates some "latency" and other issues, so I guess this is not a great solution either.