Hey all,
I just had a network issue stopping my Nagios from contacting my 2nd datacenter. This included about 450 hosts and 5000 services. The process count on the Nagios server jumped to 8000+ as it was rescheduling checks of the hosts and all the services. Is there a setting somewhere to tell it not to check services if a host is down?
Network broke and killed my nagios
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
- snapon_admin
- Posts: 952
- Joined: Mon Jun 10, 2013 10:39 am
- Location: Kenosha, WI
- Contact:
Re: Network broke and killed my nagios
As someone else who has also experienced this issue I would also like something like this. I have never found any such setting, but it's also the reason for my "disable checks during scheduled downtime" feature request. When we have large downtimes and checks are being rescheduled at an alarming rate it has a tendency to seize my Nagios server.
Re: Network broke and killed my nagios
Well, I now feel your pain. Basically my monitoring of the DC the Nagios server is in was useless, as the server was crippled. Hopefully they have some workaround they can suggest... but I fear they don't.
Re: Network broke and killed my nagios
Isn't this just a dependency? Or am I missing something?
http://nagios.sourceforge.net/docs/3_0/ ... ncies.html
Service dependencies can be used to cause service check execution and service notifications to be suppressed under different circumstances (OK, WARNING, UNKNOWN, and/or CRITICAL states)
Former Nagios employee
- snapon_admin
Re: Network broke and killed my nagios
That suppresses notifications but all of the checks still run. That amount of checks changing state all at once is what causes the Nagios server to have a heart attack.
Re: Network broke and killed my nagios
It should be suppressing checks as well. From the documentation:
http://nagios.sourceforge.net/docs/3_0/ ... dependency
execution_failure_criteria: This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked.
If it is not actually suppressing the checks then this is a bug we need to file.
Re: Network broke and killed my nagios
What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way Nagios works anyway, with services dependent on the host they are on? Then we could just say suppress execution if the host is down... or whatever.
But to use service and host dependencies, that means for every host I have, I have to create a separate config of dependencies to stop these from happening, right? That sure isn't going to be fun.
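The per-host dependency definitions don't necessarily have to be written by hand. Here is a minimal sketch that generates them with Python. The host names, the service names, and the choice of a "PING" service as the master on each host are all hypothetical and not from this thread; the directive names are the standard servicedependency ones. It assumes every host already has the master service defined.

```python
# Sketch only: generates one servicedependency block per dependent service.
# MASTER_SERVICE and the example inventory below are assumptions for illustration.
MASTER_SERVICE = "PING"

TEMPLATE = """define servicedependency {{
    host_name                       {host}
    service_description             {master}
    dependent_host_name             {host}
    dependent_service_description   {dependent}
    execution_failure_criteria      {criteria}
    notification_failure_criteria   {criteria}
}}
"""


def render_dependencies(inventory, master=MASTER_SERVICE, criteria="w,u,c"):
    """Return config text making each service depend on `master` on the same host.

    `inventory` maps a host name to the list of service descriptions on it.
    """
    blocks = []
    for host in sorted(inventory):
        for service in inventory[host]:
            if service == master:
                continue  # the master service does not depend on itself
            blocks.append(
                TEMPLATE.format(
                    host=host, master=master, dependent=service, criteria=criteria
                )
            )
    return "\n".join(blocks)


if __name__ == "__main__":
    example = {
        "web01": ["PING", "HTTP", "Disk Space"],
        "db01": ["PING", "MySQL"],
    }
    print(render_dependencies(example))
```

The output could be written to a file under a directory that nagios.cfg already includes, then checked with the usual config verification before a restart. Whether the checks really get suppressed is exactly the open question in this thread, so treat this as a way to test that, not as a confirmed fix.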
Re: Network broke and killed my nagios
BanditBBS wrote: What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way nagios works anyway, services dependent on the host they are on.
Not necessarily. Not every host check is going to be a ping. It is entirely possible, depending on the host check (though maybe not likely), that a host is down but the services are up. Or a firewall might start blocking pings (host is down) but allow HTTP (Web service is up).
BanditBBS wrote: That sure isnt going to be fun.
Sure won't!
Unfortunately I can't really think of a better way to do this. Might be a cool idea to add a feature in Core that allows you to specify a bool like "depends_on_host" or something, but that would be a bit of an overhaul, I'd imagine.
Re: Network broke and killed my nagios
BanditBBS wrote: What Trevor pointed out can stop execution as well.
That is correct:
Servicedependency - execution failure criteria
This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked. Valid options are a combination of one or more of the following (multiple options are separated with commas):
o = fail on an OK state,
w = fail on a WARNING state,
u = fail on an UNKNOWN state,
c = fail on a CRITICAL state, and
p = fail on a pending state (e.g. the service has not yet been checked).
If you specify n (none) as an option, the execution dependency will never fail and checks of the dependent service will always be actively checked (if other conditions allow for it to be).
Example: If you specify o,c,u in this field, the dependent service will not be actively checked if the master service is in either an OK, a CRITICAL, or an UNKNOWN state.
Parameter name: execution_failure_criteria
But as BanditBBS said, it is dependent on other services, not on the host, so it's not a perfect solution. Some people have tried to use event handlers instead: if the host is DOWN or UNREACHABLE, the handler sends Nagios an "external command" to disable all active service checks for that host; when the host is UP again, it sends the external command to re-enable them. This creates some "latency" and other issues, so I guess this is not a great solution either.
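For anyone wanting to try the event-handler workaround mentioned in this post, here is a rough sketch in Python. It uses the DISABLE_HOST_SVC_CHECKS and ENABLE_HOST_SVC_CHECKS external commands. The command file path is an assumption (use whatever `command_file` points to in your nagios.cfg), and the script name and argument order are hypothetical. It assumes the handler is wired up as a host event handler that passes $HOSTNAME$ and $HOSTSTATE$ as arguments.

```python
import sys
import time

# Assumed path; the real one is the `command_file` setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host_name, host_state, now=None):
    """Return the external command line for this host state, or None."""
    timestamp = int(time.time() if now is None else now)
    if host_state in ("DOWN", "UNREACHABLE"):
        command = "DISABLE_HOST_SVC_CHECKS"
    elif host_state == "UP":
        command = "ENABLE_HOST_SVC_CHECKS"
    else:
        return None  # unknown state: do nothing
    # External command format: [epoch] COMMAND;arguments
    return f"[{timestamp}] {command};{host_name}\n"


def send_command(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command file."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


# Expected invocation (hypothetical): handler.py <host_name> <host_state>
if __name__ == "__main__" and len(sys.argv) == 3:
    line = build_command(sys.argv[1], sys.argv[2])
    if line is not None:
        send_command(line)
```

Note that this only kicks in after the host check itself has run and the handler has fired, so service checks already scheduled in the meantime will probably still execute. That is presumably part of the "latency" problem described above.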
There is a bug report on our bug tracker from 2012:
http://tracker.nagios.com/view.php?id=297
I am not sure if it's been moved to the Core tracker.
Re: Network broke and killed my nagios
http://tracker.nagios.org/view.php?id=666
God I really hope that is something that could be easily added. It really is detrimental in large installs for big downtimes and/or network outages.
snapon, go +1 my request
Ludmil - you mean this one? http://old.nagios.org/developerinfo/ext ... mand_id=34 and can you go into more detail on:
This creates some "latency" and other issues, so I guess this is not a great solution either.
Last edited by BanditBBS on Tue Jan 27, 2015 4:21 pm, edited 1 time in total.