Network broke and killed my nagios

BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Network broke and killed my nagios

Post by BanditBBS »

Hey all,

I just had a network issue that stopped my Nagios server from contacting my second datacenter. This affected about 450 hosts and 5000 services. The process count on the Nagios server jumped to 8000+ as it rescheduled checks of the hosts and all their services. Is there a setting somewhere to tell it not to check services if a host is down?
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: Network broke and killed my nagios

Post by snapon_admin »

As someone who has also experienced this issue, I would like something like this too. I have never found any such setting, and it's also the reason for my "disable checks during scheduled downtime" feature request. When we have large downtimes and checks are being rescheduled at an alarming rate, it tends to seize up my Nagios server.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Network broke and killed my nagios

Post by BanditBBS »

Well, I now feel your pain. Basically, my monitoring of the DC the Nagios server is in was useless because the server was crippled. Hopefully they have some workaround they can suggest....but I fear they don't :(
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Network broke and killed my nagios

Post by tmcdonald »

Isn't this just a dependency? Or am I missing something?

http://nagios.sourceforge.net/docs/3_0/ ... ncies.html
Service dependencies can be used to cause service check execution and service notifications to be suppressed under different circumstances (OK, WARNING, UNKNOWN, and/or CRITICAL states)
Former Nagios employee
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: Network broke and killed my nagios

Post by snapon_admin »

That suppresses notifications, but all of the checks still run. That many checks changing state all at once is what causes the Nagios server to have a heart attack.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Network broke and killed my nagios

Post by tmcdonald »

It should be suppressing checks as well. From the documentation:
execution_failure_criteria: This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked.
http://nagios.sourceforge.net/docs/3_0/ ... dependency

If it is not actually suppressing the checks then this is a bug we need to file.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Network broke and killed my nagios

Post by BanditBBS »

What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way Nagios works anyway, with services dependent on the host they are on? Then we could just say suppress execution if the host is down...or whatever.

But to use service and host dependencies, that means for every host I have, I have to create a separate config of dependencies to stop these from happening, right? That sure isn't going to be fun.
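
One way to take some of the tedium out of per-host dependencies is to generate the definitions with a script. The following is only a minimal sketch under assumptions not stated in this thread: each host is assumed to have a service named PING that acts as the master service, the host and service names are hypothetical, and `render_dependencies` is an illustrative helper rather than anything shipped with Nagios. The directives it emits (`host_name`, `service_description`, `dependent_host_name`, `dependent_service_description`, `execution_failure_criteria`, `notification_failure_criteria`) are the standard servicedependency ones.

```python
from typing import Dict, List


def render_dependencies(services_by_host: Dict[str, List[str]],
                        master: str = "PING") -> str:
    """Render one servicedependency definition per host.

    Every service on a host other than `master` is made dependent on that
    host's `master` service. Hosts that lack the master service, or have no
    other services, are skipped.
    """
    blocks = []
    for host in sorted(services_by_host):
        services = services_by_host[host]
        dependents = [name for name in services if name != master]
        if master not in services or not dependents:
            continue
        blocks.append(
            "define servicedependency {\n"
            f"    host_name                      {host}\n"
            f"    service_description            {master}\n"
            f"    dependent_host_name            {host}\n"
            f"    dependent_service_description  {','.join(dependents)}\n"
            # Stop active checks and notifications of the dependents while
            # the master service is CRITICAL or UNKNOWN.
            "    execution_failure_criteria     c,u\n"
            "    notification_failure_criteria  c,u\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from your own host list.
    inventory = {
        "dc2-web01": ["PING", "HTTP", "Disk Usage"],
        "dc2-db01": ["PING", "MySQL"],
    }
    print(render_dependencies(inventory))
```

The output could be written to a config file that Nagios includes, then checked with the usual config verification before a restart. It still means one definition per host, but nobody has to write them by hand.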
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Network broke and killed my nagios

Post by tmcdonald »

BanditBBS wrote: What Trevor pointed out can stop execution as well. However, shouldn't that be the intuitive way Nagios works anyway, with services dependent on the host they are on?
Not necessarily. Not every host check is going to be a ping. It is entirely possible depending on the host check (though maybe not likely) that a host is down but the services are up. Or a firewall might start blocking pings (host is down) but allow HTTP (Web service is up).
BanditBBS wrote: That sure isn't going to be fun.
Sure won't!

Unfortunately, I can't really think of a better way to do this. It might be a cool idea to add a feature in Core that allows you to specify a bool like "depends_on_host" or something, but that would be a bit of an overhaul, I'd imagine.
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Network broke and killed my nagios

Post by lmiltchev »

BanditBBS wrote: What Trevor pointed out can stop execution as well.
That is correct:
Servicedependency - execution failure criteria

This directive is used to specify the criteria that determine when the dependent service should not be actively checked. If the master service is in one of the failure states we specify, the dependent service will not be actively checked. Valid options are a combination of one or more of the following (multiple options are separated with commas):
o = fail on an OK state,
w = fail on a WARNING state,
u = fail on an UNKNOWN state,
c = fail on a CRITICAL state, and
p = fail on a pending state (e.g. the service has not yet been checked).
If you specify n (none) as an option, the execution dependency will never fail and checks of the dependent service will always be actively checked (if other conditions allow for it to be).

Example: If you specify o,c,u in this field, the dependent service will not be actively checked if the master service is in either an OK, a CRITICAL, or an UNKNOWN state.

Parameter name: execution_failure_criteria
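
To make the option letters concrete, here is a small model of the rule as the quoted documentation describes it. It is a sketch for illustration only, not Nagios's own code: `check_suppressed` and `STATE_BY_OPTION` are hypothetical names, and the function only mirrors the documented behaviour (a listed master state suppresses the active check, and `n` means never suppress).

```python
# Maps each documented option letter to the master-service state it refers to.
STATE_BY_OPTION = {
    "o": "OK",
    "w": "WARNING",
    "u": "UNKNOWN",
    "c": "CRITICAL",
    "p": "PENDING",
}


def check_suppressed(criteria: str, master_state: str) -> bool:
    """Return True if the dependent service would not be actively checked.

    `criteria` is a comma-separated option string such as "c,u".
    An unrecognised option letter raises KeyError.
    """
    options = [opt.strip() for opt in criteria.split(",") if opt.strip()]
    if "n" in options:
        # "n" (none): the execution dependency never fails.
        return False
    failing_states = {STATE_BY_OPTION[opt] for opt in options}
    return master_state in failing_states
```

With the documentation's own example, `check_suppressed("o,c,u", "CRITICAL")` is true and `check_suppressed("o,c,u", "WARNING")` is false.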
However, as BanditBBS said, a service dependency depends on another service, not on the host, so it's not a perfect solution. Some people have tried to use host event handlers instead: if the host is DOWN or UNREACHABLE, the handler sends Nagios an "external command" to disable all active service checks for that host. When the status of the host is UP again, it sends the external command to enable all service checks for that particular host. This creates some "latency" and other issues, so I guess this is not a great solution either.
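
The event handler approach described above could be sketched roughly as follows. This is a hedged illustration, not a tested handler. It assumes the handler is called with the `$HOSTNAME$` and `$HOSTSTATE$` macros as its two arguments, and that the external command file lives at `/usr/local/nagios/var/rw/nagios.cmd`, which is a common default but may differ on your install. The function names are hypothetical; `DISABLE_HOST_SVC_CHECKS` and `ENABLE_HOST_SVC_CHECKS` are existing external commands that take the host name as their argument.

```python
import sys
import time

# Assumed location of the external command file; adjust for your install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_command(host: str, host_state: str, now: int) -> str:
    """Build the external command line for the given host state."""
    if host_state in ("DOWN", "UNREACHABLE"):
        name = "DISABLE_HOST_SVC_CHECKS"
    elif host_state == "UP":
        name = "ENABLE_HOST_SVC_CHECKS"
    else:
        raise ValueError(f"unexpected host state: {host_state!r}")
    # External command format: [unix_timestamp] COMMAND;arguments
    return f"[{now}] {name};{host}\n"


def submit(command: str, command_file: str = COMMAND_FILE) -> None:
    """Write one command line to the Nagios external command file."""
    with open(command_file, "w") as handle:
        handle.write(command)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Expected invocation: handler.py $HOSTNAME$ $HOSTSTATE$
    submit(build_command(sys.argv[1], sys.argv[2], int(time.time())))
```

Host checks themselves stay enabled, so the host returning to UP is what triggers the handler to re-enable the service checks. The command only takes effect once Nagios reads it from the command file, which may be part of the latency mentioned above.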
There is a bug report on our bug tracker from 2012:

http://tracker.nagios.com/view.php?id=297

I am not sure if it's been moved to the Core tracker.
Be sure to check out our Knowledgebase for helpful articles and solutions!
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Network broke and killed my nagios

Post by BanditBBS »

http://tracker.nagios.org/view.php?id=666

God I really hope that is something that could be easily added. It really is detrimental in large installs for big downtimes and/or network outages.

snapon, go +1 my request :)

Ludmil - you mean this one? http://old.nagios.org/developerinfo/ext ... mand_id=34 and can you go into more detail on:
This creates some "latency" and other issues, so I guess this is not a great solution either.
Last edited by BanditBBS on Tue Jan 27, 2015 4:21 pm, edited 1 time in total.