
Server issues when multiple hosts were down

Posted: Mon May 18, 2015 10:54 am
by BanditBBS
Ok, let me describe what was going on here and what I was seeing out of the Nagios server over this weekend.
  • Currently I have ~1000 hosts (avg 9 down) and ~16000 services (avg 800 issues) being monitored by my XI 2014r2.6 server.
  • Average load is normally 1.5-3.0 and ~500 total processes.
  • This weekend we had major work being performed in one of our datacenters that caused us to down ~300 hosts and ~4800 additional services.
  • I scheduled downtimes for all the services and hosts (thank god for scripting!).
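For anyone wondering what that kind of scripting can look like, here is a minimal sketch, not the actual script used here. It builds Nagios external-command lines (SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME) for a list of hosts. The function name, host names, author, and comment are made up for illustration, and the command-file path in the usage note is the usual default, which may differ on your install.

```python
import time


def downtime_commands(hosts, start, duration_s, author, comment, now=None):
    """Build external-command lines that put each host, and all of its
    services, into fixed scheduled downtime.

    start is a Unix timestamp; duration_s is the downtime length in seconds.
    """
    now = int(time.time()) if now is None else int(now)
    end = start + duration_s
    lines = []
    for host in hosts:
        # One command for the host itself, one for every service on it.
        for cmd in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            # Fields: host;start;end;fixed;trigger_id;duration;author;comment
            lines.append(
                f"[{now}] {cmd};{host};{start};{end};1;0;{duration_s};{author};{comment}"
            )
    return lines


if __name__ == "__main__":
    # Hypothetical hosts and a 4-hour window starting one hour from now.
    begin = int(time.time()) + 3600
    for line in downtime_commands(["db01", "db02"], begin, 4 * 3600,
                                  "nagiosadmin", "Datacenter maintenance"):
        print(line)
```

Each printed line would then be appended to the Nagios command file (commonly /usr/local/nagios/var/rw/nagios.cmd, an assumption here) for Nagios to pick up.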
Here are the load and total process charts for the past 48 hours:
localhost-current_load.jpg
localhost-total_processes.jpg
You can tell when the maintenance was by the number of processes, and you can see the actual load on the server was not affected. The process count was running high because we give checks 660 seconds to finish, and every check_oracle_health check attempted during the outage took that full time to time out. If you browsed Nagios you could see all checks were still being performed, notifications were being sent, and everything else was working properly. The issue I was having only affected administrators; check out my dashboard:
nagios_status.JPG
I wasn't getting any status back from the checking script, apparently. I do have mysql and ndo2db both offloaded onto another server. Both my XI server and the offloaded server are in the same VM cluster in a DC that was not affected by the maintenance. I tried bouncing services a few times and it'd work for maybe 1 minute and then go back to the same. As soon as the outage was over everything resolved itself and I didn't need to do anything. Maybe a server came up that it relies on somehow?

So, I have 2 questions out of this mess...

1. The check that runs and validates server performance and stuff: can you think of any reason it wouldn't be working properly during this mayhem, or anything the script may rely on when the items are offloaded like I have them?
2. Is there any setting I can make that automatically makes services dependent upon their hosts? I'd love to set that up so checks are not performed while the host is down. I know that isn't default behavior, but I don't want to have to create 1000+ dependency configs.
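On #2, one way to avoid writing the dependency configs by hand is to generate them. The sketch below is only an illustration under stated assumptions: it assumes every host has a service named "PING" (a hypothetical name) to act as the master, and that you can supply a mapping of hosts to their service descriptions. Note that servicedependency objects in Core are service-to-service, so this makes services depend on a ping-style service rather than on the host state itself.

```python
def service_dependencies(services_by_host, master="PING"):
    """Return one servicedependency definition per dependent service.

    services_by_host maps a host name to a list of service descriptions.
    master is the service every other service on that host depends on
    (the name "PING" is an assumption for this sketch).
    """
    blocks = []
    for host, services in sorted(services_by_host.items()):
        for svc in services:
            if svc == master:
                continue  # the master service does not depend on itself
            blocks.append(
                "define servicedependency {\n"
                f"    host_name                      {host}\n"
                f"    service_description            {master}\n"
                f"    dependent_host_name            {host}\n"
                f"    dependent_service_description  {svc}\n"
                # Skip checks and notifications while the master is
                # CRITICAL or UNKNOWN.
                "    execution_failure_criteria     c,u\n"
                "    notification_failure_criteria  c,u\n"
                "}\n"
            )
    return blocks


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from your configs.
    inventory = {"web01": ["PING", "HTTP", "Disk /"], "db01": ["PING", "Oracle"]}
    print("\n".join(service_dependencies(inventory)))
```

The output could be written to a .cfg file and imported, but with ~16000 services that is a lot of objects, so test on a small set first.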

Re: Server issues when multiple hosts were down

Posted: Mon May 18, 2015 5:15 pm
by tmcdonald
I'll address #1, but basically that is all handled by cron. Can you send over the logs under /usr/local/nagiosxi/var/ for that time? Specifically the sysstat.log-xxxxxx and possibly cmdsubsys.log-xxxxxx files.

Re: Server issues when multiple hosts were down

Posted: Mon May 18, 2015 9:32 pm
by BanditBBS
tmcdonald wrote:I'll address #1, but basically that is all handled by cron. Can you send over the logs under /usr/local/nagiosxi/var/ for that time? Specifically the sysstat.log-xxxxxx and possibly cmdsubsys.log-xxxxxx files.
Does #2 scare you?

Logs sent and there are errors in the one that you'll see.

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 3:12 am
by Fred Kroeger
Would definitely vote for #2 to be added as a Default feature.
It makes no sense to me to schedule service checks if the Host is Down?
Service Dependencies are not the answer to this - besides it is still broken if you implement it with Hostgroups instead of individual hosts.

Regards... Fred

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 7:55 am
by snapon_admin
Also vote for number 2, I even think I have a feature request to add an option to disable service checks during downtime. We had a big maintenance event this weekend as well that borked up some of the checks because so many service checks were critical. Just checked and yep, here's my feature request for an option to disable checks while a host is in downtime which was closed because it would be difficult for this feature to get traction: http://tracker.nagios.com/view.php?id=584

Just from looking for my feature request I also found another one very similar to it, also closed: http://tracker.nagios.com/view.php?id=658

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 8:07 am
by BanditBBS
Yeah, I've had this discussion on the forums with them before and they swear there are reasons to keep checking services if a host is down. Someone even tried explaining an example....I knew I wasn't going to get anywhere so I just dropped it :(

Sure does sound like we have a few people at least that would love this feature. It'd have to be an option added to Core though to either stop checking during downtime or make services dependent on host automatically. They'd have to make it an option and not default though as they'd be changing behavior that has been in place for 15 years, but god would I kill for the option!

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 8:25 am
by snapon_admin
Right, exactly. My suggestion was for an option rather than default behavior for the exact reason you stated: changing something that's been the way it is for as long as it has would be a bad idea. And the main reason given for continuing checks during downtime is that stopping them would screw with reports, since there's an option to include/exclude downtime. I do agree though, looks like there's at least some interest in this as a feature and I would also love it if this option ever made it into Nagios.

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 8:34 am
by WillemDH
Seems very reasonable to have the option to not do service checks while the host is down. I would think most prefer this option above the Nagios server going down.
Please make it a global option. :)

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 8:43 am
by BanditBBS
WillemDH wrote:I would think most prefer this option above the Nagios server going down.
Excuse me for a moment, I'll be in the corner doing this: HAHAHAHAHAHAAHAHAHAHAHAHAHAHAHAAHAHAHAHAHA

Couldn't have said it better myself :ugeek:

Re: Server issues when multiple hosts were down

Posted: Tue May 19, 2015 9:15 am
by vAJ
I too had a major outage (thanks MS Hyper-V) and my Nagios instance took a major dump.

Took several restarts and clearing state to get back to good.