Server issues when multiple hosts were down

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Server issues when multiple hosts were down

Post by BanditBBS »

Ok, let me describe what was going on here and what I was seeing out of the Nagios server over this weekend.
  • Currently I have ~1000 hosts (avg 9 down) and ~16000 services (avg 800 with issues) being monitored by my XI 2014r2.6 server.
  • Average load is normally 1.5-3.0 and ~500 total processes.
  • This weekend we had major work being performed in one of our datacenters that caused us to down ~300 hosts and ~4800 additional services.
  • I scheduled downtimes for all the services and hosts (thank god for scripting!).
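
For anyone wanting to script downtimes like the ones mentioned above, here is a minimal sketch in Python. It is not the script actually used here. It only builds the Nagios external command lines (SCHEDULE_HOST_DOWNTIME and SCHEDULE_HOST_SVC_DOWNTIME) and prints them; the host names, time window, author, and comment are made up for illustration. Writing the lines to the Nagios command file (commonly /usr/local/nagios/var/rw/nagios.cmd, but check command_file in your nagios.cfg) is left to the caller.

```python
import time


def downtime_commands(hosts, start, end, author, comment, now=None):
    """Build external command lines that put each host and all of its
    services into fixed downtime between start and end (epoch seconds)."""
    now = int(time.time()) if now is None else int(now)
    duration = end - start
    lines = []
    for host in hosts:
        for cmd in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            # Arguments: host;start;end;fixed;trigger_id;duration;author;comment
            # fixed=1 (fixed window), trigger_id=0 (not triggered by another downtime)
            lines.append(
                f"[{now}] {cmd};{host};{start};{end};1;0;{duration};{author};{comment}"
            )
    return lines


if __name__ == "__main__":
    # Hypothetical hosts and an 8-hour maintenance window.
    for line in downtime_commands(
        ["db01", "db02"], 1700000000, 1700028800, "nagiosadmin", "DC maintenance"
    ):
        print(line)
```

Each host gets two lines, one for the host itself and one covering all of its services.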
Here are the load and total process charts for the past 48 hours:
localhost-current_load.jpg
localhost-total_processes.jpg
You can tell when the maintenance took place from the number of processes, and you can see that the actual load on the server was not affected. The process count ran high because we give checks 660 seconds to finish, and every check_oracle_health check attempted during the outage took that full time to time out. If you browsed Nagios you could see that all checks were still being performed, notifications were being sent, and everything else was working properly. The issue I was having only affected administrators. Check out my dashboard:
nagios_status.JPG
I wasn't getting any status back from the checking script, apparently. I do have mysql and ndo2db both offloaded onto another server. Both my XI server and the offloaded server are in the same VM cluster in a DC that was not affected by the maintenance. I tried bouncing services a few times and it'd work for maybe 1 minute and then go back to the same state. As soon as the outage was over everything resolved itself and I didn't need to do anything. Maybe a server came up that it relies on somehow?

So, I have 2 questions out of this mess...

1. The check that runs and validates server performance and stuff: can you think of any reason it wouldn't be working properly during this mayhem, or anything the script may rely on when the items are offloaded like I have them?
2. Is there any setting I can make that automatically makes services dependent upon their hosts? I'd love to set that up so checks are not performed while the host is down. I know that isn't default behavior, but I don't want to have to create 1000+ dependency configs.
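
To illustrate what generating those dependency configs could look like, here is a rough sketch in Python. It is a workaround, not a built-in setting, and it rests on assumptions: every host has a "master" service (called "Ping" here, a made-up name) that goes CRITICAL when the host is down, and the host-to-services inventory is supplied by you. It emits one standard servicedependency definition per dependent service; how the output gets imported into XI is left open.

```python
def service_dependencies(services_by_host, master="Ping"):
    """Return Nagios servicedependency definitions that make every service
    on a host depend on that host's `master` service."""
    blocks = []
    for host in sorted(services_by_host):
        for service in services_by_host[host]:
            if service == master:
                continue  # the master service must not depend on itself
            fields = [
                ("host_name", host),
                ("service_description", master),
                ("dependent_host_name", host),
                ("dependent_service_description", service),
                # Skip active checks and notifications for the dependent
                # service while the master is CRITICAL or UNKNOWN.
                ("execution_failure_criteria", "c,u"),
                ("notification_failure_criteria", "c,u"),
            ]
            body = "".join(f"    {key:<32}{value}\n" for key, value in fields)
            blocks.append("define servicedependency {\n" + body + "}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical inventory: host name -> service descriptions.
    inventory = {"db01": ["Ping", "Oracle Health"], "web01": ["Ping", "HTTP"]}
    print(service_dependencies(inventory))
```

With ~16000 services this produces roughly as many definitions, so it is worth trying on a handful of hosts first.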
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Server issues when multiple hosts were down

Post by tmcdonald »

I'll address #1, but basically that is all handled by cron. Can you send over the logs under /usr/local/nagiosxi/var/ for that time? Specifically the sysstat.log-xxxxxx and possibly cmdsubsys.log-xxxxxx files.
Former Nagios employee
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Server issues when multiple hosts were down

Post by BanditBBS »

tmcdonald wrote:I'll address #1, but basically that is all handled by cron. Can you send over the logs under /usr/local/nagiosxi/var/ for that time? Specifically the sysstat.log-xxxxxx and possibly cmdsubsys.log-xxxxxx files.
Does #2 scare you?

Logs sent; you'll see there are errors in one of them.
Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Re: Server issues when multiple hosts were down

Post by Fred Kroeger »

Would definitely vote for #2 to be added as a Default feature.
It makes no sense to me to schedule service checks if the Host is Down.
Service Dependencies are not the answer to this - besides it is still broken if you implement it with Hostgroups instead of individual hosts.

Regards... Fred
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: Server issues when multiple hosts were down

Post by snapon_admin »

I also vote for number 2; I even think I have a feature request to add an option to disable service checks during downtime. We had a big maintenance event this weekend as well that borked up some of the checks because so many service checks were critical. Just checked and yep, here's my feature request for an option to disable checks while a host is in downtime, which was closed because it would be difficult for this feature to get traction: http://tracker.nagios.com/view.php?id=584

While looking for my feature request I also found another one very similar to it, also closed: http://tracker.nagios.com/view.php?id=658
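
Until something like that exists as an option, one manual workaround is to switch off active service checks per host for the maintenance window and switch them back on afterwards. The sketch below (Python) only builds the external command lines for DISABLE_HOST_SVC_CHECKS and ENABLE_HOST_SVC_CHECKS; the host names are placeholders, and this is not the behaviour asked for in the feature requests, since nothing ties it to downtime automatically.

```python
import time


def toggle_service_checks(hosts, enable, now=None):
    """Build external command lines that enable or disable active checks
    for all services on each of the given hosts."""
    now = int(time.time()) if now is None else int(now)
    cmd = "ENABLE_HOST_SVC_CHECKS" if enable else "DISABLE_HOST_SVC_CHECKS"
    return [f"[{now}] {cmd};{host}" for host in hosts]


if __name__ == "__main__":
    # Hypothetical hosts: disable before the window, enable after it.
    for line in toggle_service_checks(["db01", "db02"], enable=False):
        print(line)
```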
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Server issues when multiple hosts were down

Post by BanditBBS »

Yeah, I've had this discussion on the forums with them before and they swear there are reasons to keep checking services if a host is down. Someone even tried explaining an example... I knew I wasn't going to get anywhere so I just dropped it :(

Sure does sound like we have a few people at least that would love this feature. It'd have to be an option added to Core though to either stop checking during downtime or make services dependent on host automatically. They'd have to make it an option and not default though as they'd be changing behavior that has been in place for 15 years, but god would I kill for the option!
snapon_admin
Posts: 952
Joined: Mon Jun 10, 2013 10:39 am
Location: Kenosha, WI

Re: Server issues when multiple hosts were down

Post by snapon_admin »

Right, exactly. My suggestion was for an option rather than default behavior for the exact reason you stated: changing something that's been the way it is for as long as it has would be a bad idea. And the main reason given for continuing checks during downtime is that stopping them would screw with reports, since there's an option to include/exclude downtime. I do agree though, looks like there's at least some interest in this as a feature and I would also love it if this option ever made it into Nagios.
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Re: Server issues when multiple hosts were down

Post by WillemDH »

Seems very reasonable to have the option to not do service checks while the host is down. I would think most prefer this option above the Nagios server going down.
Please make it a global option. :)
Nagios XI 5.8.1
https://outsideit.net
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Server issues when multiple hosts were down

Post by BanditBBS »

WillemDH wrote:I would think most prefer this option above the Nagios server going down.
Excuse me for a moment, I'll be in the corner doing this: HAHAHAHAHAHAAHAHAHAHAHAHAHAHAHAAHAHAHAHAHA

Couldn't have said it better myself :ugeek:
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Server issues when multiple hosts were down

Post by vAJ »

I too had a major outage (thanks MS Hyper-V) and my Nagios instance took a major dump.

Took several restarts and clearing state to get back to good.
Andrew J. - Do you even grok?