Ok, let me describe what was going on here and what I was seeing out of the Nagios server over this weekend.
- Currently I have ~1000 hosts (avg 9 down) and ~16000 services (avg 800 issues) being monitored by my XI 2014r2.6 server.
- Average load is normally 1.5-3.0 and ~500 total processes.
- This weekend we had major work being performed in one of our datacenters that caused us to down ~300 hosts and ~4800 additional services.
- I scheduled downtimes for all the services and hosts (thank god for scripting!).
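One way to script bulk downtimes like this is to write external commands to the Nagios command file. The sketch below is only an illustration under stated assumptions: the host names, author, and comment are made up, the function names are hypothetical, and the command file path is the usual default for a source install, so it may differ on your system. It builds one `SCHEDULE_HOST_DOWNTIME` and one `SCHEDULE_HOST_SVC_DOWNTIME` line per host, which covers the host and all of its services.

```python
import time


def downtime_commands(hosts, start, end, author, comment, now=None):
    """Build external-command lines that put each host and all of its
    services into fixed downtime between start and end (Unix timestamps)."""
    now = int(time.time()) if now is None else int(now)
    duration = end - start
    lines = []
    for host in hosts:
        for cmd in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            # Fields: host;start;end;fixed (1);trigger_id (0);duration;author;comment
            lines.append(
                f"[{now}] {cmd};{host};{start};{end};1;0;{duration};{author};{comment}"
            )
    return lines


def submit(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the commands to the Nagios command pipe.
    The path is an assumption; check command_file in nagios.cfg."""
    with open(command_file, "w") as fh:
        for line in lines:
            fh.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical hosts and an 8-hour window starting now.
    start = int(time.time())
    hosts = ["db01.example.com", "db02.example.com"]
    for line in downtime_commands(hosts, start, start + 8 * 3600,
                                  "nagiosadmin", "DC maintenance"):
        print(line)
```

Running it as-is only prints the command lines; calling `submit()` on the Nagios server itself would actually schedule them.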
Here are the load and total process charts for the past 48 hours:
localhost-current_load.jpg
localhost-total_processes.jpg
You can tell when the maintenance was by the number of processes, and you can see the actual load on the server was not affected. The process count ran high because we give checks 660 seconds to finish, and all the check_oracle_health checks attempted during the outage were taking that long to time out. If you browsed Nagios, you could see that all checks were still being performed, notifications were being sent, and everything else was working properly. The issue I was having only affected administrators. Check out my dashboard:
nagios_status.JPG
I wasn't getting any status back from the checking script, apparently. I do have MySQL and ndo2db both offloaded onto another server. Both my XI server and the offloaded server are in the same VM cluster in a DC that was not affected by the maintenance. I tried bouncing services a few times, and it would work for maybe 1 minute and then go back to the same state. As soon as the outage was over, everything resolved itself and I didn't need to do anything. Maybe a server came up that it relies on somehow?
So, I have 2 questions out of this mess...
1. The check that runs and validates server performance and stuff: can you think of any reason it wouldn't be working properly during this mayhem, or anything in the script it may rely on when the items are offloaded like I have them?
2. Is there any setting I can make that automatically makes services dependent upon their hosts? I'd love to set that up so checks are not performed while the host is down. I know that isn't default behavior, but I don't want to have to create 1000+ dependency configs.