Engine hosed after network disconnect
Posted: Fri Oct 04, 2013 4:54 pm
by vAJ
Seen this twice now where a NagiosXI 2012R2.2 instance will hose up after the management network it sits on gets dropped from everything it monitors.
All hosts/services show down and will not clear, even after network connectivity is fully restored and verified. Only thing that fixes it is a restart of the monitoring engine.
Any ideas where to start on this?
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:05 am
by slansing
vAJ wrote:management network it sits on gets dropped from everything it monitors.
So this is something happening in your network and not in Nagios? How long do you wait after your network comes back up for Nagios to return to normal checking? Do some of your hosts come back up, but just slowly? This could be caused by your check intervals, or the checks could have reached their maximum number of retries and be waiting a very long time before re-checking.
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:13 am
by vAJ
Yes, right now our NetEng team is making some major modifications to L3 configuration which has caused brief, planned outages of the mgmt network. When this happens, all hosts & services light up the board. No reasonable amount of time (NOC has waited up to 2 hours) will bring these hosts/services back to OK status. Soon as they restart the monitoring engine, everything is good.
Check intervals all at 5min (retries at 1).
System:
Nagios XI Version : 2012R2.2
nagiosapp 2.6.32-279.1.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
Gnome is not installed
Total Hosts: 483
Total Services: 4469
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:27 am
by BanditBBS
vAJ,
Have you tried manually running any checks (from the command line) while the interface isn't working, to see whether any errors appear or whether the check runs fine?
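For whoever is awake during the window, here is a minimal sketch of what "running a check by hand" could look like when wrapped in a small script. It runs a plugin command line and maps the exit code to the standard plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The plugin path, the address 192.0.2.10, and the thresholds are assumptions for illustration, not values from this thread.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=30):
    """Run a plugin command line and return (state, first line of output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "UNKNOWN", "plugin timed out after %ds" % timeout
    except OSError as exc:
        # Plugin missing or not executable.
        return "UNKNOWN", "could not execute plugin: %s" % exc
    state = STATES.get(proc.returncode, "UNKNOWN")
    output = (proc.stdout or proc.stderr).strip()
    first_line = output.splitlines()[0] if output else ""
    return state, first_line


if __name__ == "__main__":
    # Assumed plugin path and placeholder address; adjust for your install.
    cmd = ["/usr/local/nagios/libexec/check_ping", "-H", "192.0.2.10",
           "-w", "100.0,20%", "-c", "500.0,60%"]
    print(run_check(cmd))
```

If a check run by hand as the nagios user comes back OK while the web interface still shows the host down, that would suggest the plugins are fine and the problem is somewhere between the engine and what the interface reads.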
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:42 am
by vAJ
Unfortunately, this always happens during a maintenance window when I am sleeping. The NOC takes it upon themselves to troubleshoot the issue without contacting me (they like me better when I get a full night's rest).
I'm still digging through logs from that last occurrence. Not finding much.
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:48 am
by abrist
Are the outages planned, and is downtime scheduled in nagios?
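If scheduling downtime by hand in the interface is the sticking point, it can also be scripted through the external command file. A rough sketch, assuming the usual Nagios XI command file location and a made-up host name (core-sw1); it builds a fixed SCHEDULE_HOST_DOWNTIME line and writes it to the pipe. Verify the path and field values against your own install before relying on it.

```python
import time

# Assumed default command-file path on a Nagios XI install; verify locally.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def build_host_downtime(host, start, end, author, comment, now=None):
    """Format a fixed SCHEDULE_HOST_DOWNTIME external command line."""
    if end <= start:
        raise ValueError("end must be after start")
    now = int(time.time()) if now is None else int(now)
    fields = [
        "SCHEDULE_HOST_DOWNTIME",
        host,
        str(int(start)),
        str(int(end)),
        "1",  # fixed downtime (starts and ends at the given times)
        "0",  # not triggered by another downtime entry
        str(int(end) - int(start)),  # duration in seconds
        author,
        comment,
    ]
    return "[%d] %s\n" % (now, ";".join(fields))


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios external command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


if __name__ == "__main__":
    # Hypothetical one-hour window starting ten minutes from now.
    start = int(time.time()) + 600
    print(build_host_downtime("core-sw1", start, start + 3600, "noc", "L3 maintenance"), end="")
```

The script only prints the command line; calling submit() on it from the maintenance runbook would be the step that actually schedules the downtime.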
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 9:55 am
by vAJ
If only I could get them to do this...
Mostly they don't mind the alert storm. Validates that monitoring is working...
But no, no changes are made to monitoring status. The outages are planned from an organizational standpoint, though.
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 11:54 am
by slansing
Well, when everything you are monitoring (potentially thousands of objects) is dropped at once, there can be some engine issues. Are you running an offloaded MySQL database on another section of your network? Or modgearman workers that do the check / perfdata processing for you remotely?
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 1:25 pm
by vAJ
No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
Re: Engine hosed after network disconnect
Posted: Mon Oct 07, 2013 1:30 pm
by BanditBBS
vAJ wrote:No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
This would definitely explain an issue I saw once before with my offloaded DB. Making more sense now. Eagerly awaiting a reply from Nagios.
