Seen this twice now where a NagiosXI 2012R2.2 instance will hose up after the management network it sits on gets dropped from everything it monitors.
All hosts/services show down and will not clear, even after network connectivity is fully restored and verified. Only thing that fixes it is a restart of the monitoring engine.
Any ideas where to start on this?
Engine hosed after network disconnect
Andrew J. - Do you even grok?
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Engine hosed after network disconnect
So this is something happening in your network and not in Nagios? How long do you wait after your network comes back up for Nagios to return to normal checking? Do some of your hosts come back up, but just slowly? This could be caused by your check intervals, or the checks could have reached their maximum retry count and be waiting a very long time before re-checking.
Re: Engine hosed after network disconnect
Yes, right now our NetEng team is making some major modifications to L3 configuration which has caused brief, planned outages of the mgmt network. When this happens, all hosts & services light up the board. No reasonable amount of time (NOC has waited up to 2 hours) will bring these hosts/services back to OK status. Soon as they restart the monitoring engine, everything is good.
Check intervals all at 5min (retries at 1).
System:
Nagios XI Version : 2012R2.2
nagiosapp 2.6.32-279.1.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
Gnome is not installed
Total Hosts: 483
Total Services: 4469
Andrew J. - Do you even grok?
Re: Engine hosed after network disconnect
vAJ,
Have you tried manually running any checks (from the command line) when the interface isn't working, to see whether any errors appear or the check runs fine?
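If it helps whoever is on shift, here is a minimal sketch of running one plugin by hand and reading its result. It relies only on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The plugin path, the target address, and the thresholds are assumptions, so adjust them for your install; it also assumes Python 3.7 or later is available on the box.

```python
import subprocess

# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv, timeout=30):
    """Run one plugin command and return (state, first line of its output)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "UNKNOWN", "plugin timed out after %ds" % timeout
    except OSError as exc:
        # Covers a missing or non-executable plugin file.
        return "UNKNOWN", "could not execute plugin: %s" % exc
    lines = (proc.stdout or proc.stderr).strip().splitlines()
    output = lines[0] if lines else ""
    # Any exit code outside 0-3 is treated as UNKNOWN.
    return STATES.get(proc.returncode, "UNKNOWN"), output


if __name__ == "__main__":
    # Hypothetical values: plugin path, host address, and thresholds are assumptions.
    cmd = ["/usr/local/nagios/libexec/check_ping", "-H", "192.0.2.10",
           "-w", "100.0,20%", "-c", "500.0,60%"]
    state, output = run_check(cmd)
    print(state, output)
```

If the check comes back OK from the shell while the interface still shows the host down, that would point at the engine or the database side rather than at the plugins.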
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
Re: Engine hosed after network disconnect
Unfortunately, this always happens during a maintenance window when I am sleeping. The NOC takes it upon themselves to troubleshoot the issue without contacting me (they like me better when I get a full night's rest).
I'm still digging through logs from that last occurrence. Not finding much.
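For narrowing the search, this is roughly the kind of filter I mean: a small sketch that pulls only the lines from a given time window out of the engine log. It assumes the usual `[epoch] message` line format of nagios.log; the log path and the window boundaries are passed on the command line and are whatever applies to your outage.

```python
import re
import sys
from datetime import datetime

# Nagios core log lines start with a Unix timestamp in square brackets.
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")


def lines_in_window(lines, start_epoch, end_epoch):
    """Yield (epoch, message) for log lines whose timestamp is inside the window."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip anything that is not a timestamped log line
        epoch = int(match.group(1))
        if start_epoch <= epoch <= end_epoch:
            yield epoch, match.group(2)


if __name__ == "__main__":
    # Usage: python logwindow.py LOGFILE START_EPOCH END_EPOCH
    if len(sys.argv) == 4:
        path, start, end = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
        with open(path, errors="replace") as fh:
            for epoch, message in lines_in_window(fh, start, end):
                print(datetime.fromtimestamp(epoch).isoformat(), message)
    else:
        print("usage: logwindow.py LOGFILE START_EPOCH END_EPOCH")
```

Pointing it at the window between the network coming back and the engine restart should show whether checks were still being executed and logged during that time.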
Andrew J. - Do you even grok?
Re: Engine hosed after network disconnect
Are the outages planned, and is downtime scheduled in Nagios?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Engine hosed after network disconnect
If only I could get them to do this...
Mostly they don't mind the alert storm. Validates that monitoring is working...
But no, no changes are made to monitoring status. But they are planned from an organization standpoint.
Andrew J. - Do you even grok?
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Engine hosed after network disconnect
Well, when everything you are monitoring (potentially thousands of objects) is dropped at once, there can be some engine issues. Are you running an offloaded MySQL database on another section of your network? Or modgearman workers which do the check / perfdata processing for you remotely?
Re: Engine hosed after network disconnect
No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
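To rule out the plain network path first, here is a minimal sketch of a TCP reachability test from the XI box to the database host. The address below is a placeholder and 3306 is just the MySQL default port, so both are assumptions. A successful connect only shows the path is back; it says nothing about whether the process writing to MySQL has re-established its own session.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical address for the offloaded MySQL server; replace with the real one.
    print(port_open("192.0.2.20", 3306))
```

If this returns True while the board still shows everything down, the stale state is more likely on the engine or DB session side than in the network.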
Andrew J. - Do you even grok?
Re: Engine hosed after network disconnect
vAJ wrote: No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
This would definitely explain an issue I saw once before with my offloaded DB. Making more sense now. Eagerly awaiting Nagios reply.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github