Engine hosed after network disconnect

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Post by vAJ »

I've seen this twice now: a Nagios XI 2012R2.2 instance hoses up after the management network it sits on loses connectivity to everything it monitors.

All hosts/services show down and will not clear, even after network connectivity is fully restored and verified. The only thing that fixes it is a restart of the monitoring engine.

Any ideas where to start on this?
Andrew J. - Do you even grok?
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Engine hosed after network disconnect

Post by slansing »

vAJ wrote:management network it sits on gets dropped from everything it monitors.
So this is something happening in your network and not in Nagios? How long do you wait after your network comes back up for Nagios to return to normal checking? Do some of your hosts come back up, just slowly? This could be caused by your check intervals, or the checks could have reached their maximum number of retries and are simply waiting a long time before the next re-check.
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Engine hosed after network disconnect

Post by vAJ »

Yes, right now our NetEng team is making some major modifications to the L3 configuration, which have caused brief, planned outages of the mgmt network. When this happens, all hosts & services light up the board. No reasonable amount of time (the NOC has waited up to 2 hours) will bring these hosts/services back to OK status. As soon as they restart the monitoring engine, everything is good.

Check intervals all at 5min (retries at 1).
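
For what it's worth, those numbers make slow scheduling an unlikely explanation. Here is a rough sketch of the arithmetic, assuming (this is not confirmed anywhere in this thread) that objects in a hard problem state are rescheduled at the normal check interval. The function name is only for illustration.

```python
# Back-of-the-envelope sketch: how many regular check cycles fit into the
# time the NOC waited? Assumes objects in a hard problem state are
# rescheduled at the normal check interval (an assumption, not a measurement).

CHECK_INTERVAL_MIN = 5   # check interval quoted in the post
WAITED_MIN = 2 * 60      # the NOC waited up to 2 hours


def missed_check_cycles(waited_min: float, check_interval_min: float) -> int:
    """Return the number of full check intervals that elapsed while waiting."""
    if check_interval_min <= 0:
        raise ValueError("check interval must be positive")
    return int(waited_min // check_interval_min)


if __name__ == "__main__":
    cycles = missed_check_cycles(WAITED_MIN, CHECK_INTERVAL_MIN)
    print(f"{cycles} check cycles elapsed without recovery")
```

With a 5-minute interval and a 2-hour wait this gives 24 full cycles, so if checks were still being executed and processed, every object should have been rechecked many times over.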

System:

Nagios XI Version : 2012R2.2
nagiosapp 2.6.32-279.1.1.el6.x86_64 x86_64
CentOS release 6.3 (Final)
Gnome is not installed

Total Hosts: 483
Total Services: 4469
Andrew J. - Do you even grok?
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Engine hosed after network disconnect

Post by BanditBBS »

vAJ,

Have you tried manually running any checks (from the command line) when the interface isn't working, to see whether any errors appear or the check runs fine?
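
Something like the following could be left with the NOC so they can capture a manual check result before restarting anything. It is only a sketch: the plugin directory, the target address, and the thresholds are assumptions to adjust for your install, and `run_check` is a made-up helper name. It maps the plugin exit code to the usual 0/1/2/3 state names.

```python
import subprocess
from typing import List, Tuple

# Assumed plugin location; adjust to wherever your plugins actually live.
PLUGIN_DIR = "/usr/local/nagios/libexec"

# Conventional plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command: List[str], timeout: int = 30) -> Tuple[str, str]:
    """Run one plugin by hand and return (state, first lines of output)."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return "UNKNOWN", f"plugin not found: {command[0]}"
    except subprocess.TimeoutExpired:
        return "UNKNOWN", f"timed out after {timeout}s"
    state = STATES.get(proc.returncode, "UNKNOWN")
    return state, (proc.stdout or proc.stderr).strip()


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address; thresholds are illustrative.
    cmd = [f"{PLUGIN_DIR}/check_ping", "-H", "192.0.2.10",
           "-w", "100.0,20%", "-c", "500.0,60%"]
    state, output = run_check(cmd)
    print(f"{state}: {output}")
```

If the manual run comes back OK while the interface still shows the host as down, that points away from the network and toward the engine or whatever processes its results.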
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Engine hosed after network disconnect

Post by vAJ »

Unfortunately, this always happens during a maintenance window when I am sleeping. The NOC takes it upon themselves to troubleshoot the issue without contacting me (they like me better when I get a full night's rest).

I'm still digging through logs from that last occurrence. Not finding much.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Engine hosed after network disconnect

Post by abrist »

Are the outages planned, and is downtime scheduled in Nagios?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Engine hosed after network disconnect

Post by vAJ »

If only I could get them to do this...

Mostly they don't mind the alert storm. Validates that monitoring is working...

But no, no changes are made to monitoring status. The outages are planned from an organizational standpoint, though.
Andrew J. - Do you even grok?
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Engine hosed after network disconnect

Post by slansing »

Well, when everything you are monitoring (potentially thousands of objects) is dropped at once, there can be some engine issues. Are you running an offloaded MySQL database on another section of your network? Or modgearman workers that do the check / perfdata processing for you remotely?
vAJ
Posts: 456
Joined: Thu Nov 08, 2012 5:09 pm
Location: Austin, TX

Re: Engine hosed after network disconnect

Post by vAJ »

No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
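
One thing that could be checked from the XI box during the next outage is whether the database host is reachable again once the network returns. A minimal sketch, assuming MySQL listens on the default port 3306; the hostname is a placeholder and `db_port_reachable` is a made-up helper name. A successful TCP connect only shows that the network path is back, not that the engine's database writer has actually reconnected.

```python
import socket


def db_port_reachable(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder hostname for the offloaded MySQL server.
    host = "db.example.internal"
    status = "reachable" if db_port_reachable(host) else "NOT reachable"
    print(f"{host}:3306 is {status}")
```

If the port is reachable but hosts and services stay stuck until the engine is restarted, that would be consistent with the lost DB connection not being re-established automatically.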
Andrew J. - Do you even grok?
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Engine hosed after network disconnect

Post by BanditBBS »

vAJ wrote:No modgearman, but offloaded MySQL. If engine loses DB connection, is it not resilient enough to regain connectivity? Thus a manual engine restart is necessary?
This would definitely explain an issue I saw once before with my offloaded DB. Making more sense now. Eagerly awaiting Nagios reply :)