XI Failing without Reporting IT

Posted: Thu Aug 20, 2015 9:55 am
by rseiwert
Had an interesting morning due to a communication failure with our SAN. The problem was in the network switches, but that is not why I'm posting here. When I checked this morning, XI showed everything up and no issues. We run Nagios XI 2014R2.6 and auto-login a read-only user by default. We use the Operation Center screen, which was time-stamped with the current date and time and was not reporting anything down.

When the SAN is down, all the VM clusters are down, and Nagios XI runs on the VM cluster along with email, database, and file servers, Citrix PNA, and everything else. Everything was down (including Nagios), but Nagios didn't know it. Nagios's HD had been ripped out from under it, but according to the screen everyone looks at, everything was OK.

It is my humble opinion that when invalid data is being presented there should be some indication that it is invalid. I do have checks for XI Daemons, XI Jobs, ActiveHostChecks, and ActiveServiceChecks, but of course if Nagios itself has failed they mean nothing. I really, really feel that when the checks are stale they should be reported as such. I also feel that the system health checks should be shown to non-admin users so they at least have a clue that what they are looking at is invalid. Finally, the system health checks need to be made more accurate: actually verifying that the PID belongs to the process it is supposed to, and if possible checking for a heartbeat on the service.
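To illustrate what I mean by verifying the PID, here is a rough Python sketch, not anything XI actually does. It assumes a Linux host with /proc, and the function name `pid_matches`, the PID file path, and the process name are all hypothetical. It reads the PID from a PID file and only reports healthy if that PID's command line really mentions the expected program, so a stale PID file or a recycled PID counts as a failure.

```python
import os


def pid_matches(pid_file, expected_name, proc_root="/proc"):
    """Return True only if the PID stored in pid_file belongs to a live
    process whose command line mentions expected_name.

    Hypothetical helper: the paths and names are assumptions, and the
    /proc layout is Linux-specific.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        # /proc/<pid>/cmdline holds the argv entries separated by NUL bytes.
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            argv = f.read().split(b"\0")
    except (OSError, ValueError):
        # Missing PID file, garbage contents, or no such process.
        return False
    wanted = expected_name.encode()
    return any(wanted in arg for arg in argv if arg)
```

Usage would be something like `pid_matches("/path/to/daemon.pid", "nagios")`, with the real PID file location filled in for your install. This only proves the process exists and is the right program; it says nothing about whether it is still doing work, which is why a heartbeat is needed as well.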

Re: XI Failing without Reporting IT

Posted: Thu Aug 20, 2015 10:20 am
by WillemDH
Monitor the production monitoring Nagios XI from another Nagios Core or XI, preferably from another datacenter / location. You can install a Nagios XI free edition that can monitor up to 7 hosts. Grtz

Re: XI Failing without Reporting IT

Posted: Thu Aug 20, 2015 1:02 pm
by rseiwert
A good idea. But if Nagios is reporting it's up, will that really help? Ping and HTTP checks would not in this instance. Remotely executing the system health checks might, but it has been documented that these are not accurate.

The point of Nagios to me is single-pane-of-glass monitoring. Having to check Nagios, then check the Nagios that checks Nagios, defeats that. Quis custodiet ipsos custodes? The true problem is that the XI PHP that generates the web pages should be able to figure out that something is rotten or stale, or whether it even has a heartbeat.

Then present that information to non-admin users. System health is only exposed to administrators. Most people here do not log in; rather, they use the read-only autologin until they need to acknowledge an issue or configure the system.
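The staleness test itself is not complicated. Here is a minimal Python sketch of the logic, purely as an illustration and not how XI works internally. The function name `staleness_banner`, the 300-second threshold, and where `last_update` comes from (for example, the newest check result the page knows about) are all my assumptions. The same comparison could just as well be done in the page that renders the screen.

```python
import time


def staleness_banner(last_update, now=None, max_age=300):
    """Return a warning string if the newest monitoring data is too old,
    or None if it is fresh enough.

    last_update: epoch seconds of the newest check result (assumed input).
    max_age: allowed age in seconds; 300 is an arbitrary example value.
    """
    if now is None:
        now = time.time()
    if last_update is None:
        return "WARNING: no monitoring data has ever been received"
    age = int(now - last_update)
    if age > max_age:
        return "WARNING: monitoring data is {} seconds old and may be invalid".format(age)
    return None
```

Anything non-None would be shown to every user, read-only or not, at the top of the screen.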

Re: XI Failing without Reporting IT

Posted: Thu Aug 20, 2015 1:46 pm
by jdalrymple
I can recreate your circumstances pretty easily, and I agree: the first line of defense should be some amount of monitoring of Nagios performed by the browser.

I'll bring it up with the devs and let you know what they say.

Re: XI Failing without Reporting IT

Posted: Thu Aug 20, 2015 2:39 pm
by jdalrymple
Internal feature request created. No ETA as usual, but given how high-profile this issue is, I'd expect it to receive high priority from the devs and make it into the next version.

Re: XI Failing without Reporting IT

Posted: Fri Aug 21, 2015 2:35 am
by Box293
FYI there is the "Nagios XI Wizard", which checks a number of XI things, and the target can be a remote server. So have a free XI instance monitoring production, and production monitoring the free instance.