XI Failing without Reporting It
Posted: Thu Aug 20, 2015 9:55 am
Had an interesting morning due to a communication failure with our SAN. The problem was in the network switches, but that is not why I'm posting here. When I checked this morning, XI showed everything up and no issues. We run Nagios XI 2014R2.6 and auto-login a read-only user by default. We use the Operation Center screen, which was timestamped with the current date and time and was not reporting anything down.
When the SAN is down, all the VM clusters are down, and Nagios XI runs on the VM cluster along with email, database, file servers, Citrix PNA, and everything else. Everything was down (including Nagios), but Nagios didn't know it. Nagios's hard disk had been ripped out from under it, but according to the screen everyone looks at, everything was OK.
It is my humble opinion that when invalid data is being presented, there should be some indication that it is invalid. I do have checks for XI Daemons, XI Jobs, ActiveHostChecks, and ActiveServiceChecks, but of course when Nagios itself has failed they mean nothing. I really, really feel that when the checks are stale they should somehow be reported as such. I also feel that the system health checks should be reported to non-admin users so they at least have a clue that what they are looking at is invalid. Finally, the system health checks need to be made more accurate, actually verifying that the PID belongs to the process it is supposed to and, if possible, checking for a heartbeat on the service.
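To make the two requests concrete, here is a rough Python sketch of what I mean by "stale" and by "verifying the PID". This is not how XI does it today. The function names, the `tolerance` factor, and the `proc_root` parameter are all made up for illustration, and the PID check assumes a Linux-style /proc layout. The staleness test would have to run somewhere that survives Nagios itself dying (for example in whatever renders the screen), not as a check scheduled by Nagios.

```python
import os
import time


def is_stale(last_check_epoch, check_interval_s, now=None, tolerance=2.0):
    """Return True if the last check result is older than tolerance * interval.

    last_check_epoch and now are Unix timestamps in seconds. The tolerance
    factor is an assumed safety margin, not a Nagios setting.
    """
    if now is None:
        now = time.time()
    return (now - last_check_epoch) > tolerance * check_interval_s


def pid_matches(pid, expected_name, proc_root="/proc"):
    """Return True only if the PID exists and its executable name contains expected_name.

    Reads <proc_root>/<pid>/cmdline, where arguments are NUL-separated.
    A missing or unreadable entry counts as "not the expected process".
    """
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as fh:
            argv = fh.read().split(b"\0")
    except OSError:
        return False
    # argv[0] is empty for an empty cmdline (e.g. kernel threads or zombies).
    return bool(argv[0]) and expected_name.encode() in os.path.basename(argv[0])
```

With a 5-minute check interval and the default tolerance, a result last updated 700 seconds ago would be flagged as stale, while one updated 500 seconds ago would not. A PID file pointing at a recycled PID would fail `pid_matches` instead of being reported as healthy.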