
Does an actual "Windows Server" check exist?

Posted: Fri Jan 13, 2012 11:51 am
by sircharlo
Hey guys!

I've set up a Nagios installation on Ubuntu for work.
It works really well, and everyone loves it.

So far I've got about 60 hosts and 150 services, and everything's running mighty fine.

I've got a question though.
One of our servers (W2K3, secondary DC) crashed the other day.
There was an error message on the login screen, and the event viewer was filled with "delayed write" and "unable to write to registry" errors.
Basically, bad.

However, Nagios failed to alert us.
The server was somehow responding to pings, reporting its uptime, memory usage, and CPU usage.
So the host wasn't actually 'down' according to Nagios.

The only warning came from the drive monitoring, where NRPE timed out because the performance counters could not be queried after the failure.

This left me wondering: is there a plugin or check command that will actually test to see if a Windows server is truly up and running?

A check that would perform tests such as:
  • testing a registry read/write
  • testing a disk read/write
  • checking to make sure there are no error messages on the login screen
  • checking that network shares are readable
  • etc.
Or am I just pushing it here?

Thanks!

Re: Does an actual "Windows Server" check exist?

Posted: Sun Jan 15, 2012 7:30 pm
by jsmurphy
In our environment we manage lower-level availability in two ways. We use NSClient++ to monitor Windows services as a basic application availability check (see first example), and we use the NagEventLog client to capture specific niche failures (e.g. AD replication failures, drive write failures). This document is for Nagios XI rather than Core, but you should still be able to get the basic information you need from it: http://assets.nagios.com/downloads/nagi ... entLog.pdf

This is an example command definition for catch-all service monitoring:
check_nrpe -H $HOSTADDRESS$ -u -c CheckServiceState -p 5666 -a CheckAll exclude=MSIServer

This will turn critical for any service that is set to automatic but isn't started. I've excluded MSIServer because, while it is an automatic service, it doesn't always run.
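
For the disk read/write test asked about in the original post, one option might be a small custom script that the agent runs on the server itself. The following is only a rough sketch under assumptions, not an existing plugin: the function names, the probe file name, and the choice of Python are all hypothetical, and how the script would be wired into NRPE or NSClient++ is left open. The only fixed convention it relies on is the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL).

```python
"""Sketch of a Nagios-style disk read/write probe (hypothetical, not an official plugin)."""
import os
import tempfile
import uuid

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_disk_rw(directory):
    """Write a small file in `directory`, read it back, and compare.

    Returns an (exit_code, message) tuple.
    """
    token = uuid.uuid4().hex
    payload = token.encode("ascii")
    # Hypothetical probe file name; unique per run so checks never collide.
    path = os.path.join(directory, "nagios_rw_probe_" + token + ".tmp")
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the write to disk
        with open(path, "rb") as handle:
            data = handle.read()
    except OSError as exc:
        return CRITICAL, "DISK RW CRITICAL - %s" % exc
    finally:
        # Best-effort cleanup; the file may not exist if the write failed.
        try:
            os.remove(path)
        except OSError:
            pass
    if data != payload:
        return CRITICAL, "DISK RW CRITICAL - read-back mismatch in %s" % directory
    return OK, "DISK RW OK - wrote and read %d bytes in %s" % (len(payload), directory)


def main(argv):
    """Print the one-line status Nagios expects and return the exit code."""
    target = argv[1] if len(argv) > 1 else tempfile.gettempdir()
    code, message = check_disk_rw(target)
    print(message)
    return code
```

A real script would end with something like sys.exit(main(sys.argv)) so that the exit code reaches Nagios. Note that if the server is in bad enough shape that the agent cannot run the script at all, the check would time out rather than return CRITICAL, so the timeout itself still needs to raise an alert.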