Looking for the best approach to Windows drive monitoring

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
OUberLord
Posts: 17
Joined: Wed Mar 25, 2015 8:31 am

Looking for the best approach to Windows drive monitoring

Post by OUberLord »

Hello,

Currently we are monitoring the Windows disks in our environment using the following command:

define command{
        command_name    check-nrpe-win-drive-free
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $ARG1$ -c checkdrivesize -a ShowAll "FilterType=FIXED" CheckAll "MinWarnFree=$ARG2$" "MinCritFree=$ARG3$"
        }
This command works great in that it polls all drives on the host, and raises alerts as one would expect. However, the issue we face is this alert does nothing (nor should it) to detect if a drive suddenly goes missing. If iSCSI flips out and suddenly poor L drive is no more, this check simply shows the drives that still are presented and nothing (for it) is amiss.

On the flip side, I could manually monitor each disk on each server via individual services which would detect such failures. However, this option doesn't scale well, and requires manual intervention to make sure that it stays current.

Is there a better option? One that keeps the ease of disk checking down to a single boilerplate service check, but maybe some other means of being able to detect when a drive goes offline?
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Looking for the best approach to Windows drive monitoring

Post by jdalrymple »

OUberLord wrote:If iSCSI flips out and suddenly poor L drive is no more, this check simply shows the drives that still are presented and nothing (for it) is amiss.
You'd think that one of your other services might have an issue with the missing L drive? :)
OUberLord wrote:Is there a better option? One that keeps the ease of disk checking down to a single boilerplate service check, but maybe some other means of being able to detect when a drive goes offline?
The one you're using is likely the best boilerplate option today - and you're right it is not stateful so it won't detect if a drive is there and goes away. To reimplement the work that has been done in nsclient++ AND add in the additional feature you want is unlikely to happen outside of nsclient++. Also our magic wizards and such would no longer work with that, which you probably couldn't care less about. I think you're going to be reduced to bolting on the additional service check, one for which I see plausible candidates if you were talking Linux on the Exchange, but not Windows. I would expect a simple script to be pretty easy to powershell up.

In the meantime you could definitely put a feature request out on the nsclient++ github.
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Looking for the best approach to Windows drive monitoring

Post by Box293 »

What you could do is:

Create separate hostgroups for hosts with each drive letter.
Create individual service checks for each drive and assign them to the relevant hostgroup.
Create a "All Disks OK" service that uses a custom perl/python/php plugin that:
a) executes check_nrpe request to the windows server which executes a local script that reports all the fixed drive letters
b) plugin takes that nrpe result and then queries nagios using json query
c) json query gets all the services for the host
d) compares the nrpe result of known fixed drives against existing drive services for that host
e) if there is a mismatch then return a critical state

There's a bit of work there but it's quite possible to make it happen.
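As a rough illustration of steps d) and e), here is a minimal Python sketch of just the comparison logic. It assumes the NRPE script prints drive letters as "C: D: L:" and that the per-drive services follow a hypothetical "Drive X" naming convention; neither comes from an actual setup, so adjust both to match yours. Fetching the NRPE output and the service list from the JSON query is left out on purpose.

```python
import re

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Hypothetical naming convention: per-drive services are called
# "Drive C", "Drive L", and so on. Change this to match your config.
SERVICE_PATTERN = re.compile(r"^Drive ([A-Z])$")


def parse_reported_drives(nrpe_output):
    """Extract drive letters from output such as 'C: D: L:' (assumed format)."""
    return set(re.findall(r"\b([A-Z]):", nrpe_output.upper()))


def configured_drives(service_names):
    """Drive letters that already have a per-drive service on the host."""
    letters = set()
    for name in service_names:
        match = SERVICE_PATTERN.match(name)
        if match:
            letters.add(match.group(1))
    return letters


def compare_drives(reported, configured):
    """Return (exit_code, message); any mismatch is CRITICAL."""
    missing = sorted(configured - reported)      # service exists, drive is gone
    unmonitored = sorted(reported - configured)  # drive exists, no service yet
    if missing or unmonitored:
        parts = []
        if missing:
            parts.append("missing: " + ", ".join(missing))
        if unmonitored:
            parts.append("unmonitored: " + ", ".join(unmonitored))
        return CRITICAL, "CRITICAL - drive mismatch (" + "; ".join(parts) + ")"
    return OK, "OK - all %d fixed drives have a service" % len(reported)
```

A full plugin would print the message and exit with the returned code. Because the comparison runs both ways, a newly added drive without a service is flagged as well as a drive that has vanished.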
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
OUberLord
Posts: 17
Joined: Wed Mar 25, 2015 8:31 am

Re: Looking for the best approach to Windows drive monitoring

Post by OUberLord »

Apologies for the late response here, I could have sworn I had notifications on... :P
jdalrymple wrote:You'd think that one of your other services might have an issue with the missing L drive? :)
Haha yeah, for the most part you're right, but we've had a couple of failures recently on drives that don't really have a service directly dependent on them. The thought crossed my mind that I could also implement other types of checks, but it's all too easy to fall into some pretty hairy options along that line of thought when it comes to scalability and upkeep.
jdalrymple wrote:I think you're going to be reduced to bolting on the additional service check, one for which I see plausible candidates if you were talking Linux on the Exchange, but not Windows. I would expect a simple script to be pretty easy to powershell up.
I had a similar thought, but I grapple with what the actual method would be. I envisioned being able to have a script that enumerates what volumes and partitions are available, but falter in figuring out how to make it gracefully (if not automatically) handle newly added drives while also alerting if such a drive were to disappear.
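One possible shape for that idea, sketched in Python purely to show the logic (on the Windows host itself this would more likely be PowerShell): keep a baseline of drive letters in a state file, add newly seen drives to it automatically, and go critical when a previously seen drive is absent. The state file location and the way drive letters are collected are assumptions, not anything NSClient++ provides.

```python
import json
import os
import tempfile


def check_against_baseline(current, state_path):
    """Compare the current drive letters with a stored baseline.

    New drives are added to the baseline automatically. A drive that was
    seen before and is now absent yields CRITICAL (exit code 2).
    """
    current = set(current)
    baseline = set()
    if os.path.exists(state_path):
        with open(state_path) as handle:
            baseline = set(json.load(handle))

    missing = sorted(baseline - current)

    # Missing drives stay in the baseline so the alert persists until the
    # drive returns or someone deliberately resets the state file.
    with open(state_path, "w") as handle:
        json.dump(sorted(baseline | current), handle)

    if missing:
        return 2, "CRITICAL - drives missing: " + ", ".join(missing)
    return 0, "OK - drives present: " + ", ".join(sorted(current))


if __name__ == "__main__":
    # Demo with a throwaway state file: L is learned, then disappears.
    state = os.path.join(tempfile.mkdtemp(), "drives.json")
    print(check_against_baseline(["C", "D", "L"], state))
    print(check_against_baseline(["C", "D"], state))
```

The trade-off is that intentionally removing a drive needs a manual reset of the state file, otherwise the check stays critical.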
Box293 wrote:What you could do is:

Create separate hostgroups for hosts with each drive letter.
Create individual service checks for each drive and assign them to the relevant hostgroup.
Create a "All Disks OK" service that uses a custom perl/python/php plugin that:
a) executes check_nrpe request to the windows server which executes a local script that reports all the fixed drive letters
b) plugin takes that nrpe result and then queries nagios using json query
c) json query gets all the services for the host
d) compares the nrpe result of known fixed drives against existing drive services for that host
e) if there is a mismatch then return a critical state

There's a bit of work there but it's quite possible to make it happen.
I've put some thought along those lines, but I have concerns about the manual upkeep needed to keep it current as new drives are added. Though, in reading your method and thinking about it more fully, it seems it would return a critical state not only if a drive that should be there is missing but also if a drive exists that has no service yet. It's a tiny bit clunky, but I really like that it handles both failure cases and thus removes the concern of "missing" a drive.

I am woefully unfamiliar with how to create custom plugins for Nagios and with the overall idea of json queries. Is there a good source of documentation online that I can use to bring myself up to speed on such?
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: Looking for the best approach to Windows drive monitoring

Post by Box293 »

OUberLord wrote:I am woefully unfamiliar with how to create custom plugins for Nagios and with the overall idea of json queries. Is there a good source of documentation online that I can use to bring myself up to speed on such?
Here is some stuff on JSON:
http://labs.nagios.com/2014/06/19/explo ... -7-part-1/

Here are the official plugin development guidelines:
https://nagios-plugins.org/doc/guidelines.html

Here are some notes on NSClient++ and scripting:
http://docs.nsclient.org/tutorial/core/index.html

This is going to require some scripting skills. Plugins aren't that hard to write, because at the end of the plugin all you return to Nagios (or NSClient++) is some text and an exit code. With what you're trying to do, it's going to get complex. Start off simple and expand from there.
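To make the "some text and an exit code" part concrete, here is a minimal Python skeleton. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard ones from the plugin guidelines; the script name, argument order and thresholds are made up for illustration.

```python
# Exit codes defined by the Nagios plugin guidelines
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(free_percent, warn, crit):
    """Map a free-space percentage to (exit_code, status_line)."""
    if free_percent < crit:
        code = CRITICAL
    elif free_percent < warn:
        code = WARNING
    else:
        code = OK
    return code, "DISK %s - %.1f%% free" % (LABELS[code], free_percent)


def main(argv):
    """Parse FREE WARN CRIT from argv, print one status line, return the code."""
    try:
        free, warn, crit = (float(value) for value in argv[1:4])
    except ValueError:
        # Wrong argument count or non-numeric input
        print("DISK UNKNOWN - usage: check_example.py FREE WARN CRIT")
        return UNKNOWN
    code, line = evaluate(free, warn, crit)
    print(line)
    return code


# In a real plugin file, finish with:
#     import sys
#     sys.exit(main(sys.argv))
```

Nagios reads the first line of output as the status text and uses the process exit code as the service state, so everything more complex can be built on top of this shape.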