Weird behavior with clustered drives Nagios was monitoring

Post by **logic_bomb421** » Tue Nov 11, 2014 9:02 pm

I have some weird behavior I need to figure out, and I was hoping someone here might be able to shed some light on it.

I have a set of SQL servers that feed our database. These servers use a series of clustered drives. The basic configuration is ClusterSrv1 and ClusterSrv2, plus SQLSrv1. ClusterSrv1 and ClusterSrv2 are configured so that if one of them goes down, the cluster of drives is handed off to its partner to keep the database running. SQLSrv1 is just an alias for whichever cluster server currently controls the drive cluster. Nagios was set to monitor SQLSrv1, along with all its drives (technically the clustered drives).

This brings me to the issue I'm currently facing. This past weekend, the cluster server that had control of the drive cluster had an incident, and all the drives were passed off to its partner. Ever since then, Nagios has been throwing criticals for all the drives. In the basic services overview, each drive says "Free disk space: Invalid Drive", and when you view each drive individually, the status returns UNKNOWN.

I'm not sure why this is happening. Yes, the cluster of drives was passed from one cluster server to the other over the weekend, but as far as Nagios knows, nothing should have changed (since it's only looking at SQLSrv1, the alias for whichever cluster server is in control). What am I overlooking here?

All I'm using to check the drives is checknt!USEDDISKSPACE.

Thank you.

Post by **Box293** » Tue Nov 11, 2014 10:53 pm

Can you show us:

the host definition for SQLSrv1
the service definition that uses the command checknt!USEDDISKSPACE
the command definition for checknt

Post by **logic_bomb421** » Wed Nov 12, 2014 1:28 pm

Box293 wrote: the host definition for SQLSrv1

Code:

define host{
	use             windows-server
	host_name       SQLSrv1
	alias           SQLSrv1
	address         /IP address of SQLSrv1/
	statusmap_image server.png
	}

Box293 wrote: the service definition that uses the command checknt!USEDDISKSPACE

This is a selection of the clustered disk monitoring definitions. They all look like this.

Code:

define service{
	use                 generic-service
	host_name           SQLSrv1
	service_description L:\ Clustered Disk
	check_command       check_nt!USEDDISKSPACE!-l l -w 90 -c 95
	}

define service{
	use                 generic-service
	host_name           SQLSrv1
	service_description N:\ TempDB1
	check_command       check_nt!USEDDISKSPACE!-l n -w 83 -c 91.5
	}
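
For readers unfamiliar with the -w and -c values in these definitions, here is a minimal Python sketch of how a used-disk-space check maps a percentage to a Nagios state. It is not the plugin's actual source. The function name `used_disk_state` is hypothetical, and the sketch assumes the thresholds are compared with "greater than or equal" and that a drive the agent cannot read produces UNKNOWN, which matches the "Invalid Drive" / UNKNOWN result described in this thread. The numeric codes 0 to 3 are the standard Nagios plugin return codes.

```python
def used_disk_state(used_percent, warn, crit):
    """Map a used-space percentage to a (code, label) pair.

    used_percent is None when the drive could not be read at all
    (an assumption standing in for the "Invalid Drive" case).
    """
    if used_percent is None:
        return 3, "UNKNOWN"
    if used_percent >= crit:
        return 2, "CRITICAL"
    if used_percent >= warn:
        return 1, "WARNING"
    return 0, "OK"


# L:\ uses -w 90 -c 95, so 92% used would be a WARNING under these assumptions.
print(used_disk_state(92, 90, 95))
# A drive the agent cannot see is neither OK nor over a threshold.
print(used_disk_state(None, 90, 95))
```

The point of the sketch is that an unreadable drive never reaches the threshold comparison, so changing -w or -c would not be expected to clear the UNKNOWN state.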

Box293 wrote: the command definition for checknt

Code:

# 'check_nt' command definition
define command{
	command_name	check_nt
	command_line	$USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
	}
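
To see what Nagios ends up running for one of the services above, the check_command can be expanded by hand against this command_line. The following is a rough Python sketch of that substitution, not Nagios's own macro processor: the function name `expand_check_command`, the plugin directory `/usr/local/nagios/libexec`, and the address `192.0.2.10` are all placeholder assumptions, and escaped `!` characters are not handled.

```python
import re


def expand_check_command(check_command, command_line, macros):
    """Return (command_name, expanded_line) for a Nagios-style check_command."""
    # Text before the first '!' is the command name; the rest become $ARG1$, $ARG2$, ...
    name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg

    def substitute(match):
        # In this sketch, macros with no value expand to an empty string.
        return values.get(match.group(1), "")

    return name, re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)


command_line = "$USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$"
macros = {
    "USER1": "/usr/local/nagios/libexec",  # assumed plugin directory
    "HOSTADDRESS": "192.0.2.10",           # placeholder for the SQLSrv1 address
}
name, line = expand_check_command(
    "check_nt!USEDDISKSPACE!-l l -w 90 -c 95", command_line, macros
)
print(line)
```

Running the expanded line by hand from the Nagios server, once against the SQLSrv1 address and once against each cluster node, is one way to tell whether the "Invalid Drive" answer comes from the agent on the node that now owns the drives.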

I also included the template definitions that these reference, just in case.

Code:

define host{
	name			windows-server	; The name of this host template
	use			generic-host	; Inherit default values from the generic-host template
	check_period		24x7		; By default, Windows servers are monitored round the clock
	check_interval		1		; Actively check the server every minute
	retry_interval		0.5		; Schedule host check retries at 0.5 minute intervals
	max_check_attempts	2		; Check each server 2 times (max)
	check_command		check-host-alive	; Default command to check if servers are "alive"
	notification_period	24x7		; Send notification out at any time - day or night
	notification_interval	10		; Resend notifications every 10 minutes
	notification_options	d,r		; Only send notifications for specific host states
	contact_groups		admins		; Notifications get sent to the admins by default
	hostgroups		2		; Host groups that Windows servers should be a member of
	register		0		; DONT REGISTER THIS - ITS JUST A TEMPLATE
	}

Code:

define service{
        name                            generic-service 	; The 'name' of this service template
        active_checks_enabled           1       		; Active service checks are enabled
        passive_checks_enabled          1    		   	; Passive service checks are enabled/accepted
        parallelize_check               1       		; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       		; We should obsess over this service (if necessary)
        check_freshness                 0       		; Default is to NOT check service 'freshness'
        notifications_enabled           1       		; Service notifications are enabled
        event_handler_enabled           1       		; Service event handler is enabled
        flap_detection_enabled          1       		; Flap detection is enabled
        process_perf_data               1       		; Process performance data
        retain_status_information       1       		; Retain status information across program restarts
        retain_nonstatus_information    1       		; Retain non-status information across program restarts
        is_volatile                     0       		; The service is not volatile
        check_period                    24x7			; The service can be checked at any time of the day
        max_check_attempts              3			; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           5			; Check the service every 5 minutes under normal conditions
        retry_check_interval            2			; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins			; Notifications get sent out to everyone in the 'admins' group
        notification_options            u,c,r			; Send notifications about unknown, critical, and recovery events
        notification_interval           10			; Re-notify about service problems every 10 minutes
        notification_period             24x7			; Notifications can be sent out at any time
        register                        0       		; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

Post by **millisa** » Wed Nov 12, 2014 5:56 pm

If you restart the NSClient service on your SQL nodes, do the drives get picked up then?
I have a vague recollection of drive letter issues on an old Win2k SQL cluster, where the drive letters didn't show up for an already logged-in session until logout/login.

Quick edit: This may apply:
http://nsclient.org/nscp/wiki/guides/clusters

Post by **slansing** » Wed Nov 12, 2014 6:00 pm

Awesome tip, millisa, thanks! Let us know what you find out, logic.

Post by **logic_bomb421** » Tue Nov 18, 2014 12:32 pm

Thank you all for the replies. Unfortunately, this did not seem to work for me. I'm still seeing the UNKNOWN status on all the drives on this server.

Post by **abrist** » Tue Nov 18, 2014 5:47 pm

Are these mapped drives? I ask because mapped drives cannot be checked the normal way unless the user is logged in.

Nagios Support Forum
