I'm seeing some weird behavior that I need to figure out, and I was hoping someone here could shed some light on it.
We have a set of SQL servers that feed our database. These servers use a series of clustered drives. The basic configuration is ClusterSrv1 and ClusterSrv2, plus SQLSrv1. ClusterSrv1 and ClusterSrv2 are configured so that, if one of them goes down, the cluster of drives is handed off to its partner to keep the database running. SQLSrv1 is just an alias for whichever cluster server currently controls the drive cluster. Nagios was set to monitor SQLSrv1, along with all its drives (technically, the clustered drives).
This brings me to the issue I'm facing. This past weekend, the cluster server that had control of the drive cluster at the time had an incident, and all the drives were passed off to its partner. Ever since then, Nagios has been throwing criticals for all the drives. In the basic services overview, each drive says "Free disk space: Invalid Drive", and when you view each drive individually, the status returns UNKNOWN.
I'm not sure why this is happening. Yes, the cluster of drives was passed from one cluster server to the other over the weekend, but as far as Nagios should know, nothing changed (since it's only looking at SQLSrv1, the alias for whichever cluster server is in control). What am I overlooking here?
All I'm using to check the drives is checknt!USEDDISKSPACE.
Thank you.
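One way to test the assumption that nothing changed from Nagios's point of view is to run the same check by hand against the alias and against each cluster node, then compare the results. The sketch below is only an illustration: the plugin path is an assumption (adjust it to wherever `check_nt` lives on your Nagios server), the function names are hypothetical, and it assumes all three names resolve from the Nagios server and that NSClient is listening on port 12489 on each node.

```python
import subprocess

# Assumed plugin location; this is a common default, not taken from the thread.
PLUGIN = "/usr/local/nagios/libexec/check_nt"

# Host names as described in the post: the alias plus both cluster nodes.
HOSTS = ["SQLSrv1", "ClusterSrv1", "ClusterSrv2"]


def build_command(host, drive, plugin=PLUGIN, port=12489):
    """Build the argument list for a USEDDISKSPACE check without thresholds."""
    return [plugin, "-H", host, "-p", str(port),
            "-v", "USEDDISKSPACE", "-l", drive]


def run_check(host, drive):
    """Run the plugin once and return (exit_code, first line of output)."""
    proc = subprocess.run(build_command(host, drive),
                          capture_output=True, text=True, timeout=30)
    output = proc.stdout.strip().splitlines()
    return proc.returncode, output[0] if output else ""


def compare_hosts(drives=("l", "n")):
    """Print what each host reports for each drive letter."""
    for host in HOSTS:
        for drive in drives:
            try:
                code, text = run_check(host, drive)
            except (OSError, subprocess.TimeoutExpired) as exc:
                code, text = 3, f"check could not run: {exc}"
            print(f"{host} {drive}: exit={code} {text}")
```

Calling `compare_hosts()` on the Nagios server would show whether the alias and the node that now owns the drives give the same answer. If a node answers correctly but SQLSrv1 does not, the alias is the place to look.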
Weird behavior with clustered drives Nagios was monitoring
- Box293
Re: Weird behavior with clustered drives Nagios was monitoring
Can you show us:
- the host definition for SQLSrv1
- the service definition that uses the command checknt!USEDDISKSPACE
- the command definition for checknt
Re: Weird behavior with clustered drives Nagios was monitoring
Box293 wrote: the host definition for SQLSrv1
define host{
    use                 windows-server
    host_name           SQLSrv1
    alias               SQLSrv1
    address             /IP address of SQLSrv1/
    statusmap_image     server.png
}
Box293 wrote: the service definition that uses the command checknt!USEDDISKSPACE
This is a selection of the clustered disk monitoring definitions. They all look like this.
define service{
    use                     generic-service
    host_name               SQLSrv1
    service_description     L:\ Clustered Disk
    check_command           check_nt!USEDDISKSPACE!-l l -w 90 -c 95
}
define service{
    use                     generic-service
    host_name               SQLSrv1
    service_description     N:\ TempDB1
    check_command           check_nt!USEDDISKSPACE!-l n -w 83 -c 91.5
}
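For readers following along, here is a rough sketch of how the `-w` and `-c` values above are applied, as far as I understand the behavior of `check_nt` with USEDDISKSPACE: the thresholds are percentages of used space, and a drive letter the agent cannot report a size for ends up UNKNOWN no matter what the thresholds are. This is an illustration in Python, not the plugin's actual code, and `classify_used_disk_space` is a hypothetical name.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_used_disk_space(total_bytes, free_bytes, warn_pct, crit_pct):
    """Return (state, message) for a used-disk-space check.

    A drive with no usable size information (for example, a letter that does
    not exist on the node being queried) is reported as UNKNOWN.
    """
    if total_bytes <= 0 or free_bytes < 0:
        return UNKNOWN, "Free disk space: Invalid Drive"

    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    if used_pct >= crit_pct:
        state = CRITICAL
    elif used_pct >= warn_pct:
        state = WARNING
    else:
        state = OK
    return state, f"{used_pct:.1f}% used"
```

With the L: definition above (`-w 90 -c 95`), a drive that is half full would come back OK, while a drive the agent cannot see at all would come back UNKNOWN with the "Invalid Drive" text, which matches what the original post describes.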
Box293 wrote: the command definition for checknt
# 'check_nt' command definition
define command{
    command_name    check_nt
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
}
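To make it easier to see what actually gets executed, here is a small sketch of how Nagios turns a `check_command` value into a command line: the string is split on `!`, the first piece names the command, and the remaining pieces fill `$ARG1$`, `$ARG2$`, and so on. The helper name `expand_check_command` is hypothetical, and the `$USER1$` path and the address used in the example are assumptions (the real values come from resource.cfg and the host definition).

```python
def expand_check_command(check_command, command_line, host_address, user1):
    """Expand a check_command string against a command_line template.

    Returns (command_name, expanded_command_line).
    """
    # "check_nt!USEDDISKSPACE!-l l -w 90 -c 95" -> name plus two arguments
    command_name, *args = check_command.split("!")

    expanded = command_line.replace("$USER1$", user1)
    expanded = expanded.replace("$HOSTADDRESS$", host_address)
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return command_name, expanded


# Example values: the template and check_command are from this thread; the
# plugin directory and the 192.0.2.10 address are placeholders.
TEMPLATE = "$USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$"
name, line = expand_check_command(
    "check_nt!USEDDISKSPACE!-l l -w 90 -c 95",
    TEMPLATE,
    host_address="192.0.2.10",
    user1="/usr/local/nagios/libexec",
)
print(line)
```

Running the printed command by hand from the Nagios server is a quick way to see the raw plugin output for a single drive.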
define host{
    name                    windows-server      ; The name of this host template
    use                     generic-host        ; Inherit default values from the generic-host template
    check_period            24x7                ; By default, Windows servers are monitored round the clock
    check_interval          1                   ; Actively check the server every 1 minute
    retry_interval          0.5                 ; Schedule host check retries at 0.5 minute intervals
    max_check_attempts      2                   ; Check each server 2 times (max)
    check_command           check-host-alive    ; Default command to check if servers are "alive"
    notification_period     24x7                ; Send notification out at any time - day or night
    notification_interval   10                  ; Resend notifications every 10 minutes
    notification_options    d,r                 ; Only send notifications for specific host states
    contact_groups          admins              ; Notifications get sent to the admins by default
    hostgroups              2                   ; Host groups that Windows servers should be a member of
    register                0                   ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
define service{
    name                            generic-service ; The 'name' of this service template
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 0               ; Default is to NOT check service 'freshness'
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           1               ; Service event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0               ; The service is not volatile
    check_period                    24x7            ; The service can be checked at any time of the day
    max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
    normal_check_interval           5               ; Check the service every 5 minutes under normal conditions
    retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
    contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
    notification_options            u,c,r           ; Send notifications about unknown, critical, and recovery events
    notification_interval           10              ; Re-notify about service problems every 10 minutes
    notification_period             24x7            ; Notifications can be sent out at any time
    register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
Re: Weird behavior with clustered drives Nagios was monitoring
If you restart the NSClient service on your SQL nodes, do the drives get picked up then?
I have a vague recollection of drive letter issues on an old win2k sql cluster where the drive letters didn't show up for an already logged in session until logout/login.
Quick edit: This may apply:
http://nsclient.org/nscp/wiki/guides/clusters
Re: Weird behavior with clustered drives Nagios was monitoring
Awesome tip, Millisa, thanks! Let us know what you find out, logic.
Re: Weird behavior with clustered drives Nagios was monitoring
Thank you all for the replies. Unfortunately this did not seem to work for me. I'm still seeing the Unknown error on all the drives on this server.
Re: Weird behavior with clustered drives Nagios was monitoring
Are these mapped drives? I ask because they cannot be checked the normal way unless the user is logged in.