Handling monitoring of clusters member servers.

KiwiBloke · Post by **KiwiBloke** » Mon Mar 04, 2013 1:56 pm

Hi,

We run Nagios XI R1.6 to monitor servers and network gear in one of our datacenters.

We actually run two datacenters in an active/active configuration, with a Nagios XI instance in each, but the answer to this question should probably work for both.

Across both sites we operate a geographically dispersed Microsoft SQL cluster. That is, in Site A we have two SQL servers clustered together, where Server A:1 is normally active and Server A:2 is normally standby.

In Site B we also have two SQL servers clustered together, where Server B:1 and Server B:2 are both standby.

We perform disk replication at the SAN level via dark fibre between the two sites, so if a site fails we lose no data, since the cluster makes the secondary active. Should Site A fail entirely, our SAN controller detects this via an agent on each server and instructs Server B:1 to come up. It also handles the shared disks and which server has access at any given time.

Anyway, the question.

We monitor all the servers in both clusters, but obviously only one server has access to the disks at any one time. This results in an error in Nagios of "out of bounds 139 error", which is a classic Windows error meaning basically "not found" or "I couldn't do this so here is a number instead".

As we also monitor whether or not the SQL process and SQL agent are running, we also get errors for those on the standby servers saying they are down.

Is there a way in Nagios to alert only on state change for these?

Cheers,

KB.

abrist · Post by **abrist** » Mon Mar 04, 2013 2:51 pm

You can alert once and only once on state change by changing the "notification_interval" of the service to "0".
From: http://nagios.sourceforge.net/docs/3_0/ ... tions.html

notification_interval: This directive is used to define the number of "time units" to wait before re-notifying a contact that this service is still down or unreachable. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. If you set this value to 0, Nagios will not re-notify contacts about problems for this host - only one problem notification will be sent out.
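
As a rough illustration of that directive, a service definition might look like the fragment below. This is only a sketch: the host name `sql-a2`, the service description, and the `generic-service` template are assumptions for illustration, and the `check_nt` line assumes the stock NSClient++ based command from the sample Windows configuration. In Nagios XI the same setting would normally be changed through the web configuration interface rather than by editing files directly.

```
define service {
    use                     generic-service          ; assumed template name
    host_name               sql-a2                   ; hypothetical standby node
    service_description     SQL Server Process       ; hypothetical service name
    check_command           check_nt!PROCSTATE!-d SHOWALL -l sqlservr.exe
    notification_interval   0                        ; notify once, never re-notify
}
```

With `notification_interval` set to 0, a standby node whose SQL process is stopped produces a single problem notification instead of a repeating one.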

You could also look into event handlers and the Nagios external command pipe. With these you could create a script that gets run when the main db server goes into a problem state. This script would pass some commands to Nagios through the external command pipe to start monitoring the necessary services on the failover. Once the main server is back up, the script will be fired again through the event handler to stop monitoring the selected services on the failover and resume monitoring them on the main.

The latter approach is highly customizable but also significantly more complex to set up.
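
To give a feel for the event handler approach, here is a minimal Python sketch of such a script. It assumes the handler is attached to the primary host and is called with `$HOSTSTATE$` and `$HOSTSTATETYPE$` as arguments. The command file path is the usual default for a source install and may differ on your system, and the failover host name and service descriptions are hypothetical placeholders. It covers only the failover side; the matching commands for the primary's own services would be added the same way.

```python
#!/usr/bin/env python3
"""Sketch of a host event handler that toggles checks on a failover node."""
import sys
import time

# Assumption: default command file location for a source install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"
# Hypothetical names; replace with the ones from your own configuration.
FAILOVER_HOST = "sql-a2"
SERVICES = ["SQL Server Process", "SQL Agent Process", "Cluster Disk"]


def build_commands(host_state, state_type, now=None):
    """Return external command lines for a state change of the primary host.

    Only HARD states are acted on, so a single missed check does not flip
    monitoring back and forth.
    """
    if state_type != "HARD":
        return []
    if host_state == "DOWN":
        action = "ENABLE_SVC_CHECK"    # primary is down: watch the failover
    elif host_state == "UP":
        action = "DISABLE_SVC_CHECK"   # primary is back: stop watching it
    else:
        return []                      # e.g. UNREACHABLE: leave things alone
    timestamp = int(time.time()) if now is None else int(now)
    return [
        "[{}] {};{};{}\n".format(timestamp, action, FAILOVER_HOST, service)
        for service in SERVICES
    ]


def send(lines, path=COMMAND_FILE):
    """Write the command lines to the Nagios external command pipe."""
    with open(path, "w") as pipe:
        pipe.writelines(lines)


if __name__ == "__main__" and len(sys.argv) == 3:
    commands = build_commands(sys.argv[1], sys.argv[2])
    if commands:
        send(commands)
```

The script would be registered as a command and referenced from the primary host's `event_handler` directive. `ENABLE_SVC_CHECK` and `DISABLE_SVC_CHECK` are standard external commands; `ENABLE_SVC_NOTIFICATIONS` and `DISABLE_SVC_NOTIFICATIONS` could be used instead if you would rather keep the checks running and only silence the alerts.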

http://nagios.sourceforge.net/docs/3_0/ ... dlers.html
http://nagios.sourceforge.net/docs/3_0/extcommands.html
http://old.nagios.org/developerinfo/ext ... ndlist.php
http://exchange.nagios.org/directory/Ad ... er/details
