Handling monitoring of cluster member servers.
Posted: Mon Mar 04, 2013 1:56 pm
Hi,
We run Nagios XI R1.6 to monitor servers and network gear in one of our datacenters.
We actually run two datacenters in an active/active configuration, with a Nagios XI instance in each, but the answer to this question should probably work for both.
Across both sites we operate a geographically dispersed Microsoft SQL cluster. That is, in Site A we have two SQL servers clustered together, where Server A:1 is normally active and Server A:2 is normally standby.
In Site B we also have two SQL servers clustered together, where Server B:1 is standby and Server B:2 is also standby.
We perform disk replication at the SAN level via dark fibre between the two sites, so if a site fails we lose no data, since the cluster makes the secondary active. Should Site A fail entirely, our SAN controller detects this via an agent on each server and instructs Server B:1 to come up. It also handles the shared disks and which server has access at any given time.
Anyway, the question.
We monitor all the servers in both clusters, but obviously only one server has access to the disks at any one time. On the servers without access, the disk checks fail in Nagios with "out of bounds 139 error", which is a classic Windows error of basically "not found" or "I couldn't do this so here is a number instead".
As we also monitor whether or not the SQL process and SQL Agent are running, we also get errors for those on the standby servers saying they are down.
Is there a way in Nagios to alert only on state change for these?
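To make it concrete, the behaviour we are after is roughly the following. This is only a sketch of the logic, not something we have running: the function name, the node names and the idea of feeding it a simple running/standby flag per node are all made up for illustration. The only fixed part is the usual Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cluster_state(node_states, expected_active=1):
    """Collapse per-node SQL states into one cluster-level result.

    node_states is a hypothetical mapping of node name (e.g. "A:1") to
    True if SQL is running on that node, False if the node is on standby.
    Returns a (return_code, message) tuple.
    """
    if not node_states:
        return UNKNOWN, "no node data"

    # Names of the nodes where SQL is currently running.
    active = sorted(name for name, running in node_states.items() if running)

    if len(active) == expected_active:
        # Standby nodes being "down" is the normal case, so no alert.
        return OK, "active on " + ", ".join(active)
    if not active:
        return CRITICAL, "SQL is not running on any node"
    # More nodes active than expected, which should not happen.
    return WARNING, "unexpected active nodes: " + ", ".join(active)
```

So with A:1 running and A:2, B:1 and B:2 on standby this would come back OK, and it would also come back OK after a failover to B:1. It would only go CRITICAL if SQL were running nowhere.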
Cheers,
KB.