| Solution: |
This solution has been proven to work in the following very common scenario, although it should also be possible to deploy it in an n-node
environment:
* master-master cluster (2 nodes); that is, both nodes run applications
* load-balanced
Some definitions for such a scenario, among others:
* IP address for node A (IPa)
* IP address for node B (IPb)
* IP address for the cluster itself (IPc)
* cluster services (e.g. Oracle instances, exported filesystems, etc.)
* host services (e.g. system load and local filesystems such as /, /var, /opt, depending on
how you partitioned the hard disk or volume).
Remote network checks run from the Nagios box, such as PING, check_http, check_ftp and check_tcp!port,
don't pose a problem. In my case, I had to come up with something else to check cluster services
such as shared storage and Oracle instances, because check_cluster segfaulted and check_cluster2 *always* returned "ok"
(maybe this is a design philosophy).
Because the cluster nodes have multiple IP addresses, you need to tell NRPE to listen on all of them rather than bind
to a single IP address. This is the relevant part of nrpe.cfg:
---
# SERVER ADDRESS
# Address that nrpe should bind to in case there are more than one interface
# and you do not want nrpe to bind on all interfaces.
# NOTE: This option is ignored if NRPE is running under either inetd or xinetd
#server_address=your.ip.address.here
---
Hint: comment out 'server_address' for this to work. If you leave its value blank, NRPE won't even start.
Then define *both* cluster and host services in nrpe.cfg. Something like this:
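A quick way to verify the hint above before restarting NRPE is to classify the 'server_address' line in the file. This is only a minimal sketch: the function name server_address_state and the sample text are my own, not part of NRPE.

```python
def server_address_state(cfg_text):
    """Classify the server_address directive found in nrpe.cfg text.

    Returns "unset" when the directive is absent or commented out,
    "blank" when it is present with an empty value, and "bound" otherwise.
    """
    state = "unset"
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments do not configure anything
        key, sep, value = line.partition("=")
        if sep and key.strip() == "server_address":
            state = "bound" if value.strip() else "blank"
    return state


# Hypothetical sample: the directive is commented out, as recommended above.
sample = "#server_address=your.ip.address.here\nserver_port=5666\n"
print(server_address_state(sample))  # prints: unset
```

For the setup described here you want "unset" on every node; "blank" is the case that keeps NRPE from starting.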
---
# The following 6 lines are for host (node-specific) services:
#
# we want to monitor /, /var, zombie procs, system load, users on the system and total amount of procs.
#
command[check_disk_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/rootvol
command[check_disk_var]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/var
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_users]=/usr/local/nagios/libexec/check_users -w 50 -c 75
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 1000 -c 1200
#
# These lines define the cluster services
#
# a lot of check_disk stuff...
#
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai101
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai102
command[check_disk3]=/usr/local/nagios/libexec/check_disk -w 8% -c 2% -p /dev/vx/dsk/vgai1dg/pvgai103
.
.
.
# and some Oracle instances, for the sake of completeness
command[check_oracle_YOUR_ORACLESID1]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID1
command[check_oracle_YOUR_ORACLESID2]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID2
---
This part of nrpe.cfg must be the same on all cluster nodes.
In this case, we took special care to configure both nodes the same way; even the device names
and mount points are identical. If that is not your case, don't worry: it will still work, but you must pay close
attention to configuring everything correctly, otherwise you'll get a lot of false positives.
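Since differences between the nodes are the usual source of those false positives, it may help to compare the command[...] definitions of both nrpe.cfg files. The following is a rough sketch, assuming you have already copied both files somewhere and read them as text; parse_commands and diff_commands are names I made up, and the second device path is hypothetical.

```python
import re

# Matches lines such as: command[check_load]=/path/to/check_load -w ... -c ...
# Commented-out lines start with '#', so they never match.
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def parse_commands(cfg_text):
    """Return {command_name: command_line} for every active command[] entry."""
    commands = {}
    for line in cfg_text.splitlines():
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2)
    return commands


def diff_commands(cfg_a, cfg_b):
    """Return the sorted names of commands that are missing or differ between nodes."""
    a, b = parse_commands(cfg_a), parse_commands(cfg_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


node_a = (
    "command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20\n"
    "command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai101\n"
)
node_b = (
    "command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20\n"
    # hypothetical: node B mounts the same volume under another device name
    "command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/otherdg/vol01\n"
)
print(diff_commands(node_a, node_b))  # prints: ['check_disk1']
```

An empty list means the command sections match. A non-empty one is not necessarily an error (device names may legitimately differ), but each entry deserves a second look.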
Now it's time to define some things in hosts.cfg and services.cfg (and serviceextinfo.cfg if you use it):
* hosts.cfg:
------------
define host{
use generic-host ; Name of host template to use
host_name node-A
alias MYCLUSTER (node 1)
address IPa ; node A real IPv4 address
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
}
define host{
use generic-host ; Name of host template to use
host_name node-B
alias MYCLUSTER (node 2)
address IPb ; node B real IPv4 address
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
}
define host{
use generic-host ; Name of host template to use
host_name mycluster
alias MYCLUSTER
address IPc ; cluster virtual IPv4 address
check_command check-host-alive
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
}
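The three host definitions above differ only in host_name, alias and address, so for clusters with more nodes they could be generated instead of typed. This is just an illustrative sketch: render_hosts is a hypothetical helper, and the 192.0.2.x addresses are documentation placeholders standing in for IPa, IPb and IPc.

```python
# Same directives and values as the hand-written definitions above.
HOST_TEMPLATE = """define host{{
    use                     generic-host
    host_name               {host_name}
    alias                   {alias}
    address                 {address}
    check_command           check-host-alive
    max_check_attempts      10
    notification_interval   120
    notification_period     24x7
    notification_options    d,u,r
    }}"""


def render_hosts(hosts):
    """Render one host definition per (host_name, alias, address) tuple."""
    return "\n\n".join(
        HOST_TEMPLATE.format(host_name=name, alias=alias, address=address)
        for name, alias, address in hosts
    )


hosts = [
    ("node-A", "MYCLUSTER (node 1)", "192.0.2.11"),  # placeholder for IPa
    ("node-B", "MYCLUSTER (node 2)", "192.0.2.12"),  # placeholder for IPb
    ("mycluster", "MYCLUSTER", "192.0.2.10"),        # placeholder for IPc
]
print(render_hosts(hosts))
```

The cluster entry is deliberately an ordinary host: Nagios does not need to know that its address is a virtual one.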
* services.cfg:
---------------
For node-A (just some services in this example):
define service{
use generic-service ; Name of service template to use
host_name node-A
service_description SysLoad
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_nrpe!check_load
}
define service{
use generic-service ; Name of service template to use
host_name node-A
service_description ROOT_DISK
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_nrpe!check_disk_root
}
Proceed the same way for node-B or just define the services you want to check.
For the cluster (also just some service examples...):
define service{
use generic-service ; Name of service template to use
host_name mycluster
service_description SysLoad
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_nrpe!check_load
}
define service{
use generic-service ; Name of service template to use
host_name mycluster
service_description ORACLE_MYORACLESID
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options c,r
check_command check_nrpe!check_oracle_YOUR_ORACLESID1
}
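Every name that follows check_nrpe! in services.cfg has to exist as a command[...] entry in nrpe.cfg, otherwise NRPE will answer that the command is not defined. The check below is a small sketch of that cross-check, assuming your check_nrpe command passes its first argument to NRPE as the command name; the function names are my own.

```python
import re


def nrpe_commands(nrpe_cfg_text):
    """Names of all active command[...] entries in nrpe.cfg."""
    return set(re.findall(r"^\s*command\[([^\]]+)\]\s*=", nrpe_cfg_text, re.MULTILINE))


def referenced_commands(services_cfg_text):
    """NRPE command names used by check_command lines in services.cfg."""
    return set(
        re.findall(r"^\s*check_command\s+check_nrpe!([^!\s]+)", services_cfg_text, re.MULTILINE)
    )


def missing_commands(nrpe_cfg_text, services_cfg_text):
    """Commands referenced from services.cfg that nrpe.cfg does not define."""
    return sorted(referenced_commands(services_cfg_text) - nrpe_commands(nrpe_cfg_text))


nrpe_cfg = "command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20\n"
services_cfg = (
    "check_command check_nrpe!check_load\n"
    "check_command check_nrpe!check_disk_root\n"
)
print(missing_commands(nrpe_cfg, services_cfg))  # prints: ['check_disk_root']
```

Note that the comparison is case-sensitive, so a command defined with an upper-case SID will not match a lower-case reference.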
After all these steps, you will end up with 3 machines (or more, depending on how many nodes your cluster has)
on the web interface:
* node-A
checking for SysLoad
checking for ROOT_DISK
* node-B
checking for AnotherService_1
checking for AnotherService_2
* mycluster
checking for SysLoad
checking for ORACLE_MYORACLESID
|