
Title: How to monitor cluster servers using NRPE
FAQ ID: F0305
Submitted By: Martin Mielke
Last Updated: 04/14/2009

Description: How to monitor cluster servers using NRPE

Solution:
This solution has been proven to work in a very common scenario, although it should also be possible to deploy it in an
n-node environment:

	* master-master cluster (2 nodes); that is, both nodes run applications
	* load-balanced

Some definitions for such a scenario, among others (example values are sketched after this list):

	* IP address for node A (IPa)
	* IP address for node B (IPb)
	* IP address for the cluster itself (IPc)
	* cluster services (e.g. Oracle instances, exported filesystems, etc.)
	* host services (e.g. system load, local filesystems such as /, /var, /opt, etc., depending on
	  how you partitioned your disks or volumes).
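
For concreteness, here is a hypothetical mapping of those placeholders to made-up addresses; substitute your own
values wherever IPa, IPb and IPc appear below:

---
# Hypothetical example values for the placeholders used in this FAQ:
#   IPa = 192.168.10.1    (node A, real address)
#   IPb = 192.168.10.2    (node B, real address)
#   IPc = 192.168.10.10   (cluster virtual/service address)
---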

Remote checks that use TCP connections from the Nagios box, such as PING, check_http, check_ftp or check_tcp!port,
are not a problem. In my case I had to come up with something for the cluster services (shared storage, Oracle
instances, etc.) because check_cluster segfaulted and check_cluster2 *always* returned "ok"
(maybe that is by design).

Because the cluster nodes have multiple IP addresses, you need to tell NRPE to listen on all interfaces rather than
bind to a single IP address. The following excerpt is taken from nrpe.cfg:

---
# SERVER ADDRESS
# Address that nrpe should bind to in case there are more than one interface
# and you do not want nrpe to bind on all interfaces.
# NOTE: This option is ignored if NRPE is running under either inetd or xinetd

#server_address=your.ip.address.here
---

Hint: leave 'server_address' commented out for this to work. If you set it to an empty value, NRPE won't even start.
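
Once NRPE has been restarted on the nodes, a quick way to confirm it is listening on all interfaces is to look for the
NRPE port (5666 by default; adjust if you changed server_port). The exact output format varies between platforms:

---
# Run on a cluster node; expect the listener bound to the wildcard address
# (e.g. 0.0.0.0:5666 or *.5666), not to a single IP.
netstat -an | grep 5666
---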

Then define *both* the cluster services and the host services in nrpe.cfg. Something like this:

---
# The following 6 lines are for host (node-specific) services:
#
# we want to monitor /, /var, zombie procs, system load, users on the system and total amount of procs.
#
command[check_disk_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/rootvol
command[check_disk_var]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/var
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_users]=/usr/local/nagios/libexec/check_users -w 50 -c 75
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 1000 -c 1200

#
# These lines define the cluster services
#
# a lot of check_disk stuff...
#
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai101
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai102
command[check_disk3]=/usr/local/nagios/libexec/check_disk -w 8% -c 2% -p /dev/vx/dsk/vgai1dg/pvgai103
	.
	.
	.

# and some Oracle instances, for the sake of completeness
command[check_oracle_YOUR_ORACLESID1]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID1
command[check_oracle_YOUR_ORACLESID2]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID2
---

This part of nrpe.cfg must be the same on all cluster nodes.
In this case we took special care to configure both nodes identically, so even the device names and mount points
are the same. If that is not your case, don't worry: it will still work, but you must pay close attention to
configuring everything correctly, otherwise you will get a lot of false positives.
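
The service definitions below use check_nrpe!<command>, which assumes a check_nrpe command object on the Nagios side.
If you do not already have one (for example in commands.cfg or checkcommands.cfg), a typical definition looks like
this; $USER1$ normally points to your plugin directory via resource.cfg:

---
# Typical check_nrpe command definition on the Nagios server
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
---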

Now it's time to define a few things in hosts.cfg and services.cfg (and serviceextinfo.cfg if you use it):

* hosts.cfg: 
------------

define host{
        use                     generic-host            ; Name of host template to use

        host_name               node-A
        alias                   MYCLUSTER (node 1)
        address                 IPa			<-- node A real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

define host{
        use                     generic-host            ; Name of host template to use

        host_name               node-B
        alias                   MYCLUSTER (node 2)
        address                 IPb			<-- node B real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }
	
define host{
        use                     generic-host            ; Name of host template to use

        host_name               mycluster
        alias                   MYCLUSTER
        address                 IPc			<-- cluster virtual IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }
	

* services.cfg:
---------------

For node-A (just some services in this example):

define service{
        use                             generic-service         ; Name of service template to use

        host_name                       node-A
        service_description             SysLoad
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_nrpe!check_load
}

define service{
        use                             generic-service         ; Name of service template to use

        host_name                       node-A
        service_description             ROOT_DISK
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_nrpe!check_disk_root
}

Proceed the same way for node-B or just define the services you want to check.

For the cluster (also just some service examples...):

define service{
        use                             generic-service         ; Name of service template to use

        host_name                       mycluster
        service_description             SysLoad
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_nrpe!check_load
}

define service{
        use                             generic-service         ; Name of service template to use

        host_name                       mycluster
        service_description             ORACLE_MYORACLESID
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  admins
        notification_interval           120
        notification_period             24x7
        notification_options            c,r
        check_command                   check_nrpe!check_oracle_myoraclesid
}
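
Before reloading Nagios, it is worth running the NRPE checks by hand from the Nagios box to confirm that the
node-specific commands answer on the real addresses and the cluster commands answer on the virtual address
(replace IPa/IPc with your actual addresses; paths assume the default /usr/local/nagios install used above):

---
# Node-specific check against node A's real address
/usr/local/nagios/libexec/check_nrpe -H IPa -c check_disk_root

# Cluster check against the cluster's virtual address
/usr/local/nagios/libexec/check_nrpe -H IPc -c check_oracle_YOUR_ORACLESID1
---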


After all these steps you will end up with 3 (or more, depending on how many nodes your cluster is made of) hosts
in the web interface:

	* node-A
		checking for SysLoad
		checking for ROOT_DISK

	* node-B
		checking for AnotherService_1
		checking for AnotherService_2

	* cluster
		checking for SysLoad
		checking for ORACLE_MYORACLESID
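
If a host or service does not show up as expected, a configuration check usually points at the culprit before you
restart Nagios (the path below assumes a default source install; adjust it to your layout):

---
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
---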


Please send any comments or corrections to martin@mielke.com__but_remove_this_crap_first_:-)
 

Keywords: nrpe cluster monitoring