Active/Passive HA cluster nodes

tviset
Posts: 7
Joined: Tue Dec 03, 2019 9:30 am

Active/Passive HA cluster nodes

Post by tviset »

Hi all,

I am fairly new to Nagios, but impressed with the capabilities.

However, I am trying to get some things done, but I just can't seem to get it going.

I have Nagios installed on a cloud-based server that does nothing other than monitor a set of hosts behind a VPN, so the monitored servers are not in the same network as the Nagios server and are not continuously reachable from it. The monitored hosts are behind a firewall and configured to send information out to the Nagios server (not to receive any). Therefore I configured the monitored hosts as passive checks only, using NCPA to send information to Nagios.
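
For reference, the passive sending itself is configured in the main ncpa.cfg on each node. A rough sketch of the relevant sections, assuming the NRDP handler and NCPA 2.x option names (the parent URL and token are placeholders; the hostname set here is what %HOSTNAME% expands to in the check definitions further down):

[passive]
handlers = nrdp
sleep = 300

[nrdp]
parent = https://my-nagios-server/nrdp/
token = mysecrettoken
hostname = servername01
# on servername02 this is set to hostname = servername02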

Two of the monitored hosts are clustered using Corosync and DRBD; a third host is a standalone machine.
In the cluster, I have 4 named services:
- NGINX
- MySQL
- shared disk space (which moves, together with the other named services, to the active node when the HA cluster fails over from one node to the other)
- an application system (a product manufactured by us)

The nodes have individual hostnames (let's call them servername01 and servername02) and IP addresses and the cluster has a hostname (let's call that servername) and a virtual IP address.

Due to the nature of the Corosync/DRBD cluster configuration, the cluster node (called servername) is actually whichever physical node is active at that moment, so either servername01 or servername02, depending on which one is master and which is slave, but it is always reachable using the cluster name servername.
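
To make the naming concrete, on the Nagios side this ends up as three separate host entries, roughly like the Core object sketch below (template and addresses are placeholders; in XI these would normally be created through the CCM or a wizard):

define host {
    use        generic-host
    host_name  servername01
    address    10.0.0.11    ; physical node 1
}

define host {
    use        generic-host
    host_name  servername02
    address    10.0.0.12    ; physical node 2
}

define host {
    use        generic-host
    host_name  servername
    address    10.0.0.10    ; cluster virtual IP, follows the active node
}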

I want to
- monitor the resources of both clustered nodes individually (disk usage, mem usage, etc. for servername01 and servername02)
- monitor the status of the cluster (active node, passive node, the configured named services, etc. for servername)
- monitor the status of the software running in the named services individually (NGINX is healthy, MySQL is healthy, etc.)
- monitor certain processes running in the named service running our application system on servername (the named service of our application system could be running while one or more specific processes within the application system have stopped and I want to see that in Nagios)

I installed NCPA on both cluster nodes, each with its own ncpa.cfg in /usr/local/ncpa/etc (one on servername01 and one on servername02), and Nagios receives and shows disk usage, memory usage, etc. for both. On both machines I have a config file in /usr/local/ncpa/etc/ncpa.cfg.d for these checks:
[passive checks]
%HOSTNAME%|__HOST__ = system/agent_version
%HOSTNAME%|Disk Usage = disk/logical/|/used_percent --warning 80 --critical 90 --units Gi
%HOSTNAME%|CPU Usage = cpu/percent --warning 60 --critical 80 --aggregate avg
%HOSTNAME%|Swap Usage = memory/swap --warning 60 --critical 80 --units Gi
%HOSTNAME%|Memory Usage = memory/virtual --warning 80 --critical 90 --units Gi
%HOSTNAME%|Process Count = processes --warning 400 --critical 600

In addition, I installed plugins on both nodes that show the status of the DRBD blocks for that node (primary or secondary, synced, UpToDate, etc.). These are plugins I downloaded from exchange.nagios.org, and Nagios shows this information fine.

%HOSTNAME%|DRBD status 1 = plugins/nagios.drbd.sh?args=0
%HOSTNAME%|DRBD status All = plugins/check_drbd?args=-d 0,1,2,3
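
For completeness: NCPA runs these from its own plugin directory, so roughly what I did to put them in place was (paths per a default Linux NCPA install, adjust if yours differs):

cp nagios.drbd.sh check_drbd /usr/local/ncpa/plugins/
chmod +x /usr/local/ncpa/plugins/nagios.drbd.sh /usr/local/ncpa/plugins/check_drbd
# afterwards they are referenced as plugins/<name>?args=... in the passive check definitions above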

I installed the check_http plugin so I can monitor NGINX health easily. Since it connects to a hostname, it does not matter which node executes it, so it doesn't have to run on the active cluster node. The only downside is that both servername01 and servername02 show the NGINX status in Nagios, since both machines run the plugin.

%HOSTNAME%|NGINX Cluster = plugins/check_http?args=-H servername (the cluster hostname)

Now I need to check the status of MySQL in its named service on the active host (it is not available on the passive host), and I need to check whether certain processes are running in the named service that runs our application system, which is likewise only available on the active host.

I use the default NCPA API endpoint processes (filtered on the process name) to determine whether a process is running. I can only do that for the active cluster node, as the other node is the slave and does not have the named service with our application system running.

%HOSTNAME%|MyApplication = processes?name=myapp&critical=0
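
As a sanity check, the same endpoint can be queried directly against the NCPA listener on a node to see what the passive check will report (port 5693 and the token are the defaults from my local install, shown here as placeholders):

curl -k "https://servername01:5693/api/processes?name=myapp&token=mysecrettoken"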

Now here are my questions :)

- I tried the check_cluster plugin, but since my checks are passive, I cannot run it on the Nagios host and need to run it through NCPA on servername01 and servername02. How can I configure that?

- Can I run NCPA on servername01 and servername02 separately, and additionally configure NCPA to do the MySQL and process checks only for the active cluster node (servername)? NCPA starts before the cluster does, so it is already running on servername01 and servername02 by the time the cluster node (servername) comes up. The named services, however, are only available on the active node, so if I add these checks on both machines I will get errors on the slave node because it cannot find those services or processes, even though they may be running fine on the master node. The processes check on the slave node (e.g. servername01) would return 0 running processes, while the master node (e.g. servername02) would return 1 because the service runs there. It should only be critical if the check returns 0 on both machines, or more precisely when it returns 0 on the active cluster node, servername.

- I tried the plugins check_mysql and check_mysql_query, but they fail with: ./check_mysql: error while loading shared libraries: libmysqlclient.so.18. Doing a locate libmysql shows that the library exists, though.

I used check_mysql_health but that doesn't run either:
CRITICAL - cannot connect to information_schema. install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (@INC contains: . /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at (eval 13) line 3.

- How can I monitor the health of the cluster, its nodes, and the configured named services using NCPA?

I guess it's a lot, but help would really be appreciated.

Best regards, Theo
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Active/Passive HA cluster nodes

Post by ssax »

Configure the passives (on both nodes):


https://www.nagios.org/ncpa/help.php#passive
https://assets.nagios.com/downloads/ncpa/docs/Using-NCPA-For-Passive-Checks.pdf

For cluster checks (intended to check whether a service is running on at least one of the nodes, based on the results of other checks), do this:

a) Disable Notifications on those specific cluster services
b) check_cluster needs to be run from the Nagios server because it uses the status of other checks via macros that are ONLY found in the Nagios system (the Test Check Command button will not work for the check_cluster plugin; it can only resolve the statuses of the other services when running in the backend)

Here's a simple example for you:

1. Make sure that you are monitoring the services (PING in this example) on all servers. You can disable notifications for them; this is important so you don't get notifications when they are down. These individual service checks are what the check_cluster plugin will use, so they need to exist.

2. Create a new command:
- Command Name: check_service_cluster
- Command Line: $USER1$/check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d '$ARG4$'
- Command Type: check command

3. Create the service cluster check:
- Description: PING_Cluster
- Check command: check_service_cluster
- $ARG1$: PING_Cluster
- $ARG2$: 4 <- Set this to one MORE than your total number of services (3 services + 1 = 4) - We don't care about warnings in this example
- $ARG3$: 2 <- Set this to one LESS than your total number of services (3 services - 1 = 2)
- $ARG4$: $SERVICESTATEID:yourhost1:PING$,$SERVICESTATEID:yourhost2:PING$,$SERVICESTATEID:yourhost3:PING$

NOTE: The hostname and the service description in $ARG4$ need to be exact (case sensitive).

The way this works is that whenever that service is not running on ANY of the nodes, it generates a CRITICAL. check_cluster uses the statuses of all the individual service checks to determine whether there is an issue, and since you disabled notifications on those individual services you won't get their notifications; this cluster service is the one that does the notifying.
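
Put together in Core object syntax (in XI you would build the same thing through the CCM), the pieces above would look roughly like this; yourhost1/2/3 and the service template are placeholders from the example:

define command {
    command_name    check_service_cluster
    command_line    $USER1$/check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d '$ARG4$'
}

define service {
    use                     generic-service
    host_name               yourhost1
    service_description     PING_Cluster
    check_command           check_service_cluster!PING_Cluster!4!2!$SERVICESTATEID:yourhost1:PING$,$SERVICESTATEID:yourhost2:PING$,$SERVICESTATEID:yourhost3:PING$
}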

Please read here for more information:

https://assets.nagios.com/downloads/nag ... sters.html

So really, just set up NCPA on each node with the same passives (with different hostnames in Nagios). For non-cluster resources (disk space, etc.) on the individual nodes, just use the regular results. Use the check_cluster plugin on the XI side to check your cluster services based on that output (still check the cluster resources on each system, but use check_cluster as intended so it alerts when a service is not running on any of the nodes).

---

For your mysql issue:


yum -y install perl-DBD-MySQL
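
If it helps, a quick way to confirm the Perl driver is visible afterwards (a generic Perl one-liner, nothing Nagios-specific):

perl -MDBD::mysql -e 'print "DBD::mysql $DBD::mysql::VERSION\n"'
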
Let me know if you have any questions or if I can clarify anything.

Thank you!