Monitor sites on nodes

amprantino
Posts: 140
Joined: Thu Apr 18, 2013 8:25 am
Location: libexec

Monitor sites on nodes

Post by amprantino »

Hello everyone,

I have a cluster composed of two nodes.
All users normally access the various sites through the cluster IP.
I host various sites on each node. All DNS/vhost point to the cluster IP.

How can I monitor the availability of the sites on each node?

Regards
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Monitor sites on nodes

Post by mcapra »

I would think step 0 would be defining what "availability" looks like.

If the sites are available locally on those machines, you could use check_http coupled with an agent like NRPE/NCPA to attempt to reach them through localhost.

But that doesn't guarantee that those *specific* nodes are reachable through the VIP or DNS. That's a much trickier problem to solve and the solution depends on a whole lot of unknown things regarding this setup. You'd need some sort of reliable way to coerce the VIP/DNS to resolve to specific nodes based on something provided by the client. HTTP header info, POST/GET params, etc.

I have a few central services powered by Kubernetes, which I assume is a comparable problem to yours. I have orchestration in our k8s deployments that makes sure each container supporting a specific microservice exposes relevant KPIs to our monitoring solution. Stuff like "how many requests I've serviced", "uptime", etc. If the "uptime" for a particular container is beyond a few minutes and it hasn't serviced a single request, we'd expect k8s to scale down. If k8s doesn't scale down, either the container is hung or something about the deployment is wrong -- either way, we get notified and can go check it out.

Replace "uptime" and "requests serviced" with whatever KPIs are relevant for your particular setup and have Nagios poll them periodically.
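As a rough illustration of that idea, here is a minimal sketch of the decision logic such a check could apply. The function name, the 300-second grace period, and the two input values are assumptions for illustration only; how the KPIs are actually collected depends entirely on the setup.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_kpis(uptime_seconds, requests_served, grace_seconds=300):
    """Return (exit_code, message) for one node or container.

    Hypothetical rule modelled on the description above: an instance that
    has been up longer than the grace period but has served no requests
    is treated as a problem.
    """
    if uptime_seconds < 0 or requests_served < 0:
        return UNKNOWN, "UNKNOWN - KPI values must be non-negative"
    if uptime_seconds > grace_seconds and requests_served == 0:
        return CRITICAL, f"CRITICAL - up {uptime_seconds}s but no requests served"
    return OK, f"OK - up {uptime_seconds}s, {requests_served} requests served"
```

A real plugin would print the message and exit with the returned code so that Nagios can pick up the state.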
Former Nagios employee
https://www.mcapra.com/
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Monitor sites on nodes

Post by ssax »

Thanks mcapra!

Even though everybody usually accesses the sites through the VIP/DNS names, each node should still have its own unique IP address that you could use to monitor those specific sites on those machines.

I think mcapra is correct, we need to know what specifically you're trying to monitor with the sites. Do the sites flop between the cluster nodes or do they stay on their specific node? Are you just trying to determine which node they are on or what specifically?
amprantino
Posts: 140
Joined: Thu Apr 18, 2013 8:25 am
Location: libexec

Re: Monitor sites on nodes

Post by amprantino »

Thanks for your reply. mcapra: I am configuring servers/services using exactly the same philosophy (no Kubernetes though).

The individual nodes are checked by the load balancers (F5) for their availability, based on various metrics.
If a node is not responding, it is removed from the pool. That works fine for the end users.
Connections are shared using various algorithms among the nodes.

On the other side, I would like to get alerts from Nagios if there is a problem on a node.
Hence, I am searching for a solution to check all virtual hosts on each of the nodes.
As an example, I would like to check site1.mydomain.com on both Node1 and Node2.

At the moment, I would like to avoid installing NRPE/NSCA and I would like to check the sites externally.
The actual problem is how to translate VIP/DNS names to the node IP so I can monitor the correct site.

check_http manual says:

-I, --IP-address=ADDRESS
IP address or name (use numeric address if possible to bypass DNS lookup).

Probably a combination of DNS & IP will do the work.
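To show what combining a host name with a numeric address amounts to, here is a minimal Python sketch. It is not how check_http is implemented; it only illustrates the idea of opening the TCP connection to a node's own IP while still presenting the site name through SNI and the Host header. The function names are hypothetical.

```python
import socket
import ssl


def build_request(site, path):
    """Build a minimal HTTP/1.1 GET whose Host header names the virtual host."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {site}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_status_line(node_ip, site, path="/", timeout=10.0):
    """Connect to node_ip directly, but ask for `site` (no DNS lookup involved)."""
    context = ssl.create_default_context()
    with socket.create_connection((node_ip, 443), timeout=timeout) as raw:
        # server_hostname sets the SNI name and is also the name the
        # server certificate is verified against.
        with context.wrap_socket(raw, server_hostname=site) as tls:
            tls.sendall(build_request(site, path))
            response = b""
            while b"\r\n" not in response:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                response += chunk
    # The first line of the response carries the HTTP status.
    return response.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling fetch_status_line once per node IP with the same site name would exercise the same virtual host on each node separately.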
amprantino
Posts: 140
Joined: Thu Apr 18, 2013 8:25 am
Location: libexec

Re: Monitor sites on nodes

Post by amprantino »

I am manually checking the following:

Node1: ./check_http --ssl --sni -E -t 10 -p 443 -H site1.mydomain.com -I 10.128.5.53 -u "/test123.txt"
Node2: ./check_http --ssl --sni -E -t 10 -p 443 -H site1.mydomain.com -I 10.128.5.54 -u "/test123.txt"

This seems to work as expected.
The next step is a PHP page that performs all the necessary checks and replies with "Site_OK" if everything is OK.
check_http will match that text and report an OK state back to Nagios.

As an example:

Node2: ./check_http --ssl --sni -E -t 10 -p 443 -H site1.mydomain.com -I 10.128.5.54 -u "/test123.txt" -r "Site_OK"

HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.015 second response time |time=0.014818s;;;0.000000 size=247B;;;0 time_connect=0.002776s;;; time_ssl=0.010134s;;; time_headers=0.000038s;;; time_firstbyte=0.001744s;;; time_transfer=0.001750s;;;
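
The manual per-node commands above can be generated for every site and node pair instead of being typed by hand. This is only a sketch: the helper names are hypothetical, and the flags are simply the ones used in the commands above.

```python
import shlex


def build_check_http_args(site, node_ip, path="/", expect_regex=None, timeout=10):
    """Return the argument list for one check_http run against one node."""
    args = [
        "./check_http", "--ssl", "--sni", "-E",
        "-t", str(timeout), "-p", "443",
        "-H", site,      # virtual host name to request
        "-I", node_ip,   # node address to connect to, bypassing DNS
        "-u", path,
    ]
    if expect_regex is not None:
        args += ["-r", expect_regex]
    return args


def commands_for(sites, nodes, path="/", expect_regex=None):
    """Yield (node_name, site, shell_command) for every site on every node."""
    for node_name, node_ip in nodes.items():
        for site in sites:
            args = build_check_http_args(site, node_ip, path, expect_regex)
            yield node_name, site, shlex.join(args)


nodes = {"Node1": "10.128.5.53", "Node2": "10.128.5.54"}
for node, site, cmd in commands_for(["site1.mydomain.com"], nodes, "/test123.txt", "Site_OK"):
    print(f"{node}: {cmd}")
```

The same pairs could just as well be turned into Nagios service definitions, one service per site per node.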
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Monitor sites on nodes

Post by ssax »

You might look at using the check_cluster plugin.

1. Make sure you are already monitoring the sites individually on every node, as you are doing with check_http (you can disable notifications for those checks). These service checks must exist because the check_cluster plugin reads their states. I'm using the PING service in the example below since I don't know the names of your services.

2. Create a new command:
- Command Name: check_service_cluster
- Command Line: $USER1$/check_cluster --service -l $ARG1$ -w $ARG2$ -c $ARG3$ -d '$ARG4$'
- Command Type: check command

3. Create the service cluster check:
- Description: Site_Cluster_Check
- Check command: check_service_cluster
- $ARG1$: Site_Cluster
- $ARG2$: 3 <- Setting this to 1 higher than total number of nodes stops it from generating a WARNING
- $ARG3$: 1 <- Setting this to 1 less than total nodes will generate a CRITICAL if ALL of the services are in problem state
- $ARG4$: $SERVICESTATEID:yourhost1:PING$,$SERVICESTATEID:yourhost2:PING$

NOTE: You'll need to change the hostnames and the service descriptions in $ARG4$, they need to be exact (case sensitive).

With these thresholds, the cluster check generates a CRITICAL only when the service is down on all of the nodes. check_cluster uses the states of the individual service checks on both nodes to determine whether there is an issue.
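
To make the thresholds concrete, here is a minimal sketch of the decision this setup relies on. It is not the check_cluster source; it assumes that a plain integer threshold means "alert when the number of non-OK services exceeds this value", which matches the comments on $ARG2$ and $ARG3$ above.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def cluster_state(service_state_ids, warn, crit):
    """Combine per-node service state IDs (0 = OK) into one cluster state."""
    non_ok = sum(1 for state in service_state_ids if state != OK)
    if non_ok > crit:
        return CRITICAL
    if non_ok > warn:
        return WARNING
    return OK


# Two nodes, warn=3 and crit=1 as in the example arguments.
print(cluster_state([0, 0], warn=3, crit=1))  # both nodes OK
print(cluster_state([2, 0], warn=3, crit=1))  # one node in a problem state
print(cluster_state([2, 2], warn=3, crit=1))  # both nodes in a problem state
```

Under that assumption a single failing node leaves the cluster check OK, so the individual per-node checks remain the place to alert on a single-node problem.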

Please read here for more information:

https://assets.nagios.com/downloads/nag ... sters.html