nagios HPC cluster event correlation

steelah1
Posts: 2
Joined: Tue Oct 09, 2012 6:35 pm

nagios HPC cluster event correlation

Post by steelah1 »

In monitoring an HPC cluster with 2500 nodes, is there an open source way to set up notification logic so that, if an identical event occurs across dozens, hundreds, or even all 2500 nodes, I can reduce the notifications to just a few or even one? I've seen commercial products like BigPanda that do this "event correlation" and notification reduction, but it seems there should be a way to do this programmatically or with an open source application. Thanks in advance!
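To make the goal concrete, here is a minimal Python sketch of the kind of reduction I mean. It is not tied to Nagios or any product: the `correlate` function, the `(host, service, state)` tuple layout, and the `min_group` cutoff are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def correlate(events, min_group=2):
    """Collapse identical (service, state) events from many hosts.

    events: iterable of (host, service, state) tuples (assumed layout).
    min_group: hypothetical cutoff; groups at least this large are
    reported as one summary line instead of one line per host.
    """
    groups = defaultdict(list)
    for host, service, state in events:
        groups[(service, state)].append(host)

    notifications = []
    # Sort the groups so the output order is deterministic.
    for (service, state), hosts in sorted(groups.items()):
        if len(hosts) >= min_group:
            notifications.append(
                f"{service} {state} on {len(hosts)} nodes (e.g. {hosts[0]})"
            )
        else:
            notifications.extend(f"{service} {state} on {h}" for h in hosts)
    return notifications


if __name__ == "__main__":
    sample = [
        ("n1", "load", "CRITICAL"),
        ("n2", "load", "CRITICAL"),
        ("n3", "disk", "WARNING"),
    ]
    for line in correlate(sample):
        print(line)
```

With 2500 nodes reporting the same failure, this would produce one summary line rather than 2500 separate ones.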
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: nagios HPC cluster event correlation

Post by scottwilkerson »

What you are looking for is check_cluster.
https://assets.nagios.com/downloads/nag ... sters.html

Set up a service to monitor the cluster and send notifications on that instead of on each of the individual services.
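As a rough illustration of the idea behind a cluster check, the Python sketch below counts how many member services are not OK and maps that count to a single state. It is not the check_cluster plugin itself: the `cluster_state` function is hypothetical, and treating `warn` and `crit` as simple "at least this many" counts is a simplifying assumption that may not match the plugin's own threshold semantics.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def cluster_state(state_ids, warn, crit):
    """Reduce many per-service state IDs to one cluster state.

    state_ids: list of integer states (0 = OK, anything else = not OK).
    warn, crit: assumed thresholds; the cluster goes WARNING or CRITICAL
    when at least that many members are not OK.
    """
    non_ok = sum(1 for s in state_ids if s != OK)
    if non_ok >= crit:
        state = CRITICAL
    elif non_ok >= warn:
        state = WARNING
    else:
        state = OK
    message = (
        f"CLUSTER {LABELS[state]}: {non_ok} of {len(state_ids)} services not OK"
    )
    return state, message


if __name__ == "__main__":
    # Four member services: two OK, one CRITICAL, one WARNING.
    print(cluster_state([0, 0, 2, 1], warn=2, crit=3))
```

Notifications would then be enabled only on the one cluster service, with the per-node services left to report state without notifying.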
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: nagios HPC cluster event correlation

Post by ssax »

There's nothing built into Core for this, but I did find this:

https://flapjack.io/
https://flapjack.io/docs/2.x/usage/Configuring-Nagios/

You can create a feature request here if you'd like:

https://github.com/NagiosEnterprises/na ... issues/new