nagios HPC cluster event correlation

Posted: Tue Jun 18, 2019 10:06 pm
by steelah1
In monitoring an HPC cluster with 2500 nodes, is there an open source way to set up notification logic so that, if an identical event occurs across dozens, hundreds, or even all 2500 nodes, I can reduce the notifications down to just a few or even one? I've seen commercial products like BigPanda that do this "event correlation" and notification reduction, but it seems there should be a way to do this programmatically or via some open source application. Thanks in advance!

Re: nagios HPC cluster event correlation

Posted: Wed Jun 19, 2019 4:27 pm
by scottwilkerson
What you are looking for is check_cluster.
https://assets.nagios.com/downloads/nag ... sters.html

Set up a service to monitor the cluster and send notifications on that instead of on each of the individual services.
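
To make the idea concrete, here is a minimal Python sketch of the kind of aggregation a cluster check performs. It is not the check_cluster plugin itself: the function name cluster_state, the "at or above threshold" comparison, and the output text are all assumptions made for illustration. It takes the per-node state IDs (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, the standard Nagios plugin return codes) and collapses them into a single state, so that one service can carry the notification for the whole cluster.

```python
from collections import Counter

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def cluster_state(state_ids, warn, crit):
    """Collapse per-node state IDs into one (exit_code, summary) pair.

    Hypothetical helper: a node counts as a problem if its state is
    anything other than OK. The cluster goes WARNING once `warn` nodes
    are affected and CRITICAL once `crit` nodes are affected.
    """
    counts = Counter(state_ids)
    non_ok = sum(n for state, n in counts.items() if state != OK)

    if non_ok >= crit:
        code = CRITICAL
    elif non_ok >= warn:
        code = WARNING
    else:
        code = OK

    summary = f"CLUSTER {LABELS[code]}: {non_ok} of {len(state_ids)} nodes not OK"
    return code, summary


if __name__ == "__main__":
    # 2500 nodes, 100 of them reporting CRITICAL for the same check.
    states = [OK] * 2400 + [CRITICAL] * 100
    code, summary = cluster_state(states, warn=50, crit=200)
    print(summary)
    raise SystemExit(code)
```

With notifications disabled on the 2500 per-node services and enabled only on the aggregate service, an event that hits many nodes at once produces one alert rather than one per node. The thresholds (50 and 200 here) are placeholders and would need tuning for the cluster in question.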

Re: nagios HPC cluster event correlation

Posted: Wed Jun 19, 2019 4:35 pm
by ssax
There's nothing built into Core for this, but I did find this:

https://flapjack.io/
https://flapjack.io/docs/2.x/usage/Configuring-Nagios/

You can create a feature request here if you'd like:

https://github.com/NagiosEnterprises/na ... issues/new