nagios HPC cluster event correlation
I'm monitoring an HPC cluster with 2,500 nodes. Is there an open source way to set up notification logic so that if an identical event occurs across dozens, hundreds, or even all 2,500 nodes, the notifications are reduced to just a few, or even one? I've seen commercial products such as BigPanda that do this kind of "event correlation" and notification reduction, but it seems there should be a way to do this programmatically or with an open source application. Thanks in advance!
Re: nagios HPC cluster event correlation
What you are looking for is check_cluster.
https://assets.nagios.com/downloads/nag ... sters.html
Set up a service to monitor the cluster, and send notifications on that service instead of on each of the individual services.
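To illustrate the idea, here is a rough Python sketch of the kind of aggregation a cluster check performs: many per-node states are collapsed into one state, so only one service ever notifies. This is not the check_cluster source. The function name, the "at least N non-OK" threshold rule, and the node names are assumptions made for the example; only the 0/1/2 state codes follow the usual Nagios plugin convention.

```python
# Hypothetical sketch of cluster-style aggregation, not the real plugin.
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def cluster_state(node_states, warn_threshold, crit_threshold):
    """Collapse per-node states into a single (state, non_ok_count) pair.

    node_states: dict mapping node name -> state code (0 means OK).
    Assumed rule: the thresholds are "at least this many non-OK nodes".
    """
    non_ok = sum(1 for state in node_states.values() if state != OK)
    if non_ok >= crit_threshold:
        return CRITICAL, non_ok
    if non_ok >= warn_threshold:
        return WARNING, non_ok
    return OK, non_ok


# Example: 2500 nodes, 300 of them reporting the same CRITICAL event.
states = {f"node{i:04d}": OK for i in range(2500)}
for i in range(300):
    states[f"node{i:04d}"] = CRITICAL

state, count = cluster_state(states, warn_threshold=50, crit_threshold=250)
print(f"cluster state={state}, non-OK nodes={count}")
```

With notifications disabled on the individual node services and enabled only on the aggregate one, 300 identical failures would produce a single alert rather than 300.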
Re: nagios HPC cluster event correlation
There's nothing built into Core for this, but I did find this:
https://flapjack.io/
https://flapjack.io/docs/2.x/usage/Configuring-Nagios/
You can create a feature request here if you'd like:
https://github.com/NagiosEnterprises/na ... issues/new