Nagios HPC cluster event correlation
Posted: Tue Jun 18, 2019 10:06 pm
In monitoring an HPC cluster with 2500 nodes, is there an open source way to set up notification logic so that if an identical event occurs across dozens, hundreds, or even all 2500 nodes, I can reduce the notifications to just a few, or even one? I've seen commercial products like BigPanda that do this kind of "event correlation" and notification reduction, but it seems there should be a way to do this programmatically or with an open source application. Thanks in advance!
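
For illustration, the kind of reduction being asked about could be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions, not Nagios functionality: the `correlate` function, the `(host, service, state, timestamp)` tuple layout, and the `window` and `threshold` parameters are all hypothetical names made up for the example. Events with the same service and state that fall in the same fixed time bucket are collapsed into one summary once enough hosts report them.

```python
from collections import defaultdict


def correlate(events, window=300, threshold=10):
    """Collapse identical events across many hosts into summary notifications.

    events    -- iterable of (host, service, state, unix_timestamp) tuples
                 (hypothetical layout, chosen for this sketch)
    window    -- size of the fixed time bucket in seconds
    threshold -- minimum number of hosts before events are summarized
    """
    groups = defaultdict(list)
    for host, service, state, ts in events:
        # Events sharing service, state, and time bucket count as "identical".
        bucket = int(ts // window)
        groups[(service, state, bucket)].append(host)

    notifications = []
    for (service, state, _bucket), hosts in sorted(groups.items()):
        if len(hosts) >= threshold:
            # One summary message instead of one message per host.
            notifications.append(f"{service} {state} on {len(hosts)} hosts")
        else:
            # Too few hosts to call it a cluster-wide event; notify per host.
            notifications.extend(f"{service} {state} on {h}" for h in hosts)
    return notifications


if __name__ == "__main__":
    # 2500 nodes reporting the same problem, plus one unrelated warning.
    events = [(f"node{i:04d}", "ssh", "CRITICAL", 1000 + i % 50) for i in range(2500)]
    events.append(("node0001", "disk", "WARNING", 1010))
    for line in correlate(events):
        print(line)
```

Fixed buckets are the simplest choice here; events that straddle a bucket boundary would be split into two groups, so a sliding window may be a better fit in practice.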