Nagios redundancy in AWS

Post by **[email protected]** » Wed Sep 20, 2017 7:58 am

Has anyone successfully built Nagios instances in AWS that are effectively load balanced, in such a way that alerting is not duplicated?

What I would like to achieve is to have Nagios servers running in multiple availability zones within AWS, so that monitoring functions as close to 100% of the time as possible. However, this presents challenges in some aspects of monitoring. For instance, if you have multiple instances of Nagios all polling a specific device and that device goes down, you are going to get multiple alerts for that one incident. I think the inbound traps are easily solved with a load balancer, but I don't know what it means for only one of three Nagios servers to receive a trap.

What are people's thoughts on this design aspect?

Post by **mcapra** » Wed Sep 20, 2017 8:08 am

That sort of setup is likely to cause some fragmented reporting, because only one XI instance is receiving the actual check information. This of course doesn't matter if you don't care about reporting or storing time-series data regarding your services; the setup should work in that case, assuming all contacts and notification settings are correctly configured.

Personally, if the primary concern is having multiple XI instances running but avoiding duplicate alerts, I'd ship alerts to a message queue (I made a dirt simple RabbitMQ component) and let your queue's consumers deal with removing duplicate messages. It's a bit lazy, but it'll ensure each XI instance has a full copy of everything happening in your infrastructure. If you lose half of your infra, and 3 of 4 XI instances happen to be in that half, you can still have some pretty comprehensive intel to work with from the remaining XI instance.

As a note, my RabbitMQ sender I linked above would definitely have to be modified to include destination emails in this setup. Otherwise you lose out on all of the rich Nagios alerting logic and wind up doubling work.

A nice thing about RabbitMQ is it has native high-availability and fail-over options. Plus you'd have the added resilience of not black-holing your alerts when email goes down.
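To make the "let the consumers remove duplicates" idea concrete, here is a minimal sketch of the de-duplication logic a queue consumer might apply. It is not the RabbitMQ component mentioned above, and it leaves out the queue client entirely. The `Alert` fields, the `Deduplicator` class, and the five-minute window are all assumptions made for illustration; the `contacts` field reflects the point that the message would need to carry its destination emails.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert payload published by each Nagios instance."""
    host: str
    service: str
    state: str             # e.g. "CRITICAL", "WARNING", "OK"
    source: str            # which Nagios instance sent it (assumed field)
    contacts: tuple = ()   # destination emails carried in the message


@dataclass
class Deduplicator:
    """Suppress repeats of the same host/service/state within a time window."""
    window_seconds: float = 300.0
    _seen: dict = field(default_factory=dict)

    def should_notify(self, alert, now=None):
        now = time.time() if now is None else now
        # The sending instance is deliberately not part of the key, so the
        # same incident reported by several instances collapses to one alert.
        key = (alert.host, alert.service, alert.state)
        last = self._seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False   # duplicate inside the window: drop it
        self._seen[key] = now
        return True


if __name__ == "__main__":
    dedup = Deduplicator(window_seconds=60)
    first = Alert("db01", "MySQL", "CRITICAL", "xi-a", ("ops@example.com",))
    second = Alert("db01", "MySQL", "CRITICAL", "xi-b", ("ops@example.com",))
    for alert in (first, second):
        action = "send" if dedup.should_notify(alert) else "drop"
        print(action, alert.source, alert.host, alert.service, alert.state)
```

Note that this keeps its state in process memory, so it only de-duplicates within a single consumer. If the consumers themselves are redundant, the "seen" table would presumably have to live in a store they all share.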

Post by **cdienger** » Wed Sep 20, 2017 4:26 pm

Thanks as always for the input mcapra. Did this help, Dan?

Nagios Support Forum