Nagios False Alerts

Post by **ivp2015** » Sat Feb 16, 2019 12:50 pm

Hi Team,

We have configured monitoring of 12 Linux servers via Nagios, which is hosted on AWS. In the configuration we enabled disk, load, ping, and port monitoring.

Today at around 05:39 PM (IST) we got DOWN alerts for only 5 servers, and 2 minutes later, at 5:41 PM (IST), we received UP alerts. We checked all the servers but didn't find anything wrong on the server side. We also checked each server's uptime and last reboot. So it seems it was a false positive. We need your help to identify why these alerts were triggered.
Below are the configuration details.

Check Interval 2 min
Retry Interval 1 min
Max check Attempts 2
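
To make the effect of the three settings above concrete, here is a minimal Python sketch of the soft/hard state logic as it is commonly described for Nagios host checks. It is a simplified model and not Nagios code: the class name `HostCheckState` and its `record` method are hypothetical, and re-notifications, flapping, and reachability logic are ignored. With max check attempts set to 2, the first failed check leaves the host in a SOFT DOWN state, and a second failure on the 1-minute retry makes it HARD DOWN and triggers the notification.

```python
from dataclasses import dataclass


@dataclass
class HostCheckState:
    """Simplified model of soft/hard host states (hypothetical, not Nagios code)."""
    max_check_attempts: int = 2
    attempt: int = 0
    state: str = "UP"
    state_type: str = "HARD"

    def record(self, check_ok: bool) -> bool:
        """Feed one check result; return True if a notification would be sent."""
        if check_ok:
            # Only a recovery from a HARD DOWN state produces an UP notification.
            recovered = self.state == "DOWN" and self.state_type == "HARD"
            self.state, self.state_type, self.attempt = "UP", "HARD", 0
            return recovered
        if self.state == "DOWN" and self.state_type == "HARD":
            # Already in HARD DOWN; this sketch ignores re-notifications.
            return False
        self.attempt += 1
        self.state = "DOWN"
        if self.attempt >= self.max_check_attempts:
            self.state_type = "HARD"
            return True
        self.state_type = "SOFT"
        return False


host = HostCheckState(max_check_attempts=2)
first = host.record(False)    # regular check fails: SOFT DOWN, no notification
second = host.record(False)   # retry 1 minute later fails: HARD DOWN, notification
third = host.record(True)     # next check succeeds: recovery notification
```

Under this model, two consecutive failed pings one minute apart are enough to produce a DOWN alert, so a short network interruption can show up as a DOWN followed quickly by an UP.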

Service configuration.
1- Check Disk.
2- Check Load.
3- Application port no.
4- Ping.
5- SSH.

Please let us know if any other details are required.

Post by **ivp2015** » Sat Feb 16, 2019 12:51 pm

The same configuration settings apply to all 12 Linux servers.

Check Interval 2 min
Retry Interval 1 min
Max check Attempts 2

Service configuration.
1- Check Disk.
2- Check Load.
3- Application port no.
4- Ping.
5- SSH.

Post by **ivp2015** » Sat Feb 16, 2019 9:56 pm

This is what the down alerts look like:
***** Nagios Hardware Alert *****

Nagios has detected a problem with this host.

Notification Type: PROBLEM
Host:
State: DOWN
Address: host IP
Info: CRITICAL - host IP: rta nan, lost 100%
Date/Time: 2019-02-16 17:40:49
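
For anyone who wants to collect these alerts and compare them across hosts, the following Python sketch pulls the status, target, round-trip time, and packet loss out of an `Info:` line like the one above. The regular expression is an assumption based only on the sample line in this post, and the function name `parse_ping_info` is hypothetical; other plugin versions may format their output differently.

```python
import math
import re

# Pattern inferred from the sample alert line; treat it as an assumption.
PING_INFO = re.compile(
    r"(?P<status>[A-Z]+) - (?P<target>.+?): "
    r"rta (?P<rta>nan|[\d.]+)(?:ms)?, lost (?P<loss>\d+)%"
)


def parse_ping_info(line):
    """Return a dict with status, target, rta_ms, and loss_pct, or None if no match."""
    match = PING_INFO.search(line)
    if match is None:
        return None
    return {
        "status": match.group("status"),
        "target": match.group("target"),
        # float("nan") is valid, so "rta nan" becomes a NaN value.
        "rta_ms": float(match.group("rta")),
        "loss_pct": int(match.group("loss")),
    }


result = parse_ping_info("Info: CRITICAL - host IP: rta nan, lost 100%")
no_reply = result is not None and math.isnan(result["rta_ms"])
```

An `rta` of `nan` together with 100% loss means no reply came back at all, so there was no round-trip time to measure.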

Post by **ivp2015** » Mon Feb 18, 2019 11:54 am

Hi Team,
Any update on this?

Post by **cdienger** » Mon Feb 18, 2019 4:34 pm

Judging by the notification provided, it lost the ability to ping these servers. This is usually due to networking issues. For example, it could occur if the route between the XI server and the monitored machine goes down, if a firewall drops the ICMP packets used by the check to determine whether the monitored server is up or down, or if the IP of the destination changed. Essentially, anything that prevents a ping between XI and the monitored machine from working can cause this.

Post by **ivp2015** » Mon Feb 18, 2019 5:32 pm

Okay. But why did this happen with only 5 servers? The notification settings for the other servers are the same.

Post by **cdienger** » Mon Feb 18, 2019 5:44 pm

Either because it could still ping the other servers during that time, or because the checks for those servers may not have run while the issue was occurring.
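
The timing point can be illustrated with a rough Python sketch. It assumes the settings posted earlier in the thread (2-minute check interval, 1-minute retry interval, 2 max check attempts) and a hypothetical 90-second network interruption. The function name `goes_hard_down`, the schedule offsets, and the outage window are all made up for illustration, and real Nagios scheduling is more involved than this.

```python
def goes_hard_down(first_check, outage_start, outage_end,
                   check_interval=120, retry_interval=60,
                   max_attempts=2, horizon=3600):
    """Return True if a host checked on this schedule would reach HARD DOWN.

    All times are in seconds. A check fails only if it runs inside the
    half-open outage window [outage_start, outage_end).
    """
    t = first_check
    failures = 0
    while t < horizon:
        if outage_start <= t < outage_end:
            failures += 1
            if failures >= max_attempts:
                return True
            t += retry_interval   # failed check: retry sooner
        else:
            failures = 0
            t += check_interval   # successful check: normal schedule
    return False


# Hypothetical 90-second interruption from t=600 to t=690.
offsets = [0, 20, 30, 60, 100]
alerted = [o for o in offsets if goes_hard_down(o, 600, 690)]
```

In this toy run only the hosts whose first check and retry both land inside the interruption are alerted; the others either check once during it and recover on the retry, or never check during it at all. That is one way identical settings can produce alerts for only some hosts.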

Post by **ssax** » Mon Feb 18, 2019 5:45 pm

Given that these servers are all behind AWS, what did the output of the other services say at that point? Could it have been a firewall issue with AWS or something?

We would need to see the output of the checks during that time to see if we can glean any additional information from the services.

Did all of the AWS hosts give the same "100% lost" host results at the same time? Or did you have some that did and some that didn't?

Post by **ivp2015** » Mon Feb 18, 2019 6:04 pm

Yes, all the servers are hosted on AWS. If there were a network issue, then all the servers and their services (Ping, SSH, Disk, Load) should have been reported as down or critical. But we got down alerts for only five servers, and there were no alerts for services such as Ping, SSH, Disk, and Load.

Post by **cdienger** » Thu Feb 21, 2019 2:56 pm

Not getting notifications for services when a host is down is expected - if a host is down there's no need to spam people with emails with service notifications since it's assumed that these are down as well.
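
As a rough illustration of that rule, here is a small Python sketch. It is a simplified model of the behavior described above, not the actual Nagios notification logic, and the function name `should_notify_service` is hypothetical.

```python
def should_notify_service(host_state, service_state, service_state_type):
    """Simplified model: service alerts are suppressed while the host is not UP."""
    if host_state != "UP":
        # The host problem is already reported; its services are assumed affected too.
        return False
    # Otherwise notify only for a confirmed (HARD) non-OK service state.
    return service_state != "OK" and service_state_type == "HARD"


suppressed = should_notify_service("DOWN", "CRITICAL", "HARD")   # False
sent = should_notify_service("UP", "CRITICAL", "HARD")           # True
```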

Nagios identified a few machines that it wasn't able to ping. It's clear that these checks ran and that no response was received from the remote machines. Determining why they failed isn't something that can really be investigated unless the issue is currently happening.

Nagios Support Forum
