Nagios False Alerts

Posted: Sat Feb 16, 2019 12:50 pm
by ivp2015
Hi Team,

We have configured Nagios to monitor 12 Linux servers hosted on AWS. The configuration enables disk, load, ping, and port monitoring.

Today at around 05:39 PM (IST) we got DOWN alerts for only 5 servers, and 2 minutes later, at 5:41 PM (IST), we received UP alerts. We checked all the servers but didn't find anything wrong on the server side. We also checked the uptime and last reboot of every server, so it seems this was a false positive. We need your help to identify why these alerts were triggered.
Below are the configuration details.

Check Interval 2 min
Retry Interval 1 min
Max check Attempts 2

Service configuration.
1- Check Disk.
2- Check Load.
3- Application port no.
4- Ping.
5- SSH.

Please let us know if any other details are required.

Re: Nagios False Alerts

Posted: Sat Feb 16, 2019 12:51 pm
by ivp2015
The same configuration settings apply to all 12 Linux servers.

Check Interval 2 min
Retry Interval 1 min
Max check Attempts 2

Service configuration.
1- Check Disk.
2- Check Load.
3- Application port no.
4- Ping.
5- SSH.

Re: Nagios False Alerts

Posted: Sat Feb 16, 2019 9:56 pm
by ivp2015
The DOWN alerts look like this:
***** Nagios Hardware Alert *****

Nagios has detected a problem with this host.

Notification Type: PROBLEM
Host:
State: DOWN
Address: host IP
Info: CRITICAL - host IP: rta nan, lost 100%
Date/Time: 2019-02-16 17:40:49
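
For anyone collecting these alerts, the "Info" line above can be split into its parts. The following is a minimal sketch, assuming the summary always follows the "STATUS - target: rta X, lost N%" shape seen in this alert; the function name parse_ping_info and the returned keys are made up for illustration and are not part of Nagios.

```python
import math
import re

# Matches summaries shaped like the alert in this thread, for example:
#   "CRITICAL - host IP: rta nan, lost 100%"
# The exact format is an assumption based on that single sample.
INFO_RE = re.compile(
    r"^(?P<status>OK|WARNING|CRITICAL|UNKNOWN) - (?P<target>.+?): "
    r"rta (?P<rta>nan|[\d.]+)(?:ms)?, lost (?P<lost>\d+)%"
)


def parse_ping_info(info):
    """Return the parsed fields of a ping summary, or None if it does not match."""
    m = INFO_RE.match(info.strip())
    if m is None:
        return None
    rta = float(m.group("rta"))  # float("nan") parses to a NaN value
    return {
        "status": m.group("status"),
        "target": m.group("target"),
        # "rta nan" means no reply came back, so there is no round-trip time
        "rta_ms": None if math.isnan(rta) else rta,
        "lost_pct": int(m.group("lost")),
    }
```

A result with lost_pct equal to 100 and rta_ms equal to None means that not a single reply was received during that check.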

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 11:54 am
by ivp2015
Hi Team,
Any update on this?

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 4:34 pm
by cdienger
Judging by the notification provided, Nagios lost the ability to ping these servers. This is usually due to a networking issue. For example, it could occur if the route between the XI server and the monitored machine goes down, if a firewall drops the ICMP packets the check uses to determine whether the monitored server is up or down, or if the IP of the destination changed. Essentially, anything that could prevent a ping between XI and the monitored machine from working.
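
One way to test this while a problem is occurring is to ping the affected address from the Nagios server itself. The following is a rough sketch, assuming a Linux ping command that accepts -c and prints a "N% packet loss" summary; the helper names parse_packet_loss and ping_loss are hypothetical.

```python
import re
import subprocess

# Summary line printed by the Linux ping command, for example:
#   "5 packets transmitted, 0 received, 100% packet loss, time 4005ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet-loss percentage from ping's summary, or None if absent."""
    m = LOSS_RE.search(ping_output)
    return float(m.group(1)) if m else None


def ping_loss(address, count=5, timeout_s=15):
    """Run the system ping and return packet loss in percent (None if unknown)."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), address],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # ping hung past the timeout, or is not installed on this machine
        return None
    return parse_packet_loss(result.stdout)
```

Calling ping_loss with the address of one of the five servers, run from the Nagios host, shows whether ICMP currently gets through. A result of 100.0 would match the "lost 100%" in the alert.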

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 5:32 pm
by ivp2015
Okay, but why did this happen with only 5 servers? The notification settings for the other servers are the same.

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 5:44 pm
by cdienger
Either because Nagios could still ping the other servers during that time, or because the checks for the other servers did not run during the window in which the issue occurred.
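
The second possibility can be illustrated with a simplified model of the schedule. This is only a sketch: it uses the intervals posted in this thread, assumes each host's checks are staggered by a fixed offset, and ignores how Nagios reschedules checks after a failure. The offsets and the outage window are made-up numbers, not data from the actual incident.

```python
CHECK_INTERVAL_S = 120  # "Check Interval 2 min" from the thread
RETRY_INTERVAL_S = 60   # "Retry Interval 1 min"
MAX_ATTEMPTS = 2        # "Max check Attempts 2"


def goes_hard_down(first_check_s, outage_start_s, outage_end_s, horizon_s=3600):
    """True if a regular check and all of its retries fall inside the outage."""
    t = first_check_s
    while t < horizon_s:
        attempts = [t + i * RETRY_INTERVAL_S for i in range(MAX_ATTEMPTS)]
        if all(outage_start_s <= a < outage_end_s for a in attempts):
            return True
        t += CHECK_INTERVAL_S
    return False


def hosts_alerting(offsets, outage_start_s, outage_end_s):
    """Names of the hosts that would reach a hard DOWN state in this model."""
    return [
        name
        for name, offset in offsets.items()
        if goes_hard_down(offset, outage_start_s, outage_end_s)
    ]


# Hypothetical setup: 12 hosts whose checks start 10 seconds apart,
# and a network interruption lasting 105 seconds.
offsets = {f"host{i + 1:02d}": i * 10 for i in range(12)}
print(hosts_alerting(offsets, outage_start_s=600, outage_end_s=705))
```

In this model, only the hosts whose first check and retry both land inside the interruption are reported DOWN. The remaining hosts are checked either before it starts or after it ends, so a short interruption that affects every host can still produce alerts for only some of them.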

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 5:45 pm
by ssax
Given that these servers are all behind AWS, what did the output of the other services say at that point? Could it have been a firewall issue with AWS or something?

We would need to see the output of the checks during that time to see if we can glean any additional information from the services.

Did all of the AWS hosts give the same "100% lost" host results at the same time? Or did you have some that did and some that didn't?
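
To collect that output after the fact, the alert entries for the affected period can be pulled from the Nagios log. The following is a minimal sketch, assuming the usual "[epoch] HOST ALERT: host;state;state_type;attempt;output" and "[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output" entries; the function name alerts_in_window is hypothetical.

```python
import re
from datetime import datetime, timezone

# Log entries start with an epoch timestamp in square brackets.
ALERT_RE = re.compile(r"^\[(?P<ts>\d+)\] (?P<kind>HOST|SERVICE) ALERT: (?P<rest>.*)$")


def alerts_in_window(lines, start, end):
    """Yield (time, kind, fields) for alert entries with start <= time <= end."""
    for line in lines:
        m = ALERT_RE.match(line.rstrip("\n"))
        if m is None:
            continue
        when = datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc)
        if start <= when <= end:
            # HOST entries have 5 fields, SERVICE entries have 6; the last
            # field is the plugin output and may itself contain semicolons.
            n_fields = 5 if m.group("kind") == "HOST" else 6
            yield when, m.group("kind"), m.group("rest").split(";", n_fields - 1)
```

The start and end arguments are timezone-aware datetime values. If the 17:40:49 in the alert is IST, the window of interest is around 12:10 UTC on 2019-02-16. The lines can come from iterating over the opened log file; its location depends on the installation.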

Re: Nagios False Alerts

Posted: Mon Feb 18, 2019 6:04 pm
by ivp2015
Yes, all the servers are hosted on AWS. If there had been a network issue, then all the servers and their services (Ping, SSH, Disk, Load) should have gone down or critical and sent notifications. But we got DOWN alerts for only five servers, and there were no alerts for services such as Ping, SSH, Disk, or Load.

Re: Nagios False Alerts

Posted: Thu Feb 21, 2019 2:56 pm
by cdienger
Not getting notifications for services when a host is down is expected - if a host is down there's no need to spam people with emails with service notifications since it's assumed that these are down as well.
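
Reduced to a rule, the behavior looks roughly like this. It is a deliberately simplified sketch of the idea and not the actual Nagios notification logic, which applies many more filters; the function name is made up for illustration.

```python
def should_notify_service(host_state, service_state):
    """Simplified rule: alert on a service problem only while its host is UP."""
    service_has_problem = service_state != "OK"
    host_is_up = host_state == "UP"
    return service_has_problem and host_is_up
```

Under this rule, a CRITICAL service on a host that is already DOWN produces no service notification, because the host DOWN alert already covers it.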

Nagios identified a few machines that it wasn't able to ping. It's clear that these checks ran and that no response was received from the remote machines. Determining why they failed isn't something that can really be investigated unless the issue is currently happening.