
Host alerts down but server is up

Posted: Tue Sep 01, 2015 4:22 pm
by future ruins
Greetings!

I have a host that consistently alerts status as Down every 5 or 10 minutes, and then its next check succeeds. We receive flapping notifications during this activity. It will show Critical at a duration of -39s or around that mark (screenshot attached) and then it goes back to an OK status. I confirmed that the host is accessible and not having any problems when this happens. I don't suspect a networking issue, as the other hosts in this office do not have this problem. I've tried to re-add the host through the monitoring wizard, and attempted various check durations to no avail. I'm still kind of a noob with Nagios and I'm not sure what else I could do to fix this. Thanks in advance for any advice!

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 4:50 pm
by Box293
Can you show us the performance graphs for this host (last 4 hours)? It might be the round trip time which is causing issues, not packet loss.

When this problem next occurs, go to the host object, click the Advanced tab, and take a screenshot.

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 5:56 pm
by future ruins
Thanks for the reply, I'll record this information and post it here shortly.

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 6:29 pm
by future ruins
Hello Box293,

Does the attached help?

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 6:36 pm
by Box293
This screenshot tells us the information we are after:

Image

You can see in the performance data string pl=100% which is 100% packet loss.

By default the ping/ICMP check fires off 5 packets, and it appears to be losing all 5 of them.
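For anyone who wants to pull that number out of the performance data rather than read it off a screenshot, here is a minimal Python sketch. It assumes the usual plugin perfdata layout of `label=value[UOM];warn;crit;min;max` with items separated by spaces, and it ignores quoted labels that contain spaces. The sample string is hypothetical; only `pl=100%` comes from the screenshot.

```python
import re

# One perfdata item: label=value[UOM];warn;crit;min;max
# Only the label, value and unit are captured here.
PERF_ITEM = re.compile(r"(?P<label>[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^;\s]*)")


def parse_perfdata(perfdata):
    """Return {label: (value, unit)} for each item in a perfdata string."""
    result = {}
    for item in perfdata.split():
        match = PERF_ITEM.match(item)
        if match:
            result[match.group("label")] = (float(match.group("value")), match.group("uom"))
    return result


# Hypothetical sample string; thresholds and rta are made up for illustration.
sample = "rta=0.000ms;3000.000;5000.000;0; pl=100%;80;100;;"
loss, unit = parse_perfdata(sample)["pl"]
print(f"packet loss: {loss:g}{unit}")
```

A `pl` value of 100 means every packet in that check went unanswered, which is enough for the check to report the host as down.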

Is it just connectivity to this host that is a problem?

Is this host in the same subnet as the XI server?

From what you say, Nagios shows it down for one check interval (1/5) but recovers on the next check attempt, so it never really gets to a HARD down?
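To illustrate what I mean by never reaching a HARD down, here is a simplified Python model of the SOFT/HARD idea. It is only a sketch, not how Nagios is actually implemented: the function name is made up, and it assumes `max_check_attempts` is 5 (the "5" in 1/5).

```python
def classify_checks(results, max_check_attempts=5):
    """Label each "UP"/"DOWN" result as SOFT or HARD (simplified model).

    Returns a list of (result, state_type, consecutive_failures).
    """
    labelled = []
    failures = 0  # consecutive DOWN results seen so far
    for result in results:
        if result == "DOWN":
            failures = min(failures + 1, max_check_attempts)
            # Only the final attempt turns the problem into a HARD state.
            state_type = "HARD" if failures == max_check_attempts else "SOFT"
        else:
            # Recovering before the last attempt is a soft recovery.
            state_type = "SOFT" if 0 < failures < max_check_attempts else "HARD"
            failures = 0
        labelled.append((result, state_type, failures))
    return labelled


# One failed check followed by a recovery, as described in this thread.
for entry in classify_checks(["UP", "DOWN", "UP"]):
    print(entry)
```

In this model a single failed check followed by a good one stays SOFT the whole way through; it takes five failures in a row to produce a HARD down.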

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 6:45 pm
by future ruins
Correct, it is only this particular host that is exhibiting this activity.

The host is not in the same subnet as the Nagios server. However, I am monitoring many hosts on the same subnet as the affected host and they aren't showing this kind of activity.

You are correct that it shows down for one check interval and then recovers on the next, and this happens every 5 to 10 minutes. It shows a red Critical/Down status but then recovers to Up. I've confirmed that I can ping the affected host from the subnet that Nagios is on, even when it's showing down in Nagios.

Re: Host alerts down but server is up

Posted: Tue Sep 01, 2015 8:40 pm
by Box293
Is anything logged in these files when the problem occurs:
/var/log/messages
/usr/local/nagios/var/nagios.log


I suggest starting a constant ping in an SSH session on your XI server to the host in question. When the problem occurs do you see packets dropped?
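If the ping output is saved to a file, a small script can point out where replies went missing. This is a rough Python sketch that assumes Linux-style reply lines containing `icmp_seq=N` (some ping versions label the field differently). The function name and the sample lines are hypothetical, and losses before the first reply or after the last one are not detected.

```python
import re

SEQ = re.compile(r"icmp_seq=(\d+)")


def missing_sequences(ping_lines):
    """Return icmp_seq numbers with no reply between the first and last reply seen."""
    received = set()
    for line in ping_lines:
        match = SEQ.search(line)
        if match:
            received.add(int(match.group(1)))
    if not received:
        return []
    return [n for n in range(min(received), max(received) + 1) if n not in received]


# Hypothetical sample output (192.0.2.10 is a documentation address).
sample = [
    "64 bytes from 192.0.2.10: icmp_seq=1 ttl=58 time=27.1 ms",
    "64 bytes from 192.0.2.10: icmp_seq=2 ttl=58 time=26.9 ms",
    "64 bytes from 192.0.2.10: icmp_seq=5 ttl=58 time=31.4 ms",
]
print(missing_sequences(sample))  # [3, 4]
```

Comparing the gaps against the times Nagios reports the host as down shows whether the two line up.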

Re: Host alerts down but server is up

Posted: Wed Sep 02, 2015 10:42 am
by future ruins
I've pasted some of the errors from the log files you requested in the attached log. I tried a continuous ping to the affected host from the XI CLI over a 2 minute period and it reports the following:

ping statistics ---
156 packets transmitted, 111 received, 28% packet loss, time 156333ms
rtt min/avg/max/mdev = 26.792/32.432/266.600/24.965 ms
You have mail in /var/spool/mail/root

This is the activity of the continuous ping when the host is showing both up and down in Nagios.

When I do the same for any of the other hosts on that subnet, they report 0% packet loss.

The host in question is a VMware ESXi server. The other ESXi hosts in that farm have no issues with packet loss. Strange that it is just this one.
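As a sanity check on the summary line above, here is a short Python sketch that recomputes the loss from the transmitted and received counts. The function name is just for illustration, and the regular expression assumes the summary wording shown in my paste.

```python
import re

SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")


def packet_loss(summary_line):
    """Return (sent, received, lost, loss_percent) from a ping summary line."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = int(match.group(1)), int(match.group(2))
    lost = sent - received
    percent = 100.0 * lost / sent if sent else 0.0
    return sent, received, lost, percent


line = "156 packets transmitted, 111 received, 28% packet loss, time 156333ms"
sent, received, lost, percent = packet_loss(line)
print(f"{lost} of {sent} packets lost ({percent:.1f}%)")  # 45 of 156 packets lost (28.8%)
```

So 45 of the 156 packets never came back, which works out to about 28.8% and matches the 28% that ping reported.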

Re: Host alerts down but server is up

Posted: Wed Sep 02, 2015 10:49 am
by jdalrymple
future ruins wrote:
ping statistics ---
156 packets transmitted, 111 received, 28% packet loss, time 156333ms
rtt min/avg/max/mdev = 26.792/32.432/266.600/24.965 ms
You have mail in /var/spool/mail/root
The packet loss is most certainly your issue. I suspect some sort of ARP poisoning or something similar. I would inspect the ARP debug output of the switch that your vmkernel port is connected to and also take a look at that VMware host's events. It's fairly safe to rule out the Nagios server as the problem since other hosts are fine. To be certain, you could try that same persistent ping from a nearby server (when I say nearby, I mean near the Nagios server).

Re: Host alerts down but server is up

Posted: Wed Sep 02, 2015 11:02 am
by future ruins
Yep, a continuous ping from another host on the Nagios server subnet is producing the same packet loss. I guess I should have tried that first, back when I did the regular ping from another host. Very strange, but I agree that at this point we can rule out Nagios being the problem. I will investigate further with my network team. Thanks for your help regardless.