Very strange repeating alert (Nagios XI eval)
Posted: Mon Sep 14, 2015 10:12 am
by stf_792
Hi,
I am evaluating Nagios XI and currently have about 200 hosts with about 700 services being monitored.
Recently I noticed a strange issue.
I have a vSphere-based host (one of many) with 4 VMs running on it.
Two of those four VMs generate UP/DOWN and then flapping alerts every day, at almost the same time each day (within a few minutes).
None of the other 198 hosts do that.
During "DOWN" time, both servers are pingable from the Nagios server and from other PCs on the network. Both servers are Windows-based and were built from the same template, but around 5 more servers built from that template do not produce the error.
There is nothing in the monitored servers' logs.
Searching has not produced any meaningful results. Any pointers on where to look would be greatly appreciated.
Re: Very strange repeating alert (Nagios XI eval)
Posted: Mon Sep 14, 2015 10:21 am
by jdalrymple
Sounds creepishly like this:
https://support.nagios.com/forum/viewto ... el#p151115
At the end of the day we decided it was a network problem on his end. Not to say that's your problem also, but we'll need more info to start with.
1) How were the failing hosts set up? Windows server wizard?
2) What is the status output (not just the status, but the description to go with it) after the hosts enter a down state?
3) Does a persistent ping over 10 minutes or so from your Nagios box yield no packet loss?
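One way to run the persistent ping from question 3 and get a single loss figure at the end is sketched below. It assumes a Linux Nagios box whose `ping` accepts `-c` and prints an iputils-style summary line ("N% packet loss"); the function names are illustrative and not part of Nagios XI.

```python
import re
import subprocess
from typing import Optional


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Extract the packet-loss percentage from ping's summary line, if present."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_output)
    return float(match.group(1)) if match else None


def persistent_ping(host: str, duration_s: int = 600) -> Optional[float]:
    """Send one echo request per second for roughly duration_s seconds.

    Assumes the default 1-second interval of Linux ping, so the packet
    count equals the duration in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", str(duration_s), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on loss; we still want the summary
    )
    return parse_packet_loss(result.stdout)
```

Calling `persistent_ping("<my_server_name>.my.domain")` blocks for about 10 minutes and returns the loss percentage, or `None` if no summary line was found.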
Re: Very strange repeating alert (Nagios XI eval)
Posted: Mon Sep 14, 2015 10:49 am
by stf_792
1) How were the failing hosts set up? Windows server wizard?
The hosts were set up by running the Auto Discovery Wizard on the VLAN and selecting Ping and NetBIOS.
2) What is the status output (not just the status, but the description to go with it) after the hosts enter a down state?
Code: Select all
time host service statechange state statetype currentattempt maxattempts laststate lasthardstate information
9/14/2015 0:26 <my_server_name>.my.domain NetBIOS 1 OK HARD 5 5 CRITICAL CRITICAL TCP OK - 0.001 second response time on <my_server_name>.my.domain port 139
9/14/2015 0:25 <my_server_name>.my.domain 1 UP HARD 1 5 DOWN UP OK - <my_server_name>.my.domain: rta 0.311ms, lost 0%
9/14/2015 0:25 <my_server_name>.my.domain Ping 1 OK HARD 5 5 CRITICAL CRITICAL OK - <my_server_name>.my.domain: rta 0.427ms, lost 0%
9/14/2015 0:20 <my_server_name>.my.domain Ping 1 CRITICAL HARD 5 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:19 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 4 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:18 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 3 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:17 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 2 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:16 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 1 5 OK OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:16 <my_server_name>.my.domain 1 DOWN HARD 5 5 UP UP CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:16 <my_server_name>.my.domain NetBIOS 1 CRITICAL HARD 5 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/14/2015 0:15 <my_server_name>.my.domain 1 DOWN SOFT 4 5 UP UP CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:15 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 4 5 CRITICAL OK CRITICAL - Socket timeout after 10 seconds
9/14/2015 0:14 <my_server_name>.my.domain 1 DOWN SOFT 3 5 UP UP CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:14 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 3 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/14/2015 0:13 <my_server_name>.my.domain 1 DOWN SOFT 2 5 UP UP CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:13 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 2 5 CRITICAL OK CRITICAL - Socket timeout after 10 seconds
9/14/2015 0:12 <my_server_name>.my.domain 1 DOWN SOFT 1 5 UP UP CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:12 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 1 5 OK OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:42 <my_server_name>.my.domain NetBIOS 1 OK HARD 5 5 CRITICAL CRITICAL TCP OK - 0.001 second response time on <my_server_name>.my.domain port 139
9/13/2015 23:41 <my_server_name>.my.domain 1 UP HARD 1 5 DOWN UP OK - <my_server_name>.my.domain: rta 0.325ms, lost 0%
9/13/2015 23:41 <my_server_name>.my.domain Ping 1 OK HARD 5 5 CRITICAL CRITICAL OK - <my_server_name>.my.domain: rta 0.665ms, lost 0%
9/13/2015 23:27 <my_server_name>.my.domain NetBIOS 1 CRITICAL HARD 5 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:26 <my_server_name>.my.domain Ping 1 CRITICAL HARD 5 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:26 <my_server_name>.my.domain 1 DOWN HARD 5 5 UP UP CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:26 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 4 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:25 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 4 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:25 <my_server_name>.my.domain 1 DOWN SOFT 4 5 UP UP CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:25 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 3 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:24 <my_server_name>.my.domain 1 DOWN SOFT 3 5 UP UP CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:24 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 3 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:24 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 2 5 CRITICAL OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:24 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 2 5 CRITICAL OK CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23 <my_server_name>.my.domain 1 DOWN SOFT 2 5 UP UP CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23 <my_server_name>.my.domain NetBIOS 1 CRITICAL SOFT 1 5 OK OK connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:23 <my_server_name>.my.domain 1 DOWN SOFT 1 5 UP UP CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23 <my_server_name>.my.domain Ping 1 CRITICAL SOFT 1 5 OK OK CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
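To see how long each outage lasts, the host rows of the state history above can be reduced to outage windows. The sketch below assumes the whitespace-separated layout shown in the export, where host rows have no service name so the state (UP/DOWN) is the fifth token; the function names are made up for illustration.

```python
from datetime import datetime


def parse_host_events(lines):
    """Return (time, state, state_type) for host rows, oldest first."""
    events = []
    for line in lines:
        parts = line.split()
        # Service rows and the header row do not have UP/DOWN as the 5th token.
        if len(parts) < 8 or parts[4] not in ("UP", "DOWN"):
            continue
        when = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%Y %H:%M")
        events.append((when, parts[4], parts[5]))
    return sorted(events, key=lambda event: event[0])


def outage_windows(events):
    """Pair the first DOWN event of each outage with the next UP event."""
    windows = []
    start = None
    for when, state, _state_type in events:
        if state == "DOWN" and start is None:
            start = when
        elif state == "UP" and start is not None:
            windows.append((start, when))
            start = None
    return windows
```

Applied to the history above, this gives two windows for the host: 23:23 to 23:41 on 9/13 and 0:12 to 0:25 on 9/14.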
3) Does a persistent ping over 10 minutes or so from your Nagios box yield no packet loss?
I have not tried a continuous ping yet, but will try tonight.
Re: Very strange repeating alert (Nagios XI eval)
Posted: Mon Sep 14, 2015 11:39 am
by jdalrymple
stf_792 wrote: I have not tried a continuous ping yet, but will try tonight.
Please do.
There isn't a lot of magic with check_ping. I personally have not seen an instance where pinging a server worked, but check_ping yielded lost packets as you're seeing.
When you're running your persistent ping, watch Nagios XI simultaneously and see if they offer up similar data. I expect they will.
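To make that side-by-side comparison easier, each probe can be logged with a timestamp in the same date format as the Nagios XI state history. This is only a sketch: it assumes Linux `ping` flags (`-c` for count, `-W` for a timeout in seconds), and `probe_once`, `format_probe`, and `watch` are hypothetical names.

```python
import subprocess
import time
from datetime import datetime


def probe_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single echo request is answered (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def format_probe(when: datetime, host: str, ok: bool) -> str:
    """Format one probe result so it lines up with the state history timestamps."""
    status = "OK" if ok else "LOST"
    return f"{when:%m/%d/%Y %H:%M:%S} {host} {status}"


def watch(host: str, duration_s: float = 600, interval_s: float = 1.0, probe=probe_once):
    """Probe the host repeatedly, printing and collecting one line per probe."""
    lines = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        line = format_probe(datetime.now(), host, probe(host))
        print(line)
        lines.append(line)
        time.sleep(interval_s)
    return lines
```

Any run of LOST lines can then be matched against the DOWN/CRITICAL entries Nagios XI records for the same minutes.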