Very strange repeating alert (Nagios XI eval)

Posted: Mon Sep 14, 2015 10:12 am
by stf_792
Hi,

I am evaluating Nagios XI and currently have about 200 hosts with about 700 services being monitored.

Recently I noticed a strange issue.

I have a vSphere-based host (one of many) with 4 VMs running on it.

Two of those four VMs generate UP/DOWN and then Flapping alerts every day, at almost the same time each day (within a few minutes).

None of the other 198 hosts do that.

During the "DOWN" time, both servers are pingable from the Nagios server and from other PCs on the network. Both servers are Windows-based and came from the same template, but around 5 more servers came from that template and do not produce the error.

There is nothing in the monitored servers' logs.

Searching does not produce any meaningful results. Any pointers on where to look would be greatly appreciated.

Re: Very strange repeating alert (Nagios XI eval)

Posted: Mon Sep 14, 2015 10:21 am
by jdalrymple
Sounds creepishly like this:

https://support.nagios.com/forum/viewto ... el#p151115

At the end of the day we decided it was a network problem on his end. Not to say that's your problem also, but we'll need more info to start with.

1) How were the failing hosts set up? Windows server wizard?
2) What is the status output (not just the status, but the description to go with it) after the hosts enter a down state?
3) Does a persistent ping over 10 minutes or so from your Nagios box yield no packet loss?
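
For question 3, a rough way to run the persistent ping and keep a timestamped record of any loss is sketched below. It assumes a Linux Nagios box with the iputils `ping` (`-c` for count, `-W` for timeout in seconds). The function names and the host name in the usage note are placeholders, not anything from this thread.

```python
import re
import subprocess
import time
from datetime import datetime
from typing import Optional

# Matches the summary line of Linux ping, e.g. "... 0% packet loss, time 4005ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss_percent(ping_output: str) -> Optional[float]:
    """Return the packet-loss percentage from ping's summary, or None if absent."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_once(host: str, timeout_s: int = 2) -> Optional[float]:
    """Send one echo request and return the reported loss percentage."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
        text=True,
    )
    return parse_loss_percent(result.stdout)


def watch(host: str, duration_s: int = 600, interval_s: float = 1.0) -> None:
    """Ping for duration_s seconds and print a timestamp whenever a reply is lost."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        loss = ping_once(host)
        if loss is None or loss > 0:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host} loss={loss}")
        time.sleep(interval_s)
```

Calling `watch("my_server_name.my.domain")` runs for about 10 minutes by default and prints only the moments where a reply was lost, which makes it easy to line up against the Nagios state history afterwards.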

Re: Very strange repeating alert (Nagios XI eval)

Posted: Mon Sep 14, 2015 10:49 am
by stf_792
1) How were the failing hosts set up? Windows server wizard?

The hosts were set up by running the Auto Discovery Wizard on the VLAN, selecting Ping and NetBIOS.

2) What is the status output (not just the status, but the description to go with it) after the hosts enter a down state?

Code:

time	host	service	statechange	state	statetype	currentattempt	maxattempts	laststate	lasthardstate	information
9/14/2015 0:26	<my_server_name>.my.domain	NetBIOS	1	OK	HARD	5	5	CRITICAL	CRITICAL	TCP OK - 0.001 second response time on <my_server_name>.my.domain port 139
9/14/2015 0:25	<my_server_name>.my.domain		1	UP	HARD	1	5	DOWN	UP	OK - <my_server_name>.my.domain: rta 0.311ms, lost 0%
9/14/2015 0:25	<my_server_name>.my.domain	Ping	1	OK	HARD	5	5	CRITICAL	CRITICAL	OK - <my_server_name>.my.domain: rta 0.427ms, lost 0%
9/14/2015 0:20	<my_server_name>.my.domain	Ping	1	CRITICAL	HARD	5	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:19	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	4	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:18	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	3	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:17	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	2	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:16	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	1	5	OK	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:16	<my_server_name>.my.domain		1	DOWN	HARD	5	5	UP	UP	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:16	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	HARD	5	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/14/2015 0:15	<my_server_name>.my.domain		1	DOWN	SOFT	4	5	UP	UP	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:15	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	4	5	CRITICAL	OK	CRITICAL - Socket timeout after 10 seconds
9/14/2015 0:14	<my_server_name>.my.domain		1	DOWN	SOFT	3	5	UP	UP	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:14	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	3	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/14/2015 0:13	<my_server_name>.my.domain		1	DOWN	SOFT	2	5	UP	UP	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/14/2015 0:13	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	2	5	CRITICAL	OK	CRITICAL - Socket timeout after 10 seconds
9/14/2015 0:12	<my_server_name>.my.domain		1	DOWN	SOFT	1	5	UP	UP	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/14/2015 0:12	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	1	5	OK	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:42	<my_server_name>.my.domain	NetBIOS	1	OK	HARD	5	5	CRITICAL	CRITICAL	TCP OK - 0.001 second response time on <my_server_name>.my.domain port 139
9/13/2015 23:41	<my_server_name>.my.domain		1	UP	HARD	1	5	DOWN	UP	OK - <my_server_name>.my.domain: rta 0.325ms, lost 0%
9/13/2015 23:41	<my_server_name>.my.domain	Ping	1	OK	HARD	5	5	CRITICAL	CRITICAL	OK - <my_server_name>.my.domain: rta 0.665ms, lost 0%
9/13/2015 23:27	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	HARD	5	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:26	<my_server_name>.my.domain	Ping	1	CRITICAL	HARD	5	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:26	<my_server_name>.my.domain		1	DOWN	HARD	5	5	UP	UP	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:26	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	4	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:25	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	4	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:25	<my_server_name>.my.domain		1	DOWN	SOFT	4	5	UP	UP	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:25	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	3	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:24	<my_server_name>.my.domain		1	DOWN	SOFT	3	5	UP	UP	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:24	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	3	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: Host unreachable @ 10.8.8.234. rta nan, lost 100%
9/13/2015 23:24	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	2	5	CRITICAL	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:24	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	2	5	CRITICAL	OK	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23	<my_server_name>.my.domain		1	DOWN	SOFT	2	5	UP	UP	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23	<my_server_name>.my.domain	NetBIOS	1	CRITICAL	SOFT	1	5	OK	OK	connect to address <my_server_name>.my.domain and port 139: No route to host
9/13/2015 23:23	<my_server_name>.my.domain		1	DOWN	SOFT	1	5	UP	UP	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
9/13/2015 23:23	<my_server_name>.my.domain	Ping	1	CRITICAL	SOFT	1	5	OK	OK	CRITICAL - <my_server_name>.my.domain: rta nan, lost 100%
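
To see whether the outages really recur at a fixed time, the state history can be reduced to one line per host outage. The sketch below assumes the history is saved as tab-separated text with the same header row as the paste above, and that host rows are the ones with an empty `service` column. `host_outages` is an illustrative name, not a Nagios tool.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M"  # e.g. "9/14/2015 0:26"


def host_outages(tsv_text):
    """Return (down_time, up_time, duration) tuples from HARD host state changes."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Keep only host rows (no service name) that reached a HARD state.
    rows = [r for r in reader if not r["service"] and r["statetype"] == "HARD"]
    # The export lists newest entries first, so sort oldest first.
    rows.sort(key=lambda r: datetime.strptime(r["time"], TIME_FORMAT))

    outages = []
    down_since = None
    for row in rows:
        when = datetime.strptime(row["time"], TIME_FORMAT)
        if row["state"] == "DOWN" and down_since is None:
            down_since = when
        elif row["state"] == "UP" and down_since is not None:
            outages.append((down_since, when, when - down_since))
            down_since = None
    return outages
```

Applied to the rows above, this would pair the HARD DOWN at 23:26 with the HARD UP at 23:41, and the HARD DOWN at 0:16 with the HARD UP at 0:25. Running it over several days of history would show how closely the start times repeat.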

3) Does a persistent ping over 10 minutes or so from your Nagios box yield no packet loss?

I have not tried a continuous ping yet, but will try tonight.

Re: Very strange repeating alert (Nagios XI eval)

Posted: Mon Sep 14, 2015 11:39 am
by jdalrymple
stf_792 wrote: I have not tried a continuous ping yet, but will try tonight.
Please do.

There isn't a lot of magic with check_ping. I personally have not seen an instance where pinging a server worked, but check_ping yielded lost packets as you're seeing.

While your persistent ping is running, watch Nagios XI at the same time and see whether the two show similar data. I expect they will.