Host recovery emails being sent while host is unreachable

BIB · Post by **BIB** » Wed Dec 28, 2016 6:33 am

Hey guys,

We seem to be getting host recovery emails even though the host is unreachable.

This is on Nagios XI version 5.3.4

Notification Type: RECOVERY
Host: TLC_Bankomat Upravna zgrada 3. sprat
State: UP
Address: IP ADDRESS
Info: OK - IP ADDRESS: rta 4.244ms, lost 20%
Date/Time: 2016-12-27 20:46:47

The very next check shows that the host is unreachable, and we know for certain that this host is offline because of a power failure.
It happens on all hosts, regardless of the kind of host, type of device, IP address, etc.
The problem occurs with PING-based checks when the device has been unreachable for a long period (longer than a few hours).
There is no pattern to the times at which the problem occurs.

We have the same configuration running on Nagios® Core™ Version 3.2.3, which is to be replaced by Nagios XI.
We do not experience this problem on Nagios Core.

Thank you in advance.

rkennedy · Post by **rkennedy** » Wed Dec 28, 2016 10:27 am

Are you using check_ping or check_icmp? Can you show us the full host definition, and command definition for the related objects?

BIB · Post by **BIB** » Thu Dec 29, 2016 3:05 am

We tried both check commands, without success. Currently, check_icmp is set.
The problem occurs regardless of the check command and regardless of whether we acknowledge the problem.

Here are the definitions of the related objects:

define host {
    name                        host_tmplt
    check_command               check-host-alive!!!!!!!!
    max_check_attempts          10
    check_interval              5
    retry_interval              1
    check_period                24x7
    flap_detection_enabled      1
    contact_groups              Telecom
    notification_interval       120
    notification_period         24x7
    first_notification_delay    0
    notification_options        d,u,r,
    register                    0
}

define command {
    command_name    check-host-alive
    command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}

define host {
    host_name    Host_definition
    use          host_tmplt
    alias        Airpoty City
    address      *IP_ADDRESS*
    register     1
}
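
For reference, the thresholds in the check-host-alive command can be read against the result reported in the recovery email (rta 4.244ms, lost 20%). The following is a minimal Python sketch, not check_icmp's actual code: the function name classify is hypothetical, and it assumes a threshold counts as breached when the measured value is greater than or equal to it.

```python
# Thresholds taken from the command line: -w 3000.0,80% -c 5000.0,100%
WARN = (3000.0, 80.0)   # (round-trip average in ms, packet loss in %)
CRIT = (5000.0, 100.0)


def classify(rta_ms, loss_pct, warn=WARN, crit=CRIT):
    """Return the plugin state for a measured rta and packet loss.

    Assumption: a threshold is breached when value >= threshold.
    """
    if rta_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL"
    if rta_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING"
    return "OK"


# Values from the recovery email: rta 4.244ms, lost 20%
print(classify(4.244, 20.0))
```

With -p 5, a loss of 20% corresponds to one lost packet out of five, so under these thresholds the check only reports OK when replies were counted for the other four packets.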

rkennedy · Post by **rkennedy** » Thu Dec 29, 2016 10:22 am

I haven't heard of this happening before. The results seem to indicate that traffic is possible at times - if you run a ping -t <ip-address> until an email alert comes - do the results ever align with what Nagios is seeing?

If it's a false alarm, I would look to see if multiple nagios processes are running. What is the output of ps -ef | grep nagios.cfg?

Lastly, if you'd like to ignore these, you could set them to check_dummy 0 - which will always indicate an 'OK' state.

BIB · Post by **BIB** » Fri Dec 30, 2016 2:35 am

Output of ps -ef | grep nagios.cfg
UID PID PPID C STIME TTY TIME CMD
nagios 5353 1 0 Dec13 ? 00:10:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5429 5353 0 Dec13 ? 00:02:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11003 1 3 08:01 ? 00:00:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11081 11003 0 08:01 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36932 1 0 Dec07 ? 00:44:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 37009 36932 0 Dec07 ? 00:05:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Yes, we tried ping -t from that host, and no response was received. We also have the same configuration on another host running the old Nagios Core, where this problem is not present.
We are sure these alarms are false: we know that the host is unreachable (cable unplugged), yet we receive notifications that the host is up. A few minutes later (at the very next check) the host is seen as down, as it should be.

rkennedy · Post by **rkennedy** » Fri Dec 30, 2016 9:23 am

I believe the multiple processes are affecting this. You have 3 nagios processes that were all spawned from PID 1, when there should be only one of those, with its worker spawned from that process's actual PID. I would run a killall for nagios, and then start just a single one with service nagios start.

From there, run a ps -ef | grep nagios.cfg again, and you should see only two running like this -

nagios 5353 1 0 Dec13 ? 00:10:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5429 5353 0 Dec13 ? 00:02:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
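
The check described above can be sketched in Python: count the nagios processes whose parent PID is 1. This is only an illustration; count_daemons is a hypothetical helper, and the sample lines are the ps -ef output posted earlier in the thread.

```python
def count_daemons(ps_lines):
    """Count nagios processes whose PPID is 1 (top-level daemons)."""
    count = 0
    for line in ps_lines:
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[2].isdigit():
            continue  # skip the header line and anything malformed
        if fields[2] == "1" and "nagios.cfg" in fields[7]:
            count += 1
    return count


CMD = "/usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg"
SAMPLE = [
    "UID PID PPID C STIME TTY TIME CMD",
    "nagios 5353 1 0 Dec13 ? 00:10:30 " + CMD,
    "nagios 5429 5353 0 Dec13 ? 00:02:30 " + CMD,
    "nagios 11003 1 3 08:01 ? 00:00:19 " + CMD,
    "nagios 11081 11003 0 08:01 ? 00:00:00 " + CMD,
    "nagios 36932 1 0 Dec07 ? 00:44:11 " + CMD,
    "nagios 37009 36932 0 Dec07 ? 00:05:02 " + CMD,
]

print(count_daemons(SAMPLE))
```

For the six processes posted earlier this counts three top-level daemons; after the killall and a single service nagios start, the expected count is one.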

BIB · Post by **BIB** » Mon Jan 09, 2017 5:41 am

Now we have only 2 processes, as you explained, but the problem is still present.
The System Profile page from Nagios XI is copied into the attached file.
The thread below seems similar; could you check what the resolution of that problem was?
https://support.nagios.com/forum/viewto ... ifications

rkennedy · Post by **rkennedy** » Mon Jan 09, 2017 2:35 pm

Can you please attach the nagios.log from 12-27-2016? It'll be located at /usr/local/nagios/var/archives/nagios-12-27-2016-00.log.

I'd like to see what the system is reporting at that time through the log file.

BIB · Post by **BIB** » Tue Jan 10, 2017 4:31 am

Here is the nagios log for December 27th 2016.

rkennedy · Post by **rkennedy** » Tue Jan 10, 2017 9:33 am

It does indeed look like it detected the host alert change -

[1482868004] HOST ALERT: TLC_Bankomat Upravna zgrada 3. sprat;UP;HARD;5;OK - 172.21.228.210: rta 4.244ms, lost 20%

It could be a bug in the plugins - what version are you running?

/usr/local/nagios/libexec/check_ping -V
/usr/local/nagios/libexec/check_icmp -V
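
The HOST ALERT entry quoted above can be decoded with a short Python sketch. parse_host_alert is a hypothetical helper; it assumes the field order seen in that line (epoch timestamp, host name, state, state type, attempt number, plugin output).

```python
import re
from datetime import datetime, timezone

# Matches "[<epoch seconds>] HOST ALERT: <semicolon-separated fields>"
ALERT_RE = re.compile(r"^\[(\d+)\] HOST ALERT: (.*)$")


def parse_host_alert(line):
    """Split a HOST ALERT log line into named fields, or return None."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    parts = match.group(2).split(";", 4)
    if len(parts) != 5:
        return None
    host, state, state_type, attempt, output = parts
    return {
        "time_utc": datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc),
        "host": host,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


LINE = ("[1482868004] HOST ALERT: TLC_Bankomat Upravna zgrada 3. sprat;"
        "UP;HARD;5;OK - 172.21.228.210: rta 4.244ms, lost 20%")

record = parse_host_alert(LINE)
print(record["time_utc"].isoformat(), record["state"], record["state_type"])
```

The timestamp 1482868004 is 2016-12-27 19:46:44 UTC, which is within a few seconds of the 20:46:47 shown in the recovery email if the server clock is set to UTC+1 (an assumption; the thread does not state the time zone).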
