Host recovery emails being sent while host is unreachable
Hey guys,
We seem to be getting host recovery emails even though the host is unreachable.
This is on Nagios XI version 5.3.4.
Notification Type: RECOVERY
Host: TLC_Bankomat Upravna zgrada 3. sprat
State: UP
Address: IP ADDRESS
Info: OK - IP ADDRESS: rta 4.244ms, lost 20%
Date/Time: 2016-12-27 20:46:47
The very next check shows the host as unreachable, and we know for certain that it is offline because of a power failure.
This happens on all hosts, regardless of host type, device type, IP address, etc.
The problem occurs with PING-based checks when a device has been unreachable for a long time (more than a few hours).
There is no pattern in when the problem occurs.
We have the same configuration running on Nagios® Core™ version 3.2.3, which is due to be replaced by Nagios XI.
We do not experience this problem on Nagios Core.
Thank you in advance.
Re: Host recovery emails being sent while host is unreachable
Are you using check_ping or check_icmp? Can you show us the full host definition and the command definition for the related objects?
Former Nagios Employee
Re: Host recovery emails being sent while host is unreachable
We tried both check commands without success. check_icmp is currently configured.
The problem occurs regardless of the check command and regardless of whether we acknowledge the problem.
Here are the definitions of the related objects:
define host {
    name                        host_tmplt
    check_command               check-host-alive!!!!!!!!
    max_check_attempts          10
    check_interval              5
    retry_interval              1
    check_period                24x7
    flap_detection_enabled      1
    contact_groups              Telecom
    notification_interval       120
    notification_period         24x7
    first_notification_delay    0
    notification_options        d,u,r,
    register                    0
}
define command {
    command_name    check-host-alive
    command_line    $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
define host {
    host_name    Host_definition
    use          host_tmplt
    alias        Airpoty City
    address      *IP_ADDRESS*
    register     1
}
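The Info line in the recovery e-mail (rta 4.244ms, lost 20%) can be compared with the thresholds in the check-host-alive command (-w 3000.0,80% -c 5000.0,100% -p 5). The Python sketch below is a simplified model of that comparison, assuming that either metric reaching its threshold raises the state; it is not the plugin's source code, and Threshold, parse_threshold and classify are hypothetical names. With -p 5, a 20% loss means 4 of the 5 echo requests were answered.

```python
from dataclasses import dataclass

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

@dataclass
class Threshold:
    rta_ms: float    # round-trip average threshold, milliseconds
    loss_pct: float  # packet-loss threshold, percent

def parse_threshold(spec: str) -> Threshold:
    """Parse an 'RTA,LOSS%' pair such as '3000.0,80%'."""
    rta, loss = spec.split(",")
    return Threshold(float(rta), float(loss.rstrip("%")))

def classify(rta_ms: float, loss_pct: float, warn: Threshold, crit: Threshold) -> int:
    """Simplified model: either metric reaching its threshold raises the state."""
    if rta_ms >= crit.rta_ms or loss_pct >= crit.loss_pct:
        return CRITICAL
    if rta_ms >= warn.rta_ms or loss_pct >= warn.loss_pct:
        return WARNING
    return OK

warn = parse_threshold("3000.0,80%")   # -w from check-host-alive
crit = parse_threshold("5000.0,100%")  # -c from check-host-alive

# Result quoted in the recovery e-mail: rta 4.244ms, lost 20%
print(classify(4.244, 20.0, warn, crit))  # 0 (OK)
# All 5 packets lost (rta given as 0.0 here purely for the model)
print(classify(0.0, 100.0, warn, crit))   # 2 (CRITICAL)
```

Under this model, -c 5000.0,100% makes the check CRITICAL only when every packet is lost, so a single run in which some echo replies do arrive is enough to produce a non-critical result.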
Re: Host recovery emails being sent while host is unreachable
I haven't heard of this happening before. The results suggest that traffic does get through at times. If you run ping -t <ip-address> until an email alert arrives, do the results ever match what Nagios is seeing?
If it's a false alarm, I would check whether multiple nagios processes are running. What is the output of ps -ef | grep nagios.cfg?
Lastly, if you'd like to ignore these hosts, you could set their check command to check_dummy 0, which always returns an OK state.
Former Nagios Employee
Re: Host recovery emails being sent while host is unreachable
Output of ps -ef | grep nagios.cfg
UID PID PPID C STIME TTY TIME CMD
nagios 5353 1 0 Dec13 ? 00:10:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5429 5353 0 Dec13 ? 00:02:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11003 1 3 08:01 ? 00:00:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11081 11003 0 08:01 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36932 1 0 Dec07 ? 00:44:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 37009 36932 0 Dec07 ? 00:05:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Yes, we tried ping -t from that host and actually received no response. We also run the same configuration on another server with the old Nagios Core, and the problem does not occur there.
We are sure these alerts are false. We know the host is unreachable (the cable is unplugged), yet we receive a notification that the host is up, and a few minutes later (on the very next check) it is seen as down again, as it should be.
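The symptom described here (an UP notification, then DOWN on the next check) fits how Nagios treats hard states: once a host is in a HARD DOWN state, a single UP result counts as an immediate hard recovery, whereas getting back to HARD DOWN takes max_check_attempts consecutive failures (10 in host_tmplt). The Python sketch below is a simplified model under those assumptions only; it ignores UNREACHABLE/parent logic, flapping, acknowledgements and notification intervals, and HostState and process are hypothetical names, not a Nagios API.

```python
from dataclasses import dataclass, field

@dataclass
class HostState:
    max_check_attempts: int
    state: str = "UP"         # "UP" or "DOWN" (UNREACHABLE omitted)
    state_type: str = "HARD"  # "SOFT" or "HARD"
    attempt: int = 1
    notifications: list = field(default_factory=list)

    def process(self, result: str) -> None:
        """Apply one host check result ("UP" or "DOWN") to the model."""
        if result == "UP":
            if self.state == "DOWN" and self.state_type == "HARD":
                # One good result ends a HARD problem immediately.
                self.notifications.append("RECOVERY")
            # Soft recoveries (from SOFT DOWN) send no notification.
            self.state, self.state_type, self.attempt = "UP", "HARD", 1
            return
        if self.state == "UP":
            # First failure starts a new SOFT sequence.
            self.state, self.state_type, self.attempt = "DOWN", "SOFT", 0
        if self.state_type == "SOFT":
            self.attempt += 1
            if self.attempt >= self.max_check_attempts:
                self.state_type = "HARD"
                self.notifications.append("PROBLEM")
        # Already HARD DOWN: stays HARD DOWN (re-notifications not modelled).

host = HostState(max_check_attempts=10)  # value from host_tmplt
for _ in range(10):
    host.process("DOWN")   # cable unplugged: 10 failures -> HARD DOWN
host.process("UP")         # one stray successful check
host.process("DOWN")       # the very next check fails again
print(host.notifications)                         # ['PROBLEM', 'RECOVERY']
print(host.state, host.state_type, host.attempt)  # DOWN SOFT 1
```

In this model a single spurious OK result is enough to send a RECOVERY notification, while the following failures only produce SOFT states until all 10 attempts are used up.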
Re: Host recovery emails being sent while host is unreachable
I believe the multiple processes are causing this. You have three nagios processes whose parent is PID 1 (PIDs 5353, 11003 and 36932), i.e. three separate daemons, when the worker processes should all be children of a single main nagios process. I would run killall nagios and then start a single instance with service nagios start.
Then run ps -ef | grep nagios.cfg again; you should see only two processes, like this:
Code:
nagios 5353 1 0 Dec13 ? 00:10:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5429 5353 0 Dec13 ? 00:02:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Former Nagios Employee
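The check described in this reply (more than one nagios process whose parent is PID 1) can be scripted. The Python sketch below parses ps -ef output and reports nagios processes whose parent is not itself a nagios process, i.e. independent daemons. parse_ps_ef, is_nagios and nagios_daemons are hypothetical helper names, and matching on the executable's basename "nagios" is an assumption about how the binary is started.

```python
from typing import NamedTuple

class Proc(NamedTuple):
    pid: int
    ppid: int
    cmd: str

def parse_ps_ef(text: str) -> list[Proc]:
    """Parse `ps -ef` rows: UID PID PPID C STIME TTY TIME CMD."""
    procs = []
    for line in text.splitlines():
        fields = line.split(None, 7)  # keep CMD (with its spaces) intact
        if len(fields) < 8 or not fields[1].isdigit():
            continue                  # header or blank/short line
        procs.append(Proc(int(fields[1]), int(fields[2]), fields[7]))
    return procs

def is_nagios(proc: Proc) -> bool:
    # Match the executable's basename, so a "grep nagios.cfg" row is ignored.
    return proc.cmd.split()[0].rsplit("/", 1)[-1] == "nagios"

def nagios_daemons(procs: list[Proc]) -> list[int]:
    """PIDs of nagios processes whose parent is not another nagios process."""
    nagios_pids = {p.pid for p in procs if is_nagios(p)}
    return [p.pid for p in procs if is_nagios(p) and p.ppid not in nagios_pids]

# Output posted earlier in this thread, before the restart.
BEFORE = """\
UID PID PPID C STIME TTY TIME CMD
nagios 5353 1 0 Dec13 ? 00:10:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 5429 5353 0 Dec13 ? 00:02:30 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11003 1 3 08:01 ? 00:00:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 11081 11003 0 08:01 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 36932 1 0 Dec07 ? 00:44:11 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 37009 36932 0 Dec07 ? 00:05:02 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
"""

daemons = nagios_daemons(parse_ps_ef(BEFORE))
print(daemons)  # [5353, 11003, 36932]
if len(daemons) > 1:
    print("more than one nagios daemon is running")
```

Run against the expected two-line output shown above, the same function would report a single daemon, [5353].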
Re: Host recovery emails being sent while host is unreachable
Now we have only 2 processes, as you explained, but the problem is still present.
I have copied the System Profile page from Nagios XI into the attached file.
The thread below seems similar. Could you check what the resolution of that problem was?
https://support.nagios.com/forum/viewto ... ifications
Re: Host recovery emails being sent while host is unreachable
Can you please attach the nagios.log from 12-27-2016? It'll be located at /usr/local/nagios/var/archives/nagios-12-27-2016-00.log.
I'd like to see what the system is reporting at that time through the log file.
Former Nagios Employee
Re: Host recovery emails being sent while host is unreachable
Here is the nagios log for December 27, 2016.
Re: Host recovery emails being sent while host is unreachable
It does indeed look like it detected the host state change:
Code:
[1482868004] HOST ALERT: TLC_Bankomat Upravna zgrada 3. sprat;UP;HARD;5;OK - 172.21.228.210: rta 4.244ms, lost 20%
It could be a bug in the plugins. What version are you running?
Code:
/usr/local/nagios/libexec/check_ping -V
/usr/local/nagios/libexec/check_icmp -V
Former Nagios Employee
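To list every state change for one host in an archive such as /usr/local/nagios/var/archives/nagios-12-27-2016-00.log, the HOST ALERT lines can be parsed. The field layout used below (`[epoch] HOST ALERT: host;state;state type;attempt;plugin output`) is inferred from the single log line quoted in this thread; the regex and the helper names are assumptions, not a parser shipped with Nagios. The epoch is converted to UTC here, so it will not necessarily match the local time printed in the notification e-mail.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple, Optional

# [epoch] HOST ALERT: host;state;state_type;attempt;plugin output
# (layout inferred from the log line quoted in this thread)
HOST_ALERT_RE = re.compile(
    r"^\[(\d+)\] HOST ALERT: ([^;]*);([^;]*);([^;]*);(\d+);(.*)$"
)

class HostAlert(NamedTuple):
    when: datetime   # timestamp converted to UTC
    host: str
    state: str       # UP / DOWN / UNREACHABLE
    state_type: str  # SOFT / HARD
    attempt: int
    output: str      # plugin output

def parse_host_alert(line: str) -> Optional[HostAlert]:
    """Return a HostAlert for a HOST ALERT line, or None for any other line."""
    m = HOST_ALERT_RE.match(line.strip())
    if m is None:
        return None
    ts, host, state, state_type, attempt, output = m.groups()
    return HostAlert(datetime.fromtimestamp(int(ts), tz=timezone.utc),
                     host, state, state_type, int(attempt), output)

def host_alerts(lines: Iterable[str], host: str) -> Iterator[HostAlert]:
    """Yield the HOST ALERT entries for one host, in log order."""
    for line in lines:
        alert = parse_host_alert(line)
        if alert is not None and alert.host == host:
            yield alert

LINE = ("[1482868004] HOST ALERT: TLC_Bankomat Upravna zgrada 3. sprat;"
        "UP;HARD;5;OK - 172.21.228.210: rta 4.244ms, lost 20%")
alert = parse_host_alert(LINE)
print(alert.when.isoformat(), alert.state, alert.state_type, alert.attempt)
# 2016-12-27T19:46:44+00:00 UP HARD 5
```

To scan a whole archive, open the log file and pass the file object to `host_alerts(f, "TLC_Bankomat Upravna zgrada 3. sprat")`; each yielded entry shows one state transition together with the plugin output that caused it.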