Re: "HOST UP" flood after Nagios Core update
Posted: Wed Jun 27, 2018 9:26 am
by madsantos
scottwilkerson wrote:Can you determine if the emails are coming from this server or the server with obsess_over_services enabled?
We have a "nagios central" which runs passive checks for every host on their different networks. Let's call it Central. Central is the one which sends the emails, but its "obsess_over_hosts" and "obsess_over_services" flags are 0 (as you could see in the previous post, when I sent you the code from nagios.cfg). The flag was never 1 on Central, but it was always 1 on every other machine so that we know when a machine goes offline. If I delete the ochp command or set the flag to 0 on any machine (besides Central), I get no warnings about machine STATE changes, but if I keep the flag at 1, as it was before the update, I get a flood of "HOST IS UP" messages on every single check.
I feel like the problem has more to do with the fact that Nagios does not understand that the previous STATE was UP, and that it shouldn't send any more messages unless the STATE changes. You can see in the picture below that the emails start with a "RECOVERY" notification, which makes no sense because the machine was always ON. I currently have one machine with the ochp command enabled, which I am using to debug at the cost of receiving one message every 5 minutes:
https://imgur.com/a/LCnO2fE
EDIT: So to answer the question, the mail comes from Central, which has neither the "obsess_over_services" nor the "obsess_over_hosts" flag enabled. It only does passive checks and sends emails when it receives STATE changes.
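For readers following along, the behaviour described above (a notification only when the STATE actually changes) can be sketched in a few lines of Python. This is a simplified illustration, not how Nagios Core implements it: the function name `notifications_on_change` is hypothetical, and the sketch ignores soft/hard states, max_check_attempts, and flap detection.

```python
from typing import List, Tuple


def notifications_on_change(results: List[str],
                            initial_state: str = "UP") -> List[Tuple[int, str, str]]:
    """Return (check index, notification kind, new state) for each state change.

    Hypothetical helper for illustration only; it assumes every check
    result is already a hard state.
    """
    previous = initial_state
    events = []
    for index, state in enumerate(results):
        if state != previous:
            # Only a transition produces a notification.
            kind = "RECOVERY" if state == "UP" else "PROBLEM"
            events.append((index, kind, state))
        previous = state
    return events


# A host that goes down once and comes back.
checks = ["UP", "UP", "DOWN", "DOWN", "UP", "UP"]
events = notifications_on_change(checks)
print(events)
```

Under this model a host that reports UP on every check produces no notifications at all, which is the behaviour expected before the update.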
Re: "HOST UP" flood after Nagios Core update
Posted: Wed Jun 27, 2018 3:35 pm
by scottwilkerson
That sheds more light. Can you complete the exercise you did earlier but from the central server?
scottwilkerson wrote:Would it be possible to have you look into objects.cache, grab the full definition for one of these hosts, obfuscate any sensitive info, and post the definition?
Re: "HOST UP" flood after Nagios Core update
Posted: Thu Jun 28, 2018 4:33 am
by madsantos
Hi and thanks for your patience, hopefully we can find the cause of this behavior. Below is the definition of the only host for which I kept "obsess_over_hosts" at 1, and whose "recovery" emails are flooding my inbox.
Code:
define host {
    host_name some_host
    alias Some Host
    address some_address
    check_period 24x7
    check_command check-host-alive
    contact_groups admins
    notification_period 24x7
    initial_state o
    importance 0
    check_interval 5.000000
    retry_interval 1.000000
    max_check_attempts 10
    active_checks_enabled 0
    passive_checks_enabled 1
    obsess 1
    event_handler_enabled 1
    low_flap_threshold 0.000000
    high_flap_threshold 0.000000
    flap_detection_enabled 1
    flap_detection_options a
    freshness_threshold 0
    check_freshness 0
    notification_options r,d
    notifications_enabled 1
    notification_interval 0.000000
    first_notification_delay 0.000000
    stalking_options n
    process_perf_data 1
    retain_status_information 1
    retain_nonstatus_information 1
}
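For anyone repeating this step, a definition like the one above can be pulled out of the object cache with a short script instead of by hand. The following is a minimal sketch, assuming the cache uses the plain `define host { ... }` layout shown above; the function name `find_host` and the inline sample text are hypothetical, and the location of the real file depends on the `object_cache_file` setting in nagios.cfg.

```python
import re
from typing import Dict, Optional

# Matches "define host {" but not "define hostgroup {" and similar.
HOST_START = re.compile(r"define\s+host\s*\{")


def find_host(cache_text: str, host_name: str) -> Optional[Dict[str, str]]:
    """Return the directives of the first host block named host_name, or None."""
    current = None
    for raw in cache_text.splitlines():
        line = raw.strip()
        if HOST_START.match(line):
            current = {}
        elif line == "}":
            if current is not None and current.get("host_name") == host_name:
                return current
            current = None
        elif current is not None and line:
            # Directive name, then the rest of the line as its value.
            parts = line.split(None, 1)
            current[parts[0]] = parts[1] if len(parts) > 1 else ""
    return None


# Inline sample standing in for the real cache file (assumption).
SAMPLE = """
define host {
    host_name some_host
    alias Some Host
    active_checks_enabled 0
    passive_checks_enabled 1
    obsess 1
    notification_options r,d
    }
define service {
    host_name some_host
    service_description PING
    }
"""

host = find_host(SAMPLE, "some_host")
for key in ("active_checks_enabled", "passive_checks_enabled",
            "obsess", "notification_options"):
    print(key, host[key])
```

Printing only a handful of directives keeps the output focused on the settings discussed in this thread; the returned dictionary still holds the full definition.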
Re: "HOST UP" flood after Nagios Core update
Posted: Thu Jun 28, 2018 10:15 am
by scottwilkerson
I honestly don't see why this is happening.
Just to rule one other thing out: on the central server, you have just 1 nagios parent process running, correct?
Re: "HOST UP" flood after Nagios Core update
Posted: Thu Jun 28, 2018 11:55 am
by madsantos
scottwilkerson wrote:I honestly don't see why this is happening.
Just to rule one other thing out: on the central server, you have just 1 nagios parent process running, correct?
Just to clarify things a bit more, the first machine I updated was Central, which didn't affect the number of warnings. It was only after I updated one of the monitored machines that I started receiving all the alerts (from that specific machine; all the machines that haven't been updated are fine). So I can't tell if this would happen if I only updated the other machines while leaving Central on the old version.
Here's the result of the filtered ps command:
Code:
user@localhost:~$ ps -ef | grep nagios.cfg
user 6804 6556 0 17:41 pts/8 00:00:00 grep --color=auto nagios.cfg
nagios 98762 1 0 12:25 ? 00:00:40 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
nagios 98769 98762 0 12:25 ? 00:00:01 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
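In the listing above, PID 98762 has PPID 1 and PID 98769 is its child, so there is a single parent process. The same check can be scripted. The following is a rough sketch, assuming the `ps -ef` column layout shown above; the function name `nagios_parents` is hypothetical and the sample text is copied from the output in this post.

```python
from typing import List


def nagios_parents(ps_output: str) -> List[int]:
    """Return PIDs of nagios processes whose parent is not itself a nagios process."""
    procs = []
    for line in ps_output.strip().splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip a header or malformed line
        argv = fields[7].split()
        # Keep only processes whose executable is the nagios binary,
        # which also drops the grep command itself.
        if argv and argv[0].endswith("/nagios"):
            procs.append((int(fields[1]), int(fields[2])))
    nagios_pids = {pid for pid, _ in procs}
    return [pid for pid, ppid in procs if ppid not in nagios_pids]


PS_OUTPUT = """
user 6804 6556 0 17:41 pts/8 00:00:00 grep --color=auto nagios.cfg
nagios 98762 1 0 12:25 ? 00:00:40 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
nagios 98769 98762 0 12:25 ? 00:00:01 /usr/local/nagios/bin/nagios /usr/local/nagios/etc/nagios.cfg
"""

parents = nagios_parents(PS_OUTPUT)
print(parents)
```

More than one PID in the result would point to a second Nagios instance running against the same configuration.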
Re: "HOST UP" flood after Nagios Core update
Posted: Thu Jun 28, 2018 12:03 pm
by scottwilkerson
And you are 100% sure the central is what is sending the alerts?
Was this image on the central?
https://i.imgur.com/lm5RrG8.png
Can you look at the notification history on the central and one of the upgraded servers?
Really, notifications should be disabled on the non-central machines, but I digress.
Re: "HOST UP" flood after Nagios Core update
Posted: Fri Jun 29, 2018 4:42 am
by madsantos
scottwilkerson wrote:And you are 100% sure the central is what is sending the alerts?
Was this image on the central?
https://i.imgur.com/lm5RrG8.png
Can you look at the notification history on the central and one of the upgraded servers?
Really, notifications should be disabled on the non-central machines, but I digress.
I'm 100% sure that it is indeed the central that is sending the alerts, and I can confirm that the image corresponds to the Central notification area. I've double-checked notifications on the "debug machine" and there are no related entries on the list.
To sum up: Central floods my email and smartphone with "host up" alerts, and you can check their history in Central's notifications area. The other upgraded machines do not send those alerts, and consequently there are no entries in their notifications area.
Re: "HOST UP" flood after Nagios Core update
Posted: Fri Jun 29, 2018 2:16 pm
by scottwilkerson
At present I don't have anything more to suggest other than to simply roll back the servers you upgraded to 4.2.0,
as we cannot replicate the issue and have no other reports of the same or similar happening.
4.2.0 may be downloaded here:
https://assets.nagios.com/downloads/nag ... 2.0.tar.gz
Re: "HOST UP" flood after Nagios Core update
Posted: Mon Jul 02, 2018 5:01 am
by madsantos
scottwilkerson wrote:At present I don't have anything more to suggest other than to simply roll back the servers you upgraded to 4.2.0,
as we cannot replicate the issue and have no other reports of the same or similar happening.
4.2.0 may be downloaded here:
https://assets.nagios.com/downloads/nag ... 2.0.tar.gz
I downgraded Central to 4.2.0 but managed to keep the other machines updated to 4.4.1. Everything is acting as it should now.
Thank you for your time and effort, cheers
Re: "HOST UP" flood after Nagios Core update
Posted: Mon Jul 02, 2018 10:56 am
by scottwilkerson
Well, it sounds like for some reason the issue is just with the central.
If we can ever replicate this, I will post back here.