Ok, those new commands ran successfully, thank you.
Attached is the updated State History.
Also, I PM'd you an updated System Profile. Thank you!
Service recovery emails generated without status change
Re: Service recovery emails generated without status change
npolovenko
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Service recovery emails generated without status change
@rferebee, I looked at your old apache logs and it seems like you had some timing issues in the past which were fixed just recently.
That could have caused mismatched timing between host and service checks and, in turn, triggered these email notifications. The logs also show this warning:
PHP Warning: strtotime(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier.
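A minimal fix for that warning, assuming a stock php.ini location (it varies by distribution) and using America/Chicago purely as a placeholder timezone:

```ini
; /etc/php.ini on CentOS/RHEL -- the path differs per distro
[Date]
; Any valid tz database identifier works; America/Chicago is a placeholder.
date.timezone = America/Chicago
```

Restart Apache afterwards so PHP picks up the change.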
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Service recovery emails generated without status change
So, the Recovery notifications went out again this morning. All different service checks from the ones yesterday.
It appears to happen during or right after we run our failover database restore. We have a job configured to run every day at 10AM that writes the config from our production server to our failover mirror server. Somehow that is triggering these notifications to go out.
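For reference, a daily 10AM job like that is usually a cron entry; a hypothetical sketch (the script path here is invented for illustration, not taken from the actual system):

```shell
# m h dom mon dow  command -- run the restore/sync at 10:00 every day
0 10 * * * /usr/local/bin/failover_db_restore.sh
```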
Re: Service recovery emails generated without status change
Actually, I was incorrect. It's not the failover restore that runs at 10AM. It's a system reboot for a "memory leak issue" (which I don't know is even still an issue). Here are the contents of failover_event_handler.sh:
root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4

# Verify host (variables are quoted so an empty argument doesn't break the test)
if [ "$HOSTNAME" != 'nagiosxi_monitor' ]; then
    exit 0
fi

# Case on host state
case "$HOSTSTATE" in
    UP)
        # Turn failover notifications off if host is returning to an up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        fi
        exit 0
        ;;
    UNREACHABLE)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
        fi
        exit 0
        ;;
esac
#set +x
I think all it does is reboot our environment to test that fail over is working, but I'm not 100% sure. When this finishes around 10:11AM is when all the Service Recovery emails get sent out.
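As an aside on what those curl calls do: cmd.cgi is one interface for toggling notifications; the same commands can be submitted through Nagios's external command file (normally /usr/local/nagios/var/rw/nagios.cmd). A minimal sketch of the command format, writing to a temp file here rather than the live command pipe so it is safe to run anywhere:

```shell
# Stand-in for /usr/local/nagios/var/rw/nagios.cmd so this sketch is harmless.
cmdfile=$(mktemp)
now=$(date +%s)
# The UP branch wants notifications off; the DOWN/UNREACHABLE branches want them on.
printf '[%s] DISABLE_NOTIFICATIONS\n' "$now" >> "$cmdfile"
printf '[%s] ENABLE_NOTIFICATIONS\n' "$now" >> "$cmdfile"
cat "$cmdfile"
```

On a live system you would write those lines to the real command file instead of a temp file.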
Re: Service recovery emails generated without status change
@rferebee, I don't see anything out of the ordinary in this script. Can you generate a state history report (in PDF) for today for all service checks, plus a second PDF report of all notifications today for all service checks, and upload both reports in this thread? Or PM them to me?
Reports shared internally with the support team.
Re: Service recovery emails generated without status change
PM sent, thank you!
Re: Service recovery emails generated without status change
@rferebee, Thank you. I saw a bunch of service recoveries in the morning, but didn't see services going into critical states before that. On your previous report for the pug host, I also noticed some discrepancies that are hard to explain. It could be that you had duplicate nagios processes running at that time, or that the retention.dat file got corrupted. Can you provide a very detailed explanation of what happens with the primary XI server and the recovery XI server at 10AM? Is the XI server restarting? Is there a script that restarts it? Is one server restoring from another, and how? What files and configurations are affected during that process?
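To rule out the duplicate-process theory, a quick check (this is a generic process listing, not an XI-specific tool):

```shell
# List any running nagios daemons; a healthy box has one main nagios process
# (plus its worker children). Falls back to a message when none are running.
out=$(pgrep -af nagios || echo "no nagios processes found")
echo "$out"
```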
Thank you
Re: Service recovery emails generated without status change
Yes, I got some clarification on this yesterday.
All that's happening is a server reboot. First, the Primary XI server reboots to verify that failover is working properly, and then the Recovery XI server reboots to set the Primary XI server back to primary.
Here's the script that runs to do what's mentioned above:
root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4

# Verify host (variables are quoted so an empty argument doesn't break the test)
if [ "$HOSTNAME" != 'nagiosxi_monitor' ]; then
    exit 0
fi

# Case on host state
case "$HOSTSTATE" in
    UP)
        # Turn failover notifications off if host is returning to an up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        fi
        exit 0
        ;;
    UNREACHABLE)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
        fi
        exit 0
        ;;
esac
#set +x
Thank you!
Re: Service recovery emails generated without status change
@rferebee, Thank you. I looked on GitHub, and your issue sounds very similar to the one described in this bug report:
https://github.com/NagiosEnterprises/na ... issues/624
It's still a bit suspicious that you get all the recovery emails right after the server reboots. It also wasn't clear from the script you provided what exactly triggers the server reboot. Is there a way to not reboot the server for one day and see if the issue happens regardless? Also, if you are not afraid of losing scheduled downtime, acknowledgments, and host/service comments, I'd recommend deleting the status.dat and retention.dat files.
You should back up XI prior to this.
service nagios stop
rm -f /usr/local/nagios/var/status.dat
rm -f /usr/local/nagios/var/retention.dat
service nagios start
The reason why this may be a good step is that false recovery notifications could be related to corrupted counters in these files. Deleting them will force nagios to rebuild service status information and notification counters.
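To illustrate what those corrupted counters might look like: each service block in retention.dat carries a current_notification_number (field name per Nagios Core's retention format); a stale non-zero value can yield recovery notices after a restart. This sketch greps a small hand-made sample rather than the live /usr/local/nagios/var/retention.dat, and the host and values are invented:

```shell
# Build a tiny sample in retention.dat's key=value block format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
service {
host_name=pug
service_description=PING
current_notification_number=3
}
EOF
# On a real system, run this grep against /usr/local/nagios/var/retention.dat.
grep 'current_notification_number' "$sample"
```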
Re: Service recovery emails generated without status change
Yes, we can definitely stop the server from rebooting for a day to test.
Let me speak with my team to see if we're comfortable with the second part of this update. I'll let you know.
I'll let you know tomorrow what not letting the server reboot does, if anything. Thank you!