Service recovery emails generated without status change

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

Ok, those new commands ran successfully, thank you.

Attached is the updated State History.

Also, I PM'd you an updated System Profile. Thank you!
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Service recovery emails generated without status change

Post by npolovenko »

@rferebee, I looked at your old apache logs and it seems like you had some timing issues in the past which were fixed just recently.
PHP Warning: strtotime(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier.
That could have caused timing issues with host and service checks, which in turn could have triggered these email notifications.
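To confirm whether date.timezone is actually set, the php.ini file can be inspected directly. The following is only a minimal sketch: the helper name php_ini_timezone is made up for illustration, and /etc/php.ini is an assumed location (common on CentOS/RHEL, but it may differ on your system).

```shell
#!/bin/bash
# Hypothetical helper: print the date.timezone value from a php.ini-style file.
# Prints an empty line (or nothing) when the setting is commented out, empty, or absent.
php_ini_timezone() {
    local ini_file=$1
    # Match only uncommented "date.timezone = value" lines, drop optional quotes
    # and any trailing ";" comment, then trim trailing whitespace.
    sed -n -E 's/^[[:space:]]*date\.timezone[[:space:]]*=[[:space:]]*"?([^";]*)"?.*$/\1/p' "$ini_file" \
        | sed -E 's/[[:space:]]+$//' \
        | tail -n 1
}

# Example (assumed path):
# php_ini_timezone /etc/php.ini
```

If this prints nothing, the warning above would be expected; setting date.timezone to a valid identifier and restarting Apache should make it go away.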
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

So, the Recovery notifications went out again this morning. All different service checks from the ones yesterday.

It appears to happen during or right after we run our fail over database restore. We have a job configured to run every day at 10AM that writes the config from our production server to our fail over mirror server. Somehow that is triggering these notifications to go out.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

Actually, I was incorrect. It's not the fail over restore that runs at 10AM. It's a system reboot for a "memory leak issue" (and I don't know whether that's even still an issue). Here are the contents of failover_event_handler.sh:

root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4
#Verify host
if [ $HOSTNAME != 'nagiosxi_monitor' ]
then exit
fi
#Case on host state
case "$HOSTSTATE" in
    UP)
        #Turn failover notifications off if host returning to up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ $ns1 == 'UP' ] || [ $ns2 == 'UP' ]
        then
            #Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
            exit 0
        else
            exit 0
        fi
        ;;
    UNREACHABLE)
        if [ $ns1 == 'UP' ] || [ $ns2 == 'UP' ]
        then
            #Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
            exit 0
        else
            exit 0
        fi
        ;;
esac
#set +x


I think all it does is reboot our environment to test that fail over is working, but I'm not 100% sure. When this finishes around 10:11AM is when all the Service Recovery emails get sent out.
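For reference, the decision logic in the handler can be sketched as a single function. This is not the script we run, just an illustration: the function name failover_action is made up, the curl calls are replaced by echo so nothing is sent to Nagios, and the variables are quoted because the unquoted `[ $HOSTNAME != ... ]` style tests would raise a "unary operator expected" error if an argument arrived empty.

```shell
#!/bin/bash
# Hypothetical helper: print which action the handler would take
# ("disable", "enable", or "none") instead of calling cmd.cgi.
failover_action() {
    local hostname=$1 hoststate=$2 ns1=$3 ns2=$4

    # Only act for the monitored XI host.
    if [ "$hostname" != 'nagiosxi_monitor' ]; then
        echo none
        return 0
    fi

    case "$hoststate" in
        UP)
            # Corresponds to the cmd_typ=11 call (failover notifications off).
            echo disable
            ;;
        DOWN|UNREACHABLE)
            # Corresponds to the cmd_typ=12 call (failover notifications on),
            # taken only when at least one of the other two hosts is UP.
            if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
                echo enable
            else
                echo none
            fi
            ;;
        *)
            echo none
            ;;
    esac
}

# Example:
# failover_action nagiosxi_monitor DOWN UP DOWN
```

Read this way, the handler only toggles notifications on the fail over server; nothing in it reboots anything by itself.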
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Service recovery emails generated without status change

Post by npolovenko »

@rferebee, I don't see anything out of the ordinary in this script. Can you generate a state history report (in PDF) for today for all service checks, and another PDF report of today's notifications for all service checks, and upload both reports to this thread? Or PM them to me?

Reports shared internally with the support team.
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

PM sent, thank you!
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Service recovery emails generated without status change

Post by npolovenko »

@rferebee, Thank you. I saw a bunch of service recoveries in the morning but didn't see the services going into critical states before that. On your previous report for the pug host, I also noticed some discrepancies that are hard to explain. It could be that you had duplicate nagios processes running at that time or that the retention.dat file got corrupted. Can you provide a very detailed explanation of what is happening with the primary XI server and the recovery XI server at 10 AM? Is the XI server restarting? Is there a script that restarts it? Is one server restoring from the other, and how? What files and configurations are affected during that process?
Thank you
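One rough way to check for duplicate nagios processes is sketched below. It rests on assumptions: that Nagios Core 4 starts its worker processes as children of one main daemon (so several nagios processes are normal), and that the main daemon's parent PID is 1 on your system. The helper name count_top_level is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read "PID PPID" pairs on stdin and print how many
# of them have parent PID 1 (i.e. look like independent daemons).
count_top_level() {
    awk '$2 == 1 { n++ } END { print n + 0 }'
}

# Example (procps ps; "=" suppresses the column headers):
# ps -C nagios -o pid=,ppid= | count_top_level
```

A result greater than 1 would suggest that more than one Nagios daemon is running against the same files, which is worth ruling out.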
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

Yes, I got some clarification on this yesterday.

All that's happening is a server reboot. First, the Primary XI server reboots to verify that fail over is working properly and then the Recovery XI server reboots to set the Primary XI server back to primary.

Here's the script that runs to do what's mentioned above:

root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4
#Verify host
if [ $HOSTNAME != 'nagiosxi_monitor' ]
then exit
fi
#Case on host state
case "$HOSTSTATE" in
    UP)
        #Turn failover notifications off if host returning to up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ $ns1 == 'UP' ] || [ $ns2 == 'UP' ]
        then
            #Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
            exit 0
        else
            exit 0
        fi
        ;;
    UNREACHABLE)
        if [ $ns1 == 'UP' ] || [ $ns2 == 'UP' ]
        then
            #Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
            exit 0
        else
            exit 0
        fi
        ;;
esac
#set +x


Thank you!
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Service recovery emails generated without status change

Post by npolovenko »

@rferebee, Thank you. I looked on GitHub and your issue sounds very similar to the one described in this bug report:
https://github.com/NagiosEnterprises/na ... issues/624

It's still a bit suspicious that you get all the recovery emails right after the server reboots. It also wasn't clear from the script you provided what exactly triggers the server reboot. Is there a way to not reboot the server for one day and see if the issue happens regardless? Also, if you are not afraid of losing scheduled downtime, acknowledgments, and host/service comments, I'd recommend deleting the status.dat and retention.dat files:
service nagios stop
rm -rf /usr/local/nagios/var/status.dat
rm -rf /usr/local/nagios/var/retention.dat
service nagios start
You should back up XI prior to this.

The reason why this may be a good step is that false recovery notifications could be related to corrupted counters in these files. Deleting them will force nagios to rebuild service status information and notification counters.
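If you would rather keep copies of the two files than delete them outright, they can be moved aside instead. The following is a minimal sketch, not an official procedure: the function name set_aside_state_files and the state-backup-* directory name are made up for illustration, and it assumes Nagios has already been stopped.

```shell
#!/bin/bash
# Hypothetical helper: move status.dat and retention.dat from the given
# directory into a timestamped backup directory and print that directory.
set_aside_state_files() {
    local var_dir=$1
    local backup_dir f
    backup_dir="$var_dir/state-backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in status.dat retention.dat; do
        # Skip files that do not exist rather than failing.
        if [ -f "$var_dir/$f" ]; then
            mv "$var_dir/$f" "$backup_dir/" || return 1
        fi
    done
    echo "$backup_dir"
}

# Example, using the same path and service commands as the steps above:
# service nagios stop
# set_aside_state_files /usr/local/nagios/var
# service nagios start
```

Nagios still starts without the two files in place, so the effect on notification counters should be the same as deleting them, while the old files remain available for inspection.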
rferebee
Posts: 733
Joined: Wed Jul 11, 2018 11:37 am

Re: Service recovery emails generated without status change

Post by rferebee »

Yes, we can definitely stop the server from rebooting for a day to test.

Let me speak with my team to see if we're comfortable with the second part of this update. I'll let you know.

I'll let you know tomorrow what not letting the server reboot does, if anything. Thank you!