Ok, those new commands ran successfully, thank you.
Attached is the updated State History.
Also, I PM'd you an updated System Profile. Thank you!
Service recovery emails generated without status change
Re: Service recovery emails generated without status change
npolovenko
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Service recovery emails generated without status change
@rferebee, I looked at your old apache logs and it seems like you had some timing issues in the past which were fixed just recently.
That could have caused mismatched timing between host and service checks and, in turn, triggered these email notifications. The logs also show this warning:
PHP Warning: strtotime(): It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier.
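A minimal fix for that warning, assuming a stock php.ini location (it varies by distribution) and using America/Chicago purely as a placeholder timezone:

```ini
; /etc/php.ini on CentOS/RHEL -- the path differs per distro
[Date]
; Any valid tz database identifier works; America/Chicago is a placeholder.
date.timezone = America/Chicago
```

Restart Apache afterwards so PHP picks up the change.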
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Service recovery emails generated without status change
So, the Recovery notifications went out again this morning. All different service checks from the ones yesterday.
It appears to happen during or right after we run our failover database restore. We have a job configured to run every day at 10AM that writes the config from our production server to our failover mirror server. Somehow that is triggering these notifications to go out.
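For reference, a daily 10AM job like that is usually a cron entry; a hypothetical sketch (the script path here is invented for illustration, not taken from the actual system):

```shell
# m h dom mon dow  command -- run the restore/sync at 10:00 every day
0 10 * * * /usr/local/bin/failover_db_restore.sh
```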
Re: Service recovery emails generated without status change
Actually, I was incorrect. It's not the failover restore that runs at 10AM. It's a system reboot for a "memory leak issue" (which I don't know is even still an issue). Here are the contents of failover_event_handler.sh:
root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4

# Verify host (variables are quoted so an empty argument doesn't break the test)
if [ "$HOSTNAME" != 'nagiosxi_monitor' ]; then
    exit 0
fi

# Case on host state
case "$HOSTSTATE" in
    UP)
        # Turn failover notifications off if host is returning to an up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        fi
        exit 0
        ;;
    UNREACHABLE)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
        fi
        exit 0
        ;;
esac
#set +x
I think all it does is reboot our environment to test that fail over is working, but I'm not 100% sure. When this finishes around 10:11AM is when all the Service Recovery emails get sent out.
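As an aside on what those curl calls do: cmd.cgi is one interface for toggling notifications; the same commands can be submitted through Nagios's external command file (normally /usr/local/nagios/var/rw/nagios.cmd). A minimal sketch of the command format, writing to a temp file here rather than the live command pipe so it is safe to run anywhere:

```shell
# Stand-in for /usr/local/nagios/var/rw/nagios.cmd so this sketch is harmless.
cmdfile=$(mktemp)
now=$(date +%s)
# The UP branch wants notifications off; the DOWN/UNREACHABLE branches want them on.
printf '[%s] DISABLE_NOTIFICATIONS\n' "$now" >> "$cmdfile"
printf '[%s] ENABLE_NOTIFICATIONS\n' "$now" >> "$cmdfile"
cat "$cmdfile"
```

On a live system you would write those lines to the real command file instead of a temp file.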
Re: Service recovery emails generated without status change
@rferebee, I don't see anything out of the ordinary in this script. Can you generate a state history report (in PDF) for today for all service checks, plus a second PDF report of all notifications today for all service checks, and upload both reports in this thread? Or PM them to me?
Reports shared internally with the support team.
Re: Service recovery emails generated without status change
PM sent, thank you!
Re: Service recovery emails generated without status change
@rferebee, Thank you. I saw a bunch of service recoveries in the morning, but didn't see services going into critical states before that. On your previous report for the pug host, I also noticed some discrepancies that are hard to explain. It could be that you had duplicate nagios processes running at that time, or that the retention.dat file got corrupted. Can you provide a very detailed explanation of what happens with the primary XI server and the recovery XI server at 10AM? Is the XI server restarting? Is there a script that restarts it? Is one server restoring from another, and how? What files and configurations are affected during that process?
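To rule out the duplicate-process theory, a quick check (this is a generic process listing, not an XI-specific tool):

```shell
# List any running nagios daemons; a healthy box has one main nagios process
# (plus its worker children). Falls back to a message when none are running.
out=$(pgrep -af nagios || echo "no nagios processes found")
echo "$out"
```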
Thank you
Re: Service recovery emails generated without status change
Yes, I got some clarification on this yesterday.
All that's happening is a server reboot. First, the Primary XI server reboots to verify that failover is working properly, and then the Recovery XI server reboots to set the Primary XI server back to primary.
Here's the script that runs to do what's mentioned above:
root@nagiosxi> cat failover_event_handler.sh
#!/bin/bash
#set -x
#exec 5> /tmp/debug_output.txt
#BASH_XTRACEFD="5"
HOSTNAME=$1
HOSTSTATE=$2
ns1=$3
ns2=$4

# Verify host (variables are quoted so an empty argument doesn't break the test)
if [ "$HOSTNAME" != 'nagiosxi_monitor' ]; then
    exit 0
fi

# Case on host state
case "$HOSTSTATE" in
    UP)
        # Turn failover notifications off if host is returning to an up state
        curl -d "cmd_mod=2&cmd_typ=11" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        exit 0
        ;;
    DOWN)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually down
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxx"
        fi
        exit 0
        ;;
    UNREACHABLE)
        if [ "$ns1" = 'UP' ] || [ "$ns2" = 'UP' ]; then
            # Turn failover notifications on if host is actually unreachable
            curl -d "cmd_mod=2&cmd_typ=12" "http://10.231.86.58/nagios/cgi-bin/cmd.cgi" -u "nagios:xxxxxxxx"
        fi
        exit 0
        ;;
esac
#set +x
Thank you!
Re: Service recovery emails generated without status change
@rferebee, Thank you. I looked on GitHub, and your issue sounds very similar to the one described in this bug report:
https://github.com/NagiosEnterprises/na ... issues/624
It's still a bit suspicious that you get all the recovery emails right after the server reboots. It also wasn't clear from the script you provided what exactly triggers the server reboot. Is there a way to not reboot the server for one day and see if the issue happens regardless? Also, if you are not afraid of losing scheduled downtime, acknowledgments, and host/service comments, I'd recommend deleting the status.dat and retention.dat files.
You should back up XI prior to this.
service nagios stop
rm -f /usr/local/nagios/var/status.dat
rm -f /usr/local/nagios/var/retention.dat
service nagios start
The reason why this may be a good step is that false recovery notifications could be related to corrupted counters in these files. Deleting them will force nagios to rebuild service status information and notification counters.
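To illustrate what those corrupted counters might look like: each service block in retention.dat carries a current_notification_number (field name per Nagios Core's retention format); a stale non-zero value can yield recovery notices after a restart. This sketch greps a small hand-made sample rather than the live /usr/local/nagios/var/retention.dat, and the host and values are invented:

```shell
# Build a tiny sample in retention.dat's key=value block format.
sample=$(mktemp)
cat > "$sample" <<'EOF'
service {
host_name=pug
service_description=PING
current_notification_number=3
}
EOF
# On a real system, run this grep against /usr/local/nagios/var/retention.dat.
grep 'current_notification_number' "$sample"
```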
Re: Service recovery emails generated without status change
Yes, we can definitely stop the server from rebooting for a day to test.
Let me speak with my team to see if we're comfortable with the second part of this update. I'll let you know.
I'll let you know tomorrow what not letting the server reboot does, if anything. Thank you!