I am having an issue with a secondary Nagios XI server that I use for high availability. My goal is to keep the monitoring agent turned off on the secondary until the primary server goes offline. Once the primary is offline, the floating IP switches to the secondary and is set as the source address, and the monitoring agent is then turned off on the primary (if possible) and turned on for the secondary. The problem is that the monitoring agent on the secondary randomly starts back up, even after I stop it manually. Are there any suggestions to fix this? I am stopping and starting the Nagios monitoring agent with systemd commands. nagios.log doesn't show anything, and I can't find which log might show what is restarting the monitoring agent. I am assuming that it is being restarted by one of the systemd nagios sessions.
Hello, @miwalls. Have you submitted any Apply Configuration commands after you stopped the nagios process? Or any external commands? Has the server been rebooted?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
I actually figured out the issue. I set up high availability using the typical pcsd configuration, with a virtual IP resource for both incoming and source traffic and a systemd resource for nagios. I sync the primary and secondary servers by sending backups over ssh from the primary and then running a restore on the secondary. The problem arises when the secondary server receives the primary server's backup schedules. It attempts to run the backups (or runs them successfully), which flips the nagios systemd unit back on, and then it goes about its happy way alerting me (often in the middle of the night) about stuff that isn't happening. Would you have any recommendations on how to fix this? I was thinking I could delete the backup_xi.sh script entirely after the restore cron job runs on the secondary, so that it can't do anything. I also think a somewhat important change to the backup script, and possibly the restore script, would be to check which services are running beforehand and then return them to that state once the backup or restore finishes.
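The last suggestion, remembering a service's state and putting it back afterwards, could be sketched as a small wrapper like the one below. This is only an illustration under assumptions: `run_preserving_state` is a hypothetical helper name, not part of Nagios XI, and it relies only on the standard `systemctl is-active`, `start`, and `stop` subcommands.

```shell
# Hypothetical wrapper (not part of Nagios XI): run a command, then return
# a systemd unit to whatever state it was in before the command ran.
run_preserving_state() {
    unit=$1
    shift

    # "systemctl is-active --quiet" exits 0 only when the unit is active.
    if systemctl is-active --quiet "$unit"; then
        was_active=1
    else
        was_active=0
    fi

    # Run the wrapped command (for example a backup or restore script)
    # and remember its exit status.
    "$@"
    status=$?

    # Undo any start or stop the wrapped command may have performed.
    if [ "$was_active" -eq 1 ]; then
        systemctl start "$unit"
    else
        systemctl stop "$unit"
    fi

    return "$status"
}
```

On the secondary, a scheduled job could then be invoked as `run_preserving_state nagios <path to backup_xi.sh>`, so a unit that was stopped before the backup is stopped again afterwards. Whether this interacts cleanly with the pcsd-managed resource would need to be checked on the actual cluster.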
@miwalls, are you using the restore_xi.sh script on the secondary failover server? If so, you could modify the script so that it does not automatically start the nagios process after the restore, by commenting out this line:
$BASEDIR/manage_services.sh start nagios
Yeah, what I ended up doing was adding a sed command that replaces the start/restart nagios commands in backup_xi.sh and restore_xi.sh. I run it both before and after the restore, because the restore seems to change these files back to their defaults. The issue is now fixed. Thanks.
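The exact sed command was not posted. A minimal sketch of the idea, assuming the scripts call `manage_services.sh start nagios` or `manage_services.sh restart nagios` in the form quoted earlier in the thread, might look like this. The function name `disable_nagios_start` is hypothetical, and here the lines are commented out rather than replaced.

```shell
# Hypothetical helper: comment out every line that asks manage_services.sh
# to start or restart nagios. Lines already commented out are left alone,
# so it is safe to run both before and after a restore.
disable_nagios_start() {
    for script in "$@"; do
        # Skip paths that do not exist on this server.
        [ -f "$script" ] || continue

        # Keep a .bak copy, then prefix matching lines with "# ".
        # Other services (anything that is not exactly "nagios") are untouched.
        sed -i.bak -E \
            's/^([[:space:]]*)([^#]*manage_services\.sh[[:space:]]+(start|restart)[[:space:]]+nagios([[:space:]].*)?)$/\1# \2/' \
            "$script"
    done
}
```

It would be called with the paths of the two scripts on the secondary, for example `disable_nagios_start /usr/local/nagiosxi/scripts/backup_xi.sh /usr/local/nagiosxi/scripts/restore_xi.sh`. That directory is an assumption about a default install and should be adjusted to wherever the scripts actually live.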