Delays for hosts coming out of scheduled downtime
Posted: Wed Mar 17, 2021 3:09 pm
by jvaira
Hello,
It was brought to my attention the other day that when a user put about 100 hosts into scheduled downtime, there was a delay in removing the downtime when the end time was reached. For example, the downtime was scheduled from 10:45 to 11:45, but in the event log I could see some of those hosts still exiting downtime until about 12:30. Other users also reported seeing scheduled downtime acknowledgments on those hosts well past the 11:45 end time. I was able to recreate this even with a small group of 10 hosts. Although the delay was not as significant, some of them did exit downtime about a minute past the end time. Is there a reason why the hosts cannot all exit downtime at the same time?
Re: Delays for hosts coming out of scheduled downtime
Posted: Thu Mar 18, 2021 10:58 am
by lmiltchev
We are going to try to recreate the issue in house even though I suspect this is something normal. Hosts exit downtime when there is a check result. So, there could be some delay depending on when the next check occurs. The delay could be bigger on "busy" systems with lots of checks and some latency. It would be expected to see a delay of a minute up to the check_interval time, even on a "healthy" system.
Are you scheduling fixed or flexible downtime?
Re: Delays for hosts coming out of scheduled downtime
Posted: Thu Mar 18, 2021 11:51 am
by jvaira
Hello,
I did not know that exiting downtime does not occur until there is a check result. That would explain the one-minute delays on the smaller group, but 45 minutes seems like a long time for a hundred hosts. I assume the same logic applies to scheduled downtime for service checks as well? For example, if the check interval on a specific service check was 30 minutes, is there a potential that the downtime for that service check would not clear until 30 minutes after downtime ends? To answer your question, we are scheduling fixed downtime.
Thank you
Re: Delays for hosts coming out of scheduled downtime
Posted: Fri Mar 19, 2021 9:59 am
by lmiltchev
I haven't been able to reproduce the issue. I tested this a couple of times. Yesterday, I tested it with a small number of objects, but today I placed about 800 objects in fixed, scheduled downtime at the same time. All of the objects were supposed to exit downtime at 9:30 AM and all of them did. One of the services even exited a second earlier.
Code:
[root@main-nagios-xi nagiosxi]# grep 'DOWNTIME' /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e' | grep 'Fri Mar 19' | grep '08:52' | grep 'STARTED' | wc -l
801
[root@main-nagios-xi nagiosxi]# grep 'DOWNTIME' /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e' | grep 'Fri Mar 19' | grep '09:30' | grep 'STOPPED' | wc -l
800
You have mail in /var/spool/mail/root
[root@main-nagios-xi nagiosxi]# grep 'DOWNTIME' /usr/local/nagios/var/nagios.log | perl -pe 's/(\d+)/localtime($1)/e' | grep 'Fri Mar 19' | grep -v '09:30' | grep 'STOPPED'
[Fri Mar 19 09:29:59 2021] SERVICE DOWNTIME ALERT: Tech Switch;Port 7 Status;STOPPED; Service has exited from a period of scheduled downtime
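To see how late each object actually left downtime, the same log can be reduced to a per-object lag in seconds. The following is only a rough sketch: the function name downtime_lag is made up for illustration, and it assumes the raw nagios.log lines start with an epoch timestamp in square brackets (which is what the perl localtime conversion above operates on) and use the "DOWNTIME ALERT: <object>;STOPPED;" layout shown in the output above.

```shell
# downtime_lag LOGFILE END_EPOCH   (hypothetical helper, not part of Nagios)
# Prints "<seconds after END_EPOCH> <host or host;service>" for every
# HOST/SERVICE DOWNTIME ALERT entry whose state is STOPPED.
downtime_lag() {
    awk -v sched_end="$2" '
        /DOWNTIME ALERT:/ && /;STOPPED;/ {
            ts = substr($1, 2, length($1) - 2)    # strip the [ ] around the epoch
            obj = $0
            sub(/^.*DOWNTIME ALERT: /, "", obj)   # drop everything up to the object name
            sub(/;STOPPED;.*$/, "", obj)          # drop the state and the message text
            print ts - sched_end, obj
        }
    ' "$1"
}
```

It could be called along the lines of downtime_lag /usr/local/nagios/var/nagios.log "$(date -d '2021-03-19 09:30:00' +%s)" (GNU date assumed). A negative number means the object left downtime before the scheduled end, and piping the result through sort -n puts the latest exits at the bottom.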
I believe you have to open a support ticket via our support center here:
https://support.nagios.com/tickets/
and provide our support techs with your latest profile (Admin > System Config > System Profile > Download Profile). We would need to review your configs and logs in order to further troubleshoot the issue.
Re: Delays for hosts coming out of scheduled downtime
Posted: Mon Mar 22, 2021 10:28 am
by jvaira
Hello,
After doing some further testing I realized that this is only occurring on our instance that is on version 5.7.5. Our other larger instances are still on older versions due to the fact that we ran into performance issues with NDO 3. In the update release notes I have seen a few modifications that have been made to NDO 3 since 5.7.5. Could this potentially be caused by NDO 3 performance issues in 5.7.5?
Re: Delays for hosts coming out of scheduled downtime
Posted: Mon Mar 22, 2021 5:34 pm
by ssax
You could be hitting a bug in NDO3. I would try upgrading to XI 5.8.2, which has the latest NDO3 fixes, and see if that resolves it.
Otherwise, you can do this to downgrade NDO3 back to NDO2DB (the XI version stays the same), apply configuration, and then see if the issue is resolved:
Run these commands as root:
Code:
systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi
./init.sh
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
If you have an offloaded database, you will need to edit your /usr/local/nagios/etc/ndo2db.cfg file and update the database connection settings before starting the service:
- You can get the info from your /usr/local/nagios/etc/ndo.cfg or from /usr/local/nagiosxi/html/config.inc.php
Then start the ndo2db service (for example, systemctl start ndo2db).
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented (add it if needed):
Code:
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure all occurrences of this line are commented:
Code:
#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
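Before starting Nagios again, the two nagios.cfg edits above can be sanity-checked with a small helper. This is just a sketch: check_broker_modules is a made-up name, and it only looks for an active ndomod.o line and the absence of an active ndo.so line, not for any other configuration problem.

```shell
# check_broker_modules CFG   (hypothetical helper, not part of Nagios XI)
# Succeeds only if an ndomod.o broker_module line is active and
# no ndo.so broker_module line is active in the given nagios.cfg.
check_broker_modules() {
    # An active line starts with broker_module (optionally indented), not with '#'.
    if ! grep -Eq '^[[:space:]]*broker_module=.*ndomod\.o' "$1"; then
        echo "ndomod.o broker_module line is missing or commented out" >&2
        return 1
    fi
    if grep -Eq '^[[:space:]]*broker_module=.*ndo\.so' "$1"; then
        echo "an ndo.so broker_module line is still active" >&2
        return 1
    fi
    echo "broker_module lines look consistent with NDO2DB"
}
```

Running check_broker_modules /usr/local/nagios/etc/nagios.cfg returns a non-zero exit status and prints a short message to stderr if either edit was missed.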
Then start the nagios service (for example, systemctl start nagios, the counterpart of the stop command at the beginning).
Then apply config and validate.