Our Nagios XI instance stopped updating service and host status at 11:30 pm.
After a lot of troubleshooting we powered on a VM backup taken a few hours earlier, at 8 pm. It showed the same problem: no checks were executed. This was unexpected, because at 8:30 pm Nagios had been working correctly. The monitoring engine was up but the event queue was empty. We then set the date and time back to 8:30 pm, and Nagios started executing checks and updating service status! When we set the date and time back to the current time, Nagios stopped working again.
Finally we found a similar case here https://support.nagios.com/forum/viewto ... =7&t=37028
We removed the retention.dat, objects.cache and objects.precache files and started Nagios. Everything has worked correctly since then.
So the problem is probably solved, but we would like some advice about what may have happened.
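For anyone repeating the cleanup described above, here is a minimal sketch that moves the files aside instead of deleting them, so they stay available for later analysis. It assumes Nagios has been stopped first and that the files use the stock names retention.dat, objects.cache and objects.precache; the real locations are whatever nagios.cfg sets. The function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path

# Assumed stock file names; check state_retention_file, object_cache_file and
# precached_object_file in nagios.cfg for the real locations on your system.
STATE_FILES = ("retention.dat", "objects.cache", "objects.precache")


def set_aside_state_files(var_dir, backup_dir):
    """Move the Nagios state files out of var_dir; return the names moved."""
    var_dir, backup_dir = Path(var_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in STATE_FILES:
        src = var_dir / name
        if src.exists():
            # Keep a copy for post-mortem analysis instead of deleting.
            shutil.move(str(src), str(backup_dir / name))
            moved.append(name)
    return moved
```

It would be called with the Nagios var directory (often /usr/local/nagios/var, but that path is an assumption here) and any backup directory, before starting Nagios again.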
Nagios stops working
-
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Nagios stops working
Hello, @odino2016. Changing the time might have caused Nagios to write scheduled check entries with wrong timestamps to the retention.dat file. I don't have enough information to draw a firm conclusion about what caused the issue. Knowing the system time and PHP time values at the moment Nagios stopped working would have been helpful.
In the XI web interface, please go to the Admin menu, click on System Profile in the left column, and then click on "View System Info".
Find the paragraph that says Date/Time and copy and paste all the time information listed underneath.
We need to make sure that: a) the PHP timezone is correct, b) the time is correct, and c) the PHP time exactly matches the system time.
Also, the next time the Nagios process stops, please download the system profile zip and upload it here. That way I can take a look at current log files. Most likely they will contain errors that will help us troubleshoot this problem.
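Check c) above can also be done mechanically once the two values are copied out of the Date/Time paragraph. This is only a sketch: it assumes both values are RFC 2822 style date strings such as "Tue, 18 Sep 2018 10:51:58 +0000", and the function name is made up for illustration.

```python
from email.utils import parsedate_to_datetime


def clock_skew_seconds(php_time, system_time):
    """Absolute difference in seconds between two RFC 2822 date strings."""
    php = parsedate_to_datetime(php_time)
    system = parsedate_to_datetime(system_time)
    # Both values carry a UTC offset, so the subtraction compares instants,
    # not wall-clock strings.
    return abs((php - system).total_seconds())


# Illustrative values only, not taken from a real system profile.
skew = clock_skew_seconds(
    "Tue, 18 Sep 2018 10:51:58 +0000",
    "Tue, 18 Sep 2018 10:52:03 +0000",
)
print(skew)  # 5.0
```

A result of 0.0 means the PHP time and the system time refer to the same instant, even if they were printed with different UTC offsets.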
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Nagios stops working
Thanks for replying.
I will follow your suggestions now, and again if the problem recurs.
As for the time sync: the system has been up and running for 300 days, uses 2 local Stratum 1 NTP servers as reference, and has never experienced a time shift (but I will double check this).
I saved a few files for post-mortem analysis.
At Sep 14 23:58:15 UTC 2018, retention.dat reported that it was created at Sep 14 23:29:53 UTC 2018, and the most recent last_check was at the very same time, Sep 14 23:29:53 UTC 2018. The latest next_check was Sep 14 23:39:53 UTC 2018.
So by this time Nagios was probably already not executing checks and not updating retention.dat.
By the way, the system and PHP times are in sync:
Date/Time
PHP Timezone: UTC
PHP Time: Tue, 18 Sep 2018 10:51:58 +0000
System Time: Tue, 18 Sep 2018 10:51:58 +0000
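The retention.dat figures quoted in this post (creation time, most recent last_check, latest next_check) can be extracted with a short script. The sketch below assumes the usual layout of `type {` blocks holding `key=value` lines with epoch-second timestamps; the function names and the sample data are invented for illustration.

```python
from datetime import datetime, timezone


def summarize_retention(text):
    """Return created time and latest last_check/next_check as epoch seconds."""
    created = None
    last_checks, next_checks = [], []
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            block = line[:-1].strip()  # e.g. "info", "host", "service"
        elif line == "}":
            block = None
        elif block and "=" in line:
            key, _, value = line.partition("=")
            if block == "info" and key == "created":
                created = int(value)
            elif block in ("host", "service") and value.isdigit():
                if key == "last_check":
                    last_checks.append(int(value))
                elif key == "next_check":
                    next_checks.append(int(value))
    return {
        "created": created,
        "latest_last_check": max(last_checks, default=None),
        "latest_next_check": max(next_checks, default=None),
    }


def as_utc(epoch):
    """Format epoch seconds as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(
        "%b %d %H:%M:%S UTC %Y"
    )


# Invented sample data in the assumed layout, not a real retention.dat.
SAMPLE = """info {
created=1536967793
}
host {
host_name=example-host
last_check=1536967793
next_check=1536968393
}
"""

for name, epoch in summarize_retention(SAMPLE).items():
    print(name, as_utc(epoch))
```

If the latest next_check is well in the past when the file is inspected, as it was here, the scheduler has most likely stopped rescheduling checks.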
Re: Nagios stops working
@odino2016, I agree. It looks like Nagios had already stopped scheduling checks at that point.
Feel free to PM me your system profile if this issue comes up again.
*When/if you send me a PM, please post something in the forum thread to bring it back up in the support queue.
Re: Nagios stops working
Ticket received. Locking this thread; we will continue support in the ticket.