Nagios stops working

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
odino2016
Posts: 9
Joined: Mon Jun 20, 2016 3:53 am

Nagios stops working

Post by odino2016 »

Our Nagios XI instance stopped updating service and host status at 11:30 pm.
After a lot of troubleshooting, we powered on a VM backup taken a few hours earlier, at 8 pm. It showed the same problem: no checks were executed. This was unexpected, because at 8:30 pm Nagios had been working correctly. The monitoring engine was up, but the event queue was empty. We then set the date and time back to 8:30 pm, and Nagios started executing checks and updating service status. When we set the date and time back to the current time, Nagios stopped working again.
Finally, we found a similar case here: https://support.nagios.com/forum/viewto ... =7&t=37028
We removed the retention.dat, objects.cache and objects.precache files and started Nagios. Everything has worked correctly since then.
So the problem is probably solved, but we would like some advice about what may have happened.
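
For anyone repeating the cleanup above, here is a minimal sketch that moves the three files into a timestamped backup folder instead of deleting them, so they stay available for later analysis. The file names and the `move_state_files` helper are assumptions for illustration; check the `state_retention_file`, `object_cache_file` and `precached_object_file` settings in nagios.cfg for the real locations, and stop Nagios before running anything like this.

```python
import shutil
import time
from pathlib import Path

# Assumed file names; adjust them to match the paths configured in nagios.cfg.
STATE_FILES = ("retention.dat", "objects.cache", "objects.precache")


def move_state_files(var_dir, backup_root):
    """Move Nagios state files out of var_dir into a timestamped backup folder.

    Returns the destination paths of the files that were actually moved.
    Files that do not exist are skipped silently.
    """
    var_dir = Path(var_dir)
    backup_dir = Path(backup_root) / time.strftime("nagios-state-%Y%m%d-%H%M%S")
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in STATE_FILES:
        src = var_dir / name
        if src.is_file():
            dest = backup_dir / name
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

A call might look like `move_state_files("/usr/local/nagios/var", "/root/nagios-backup")`, where both paths are assumptions about a typical install rather than values taken from this thread.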
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Nagios stops working

Post by npolovenko »

Hello, @odino2016. Changing the time might have caused Nagios to write scheduled check entries with wrong timestamps to the retention.dat file. I don't have enough information to draw a firm conclusion about what caused the issue. Knowing the system time and PHP time values when Nagios stopped working would have been helpful.

In the XI web interface, please go to the Admin menu, click System Profile in the left column, and then click "View System Info".
Find the section labeled Date/Time and copy and paste all the time information listed underneath.
We need to make sure that: a) the PHP timezone is correct, b) the time is correct, and c) the PHP time exactly matches the system time.

Also, the next time the Nagios process stops, please download the system profile zip and upload it here. That way I can take a look at the current log files. Most likely they will contain errors that will help us troubleshoot this problem.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
odino2016
Posts: 9
Joined: Mon Jun 20, 2016 3:53 am

Re: Nagios stops working

Post by odino2016 »

Thanks for replying.
I will follow your suggestions now, and again if the problem recurs.
As for time sync: the system has been up and running for 300 days, uses 2 local Stratum 1 NTP servers as reference, and has never experienced a time shift (but I will double-check this).
I saved a few files for post mortem analysis.
At Sep 14 23:58:15 UTC 2018, retention.dat reported that it had been created at Sep 14 23:29:53 UTC 2018, and the most recent last_check was at the very same time, Sep 14 23:29:53 UTC 2018. The highest next_check was Sep 14 23:39:53 UTC 2018.
So by this time Nagios was probably already not executing checks and not updating retention.dat.
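
A rough sketch of how those three values can be pulled out of a saved copy of retention.dat. It assumes the usual layout of `name {` blocks containing `key=value` lines, with `created` in the `info` block and integer epoch values for `last_check` and `next_check` in the `host` and `service` blocks. The function names are my own, not part of Nagios.

```python
import re
from datetime import datetime, timezone

# Assumed layout: "name {" opens a block, "key=value" lines follow, "}" closes it.
BLOCK_START = re.compile(r"^(\w+)\s*\{$")


def summarize_retention(text):
    """Return the creation time plus the newest last_check and next_check found."""
    created = None
    last_checks = []
    next_checks = []
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        start = BLOCK_START.match(line)
        if start:
            block = start.group(1)
            continue
        if line == "}":
            block = None
            continue
        if block is None or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if block == "info" and key == "created":
            created = int(value)
        elif block in ("host", "service"):
            if key == "last_check":
                last_checks.append(int(value))
            elif key == "next_check":
                next_checks.append(int(value))
    return {
        "created": created,
        "max_last_check": max(last_checks, default=None),
        "max_next_check": max(next_checks, default=None),
    }


def to_utc(epoch):
    """Format an epoch timestamp as a UTC string, or '-' when missing or zero."""
    if not epoch:
        return "-"
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%b %d %H:%M:%S UTC %Y")
```

Reading the file with `open(path).read()`, passing the text to `summarize_retention`, and printing each value through `to_utc` gives timestamps in the same form as quoted above. If the newest next_check is well in the past compared with the wall clock, the scheduler has most likely stalled.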

Btw, the system and PHP times are in sync:
Date/Time
PHP Timezone: UTC
PHP Time: Tue, 18 Sep 2018 10:51:58 +0000
System Time: Tue, 18 Sep 2018 10:51:58 +0000
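
The comparison can also be scripted. This is a small sketch, assuming both values are copied from the Date/Time section as RFC 2822 strings like the ones above. `clock_skew_seconds` and `offsets_match` are illustrative helper names, not part of Nagios XI.

```python
from email.utils import parsedate_to_datetime


def clock_skew_seconds(php_time, system_time):
    """Return the absolute difference in seconds between two RFC 2822 dates."""
    php = parsedate_to_datetime(php_time)
    system = parsedate_to_datetime(system_time)
    return abs((php - system).total_seconds())


def offsets_match(php_time, system_time):
    """Return True when both date strings carry the same UTC offset."""
    php = parsedate_to_datetime(php_time)
    system = parsedate_to_datetime(system_time)
    return php.utcoffset() == system.utcoffset()


if __name__ == "__main__":
    php_time = "Tue, 18 Sep 2018 10:51:58 +0000"
    system_time = "Tue, 18 Sep 2018 10:51:58 +0000"
    print("skew (s):", clock_skew_seconds(php_time, system_time))
    print("same offset:", offsets_match(php_time, system_time))
```

This only confirms that the two clocks agree with each other at the moment the page was rendered; it says nothing about whether either clock was correct when the checks stopped.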
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: Nagios stops working

Post by npolovenko »

@odino2016, I agree. It looks like Nagios had already stopped scheduling checks at that point.
Feel free to PM me with your system profile if this issue comes up again.
*If you send me a PM, please also post something in this forum thread to bring it back up in the support queue.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Nagios stops working

Post by ssax »

Ticket received. Locking this thread; we will continue support in the ticket.