Scheduling queue freezes
Posted: Fri Dec 18, 2015 3:37 am
by aisebouma
We are running Nagios Core 4.1.1 on Ubuntu Server 14.04 in a VMware virtual machine. About once a month the scheduling queue freezes, and even restarting Nagios does not resolve it; I also have to delete retention.dat.
The log suddenly shows:
[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...
[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...
[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...
[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...
[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...
[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...
...
The server uses NTP to keep the clock in sync, and the hardware clock shows no large deviation either.
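To line these warnings up against other events on the host, the entries can be pulled out of nagios.log and compared with the time elapsed between them. The following is only a minimal sketch: the regular expression is written against the six lines quoted above, the "backwards" direction is an assumption about how Nagios words the opposite case, and the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Written against the warnings quoted in this post; "backwards" is an assumption.
TIME_CHANGE = re.compile(
    r"^\[(\d+)\] Warning: A system time change of (\d+) seconds "
    r"\(.*?(forwards|backwards) in time\)"
)

SAMPLE_LOG = """\
[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...
[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...
[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...
[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...
[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...
[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...
""".splitlines()


def parse_time_changes(lines):
    """Return (epoch, signed_seconds) for every time-change warning found."""
    events = []
    for line in lines:
        m = TIME_CHANGE.match(line.strip())
        if m is None:
            continue  # not a time-change warning
        epoch, seconds = int(m.group(1)), int(m.group(2))
        sign = 1 if m.group(3) == "forwards" else -1
        events.append((epoch, sign * seconds))
    return events


def compare_with_gaps(events):
    """Pair each reported jump with the seconds elapsed since the previous warning."""
    rows = []
    for (prev_epoch, _), (epoch, reported) in zip(events, events[1:]):
        rows.append((epoch, reported, epoch - prev_epoch))
    return rows


def describe(events):
    """Format each event with a human-readable UTC timestamp."""
    return [
        f"{datetime.fromtimestamp(e, tz=timezone.utc):%Y-%m-%d %H:%M:%S} UTC  {s:+d} s"
        for e, s in events
    ]
```

Calling `compare_with_gaps(parse_time_changes(SAMPLE_LOG))` on the excerpt shows that every reported jump after the first equals the number of seconds since the previous warning was logged (for example 1449040674 - 1449015630 = 25044). That may be worth keeping in mind when deciding whether the clock really moved or whether Nagios simply did not get to look at it in between.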
Re: Scheduling queue freezes
Posted: Fri Dec 18, 2015 10:22 am
by rkennedy
What resources do you have allocated to this virtual machine? How many hosts / service checks are running?
Re: Scheduling queue freezes
Posted: Fri Dec 18, 2015 11:07 am
by aisebouma
rkennedy wrote:What resources do you have allocated to this virtual machine? How many hosts / service checks are running?
3 GB memory, more than enough disk space, and 1 processor. The average CPU load is 10%.
It checks 66 hosts and 873 services.
Re: Scheduling queue freezes
Posted: Fri Dec 18, 2015 2:52 pm
by hsmith
Which logs are you checking for information when this happens?
Re: Scheduling queue freezes
Posted: Fri Dec 18, 2015 2:53 pm
by jolson
Are you running pnp4nagios on this server?
If not, these time deviations are abnormal.
Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.
Re: Scheduling queue freezes
Posted: Mon Dec 21, 2015 7:32 am
by aisebouma
hsmith wrote:Which logs are you checking for information when this happens?
/usr/local/nagios/var/nagios.log
Re: Scheduling queue freezes
Posted: Mon Dec 21, 2015 7:34 am
by aisebouma
jolson wrote:Are you running pnp4nagios on this server?
If not, these time deviations are abnormal.
Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.
No, I am not running pnp4nagios.
First of all a 1500 seconds time deviation is not normal for ntp.
Second, why would nagios stop processing the scheduled queue after a time change?
Third, why is it not robust enough to restart processing the queue when nagios is restarted?
Re: Scheduling queue freezes
Posted: Mon Dec 21, 2015 1:25 pm
by rkennedy
Can you post the result of the following command for us to look at? ntpstat
Re: Scheduling queue freezes
Posted: Tue Dec 22, 2015 5:23 am
by aisebouma
rkennedy wrote:Can you post the result of the following command for us to look at? ntpstat
Sure:
root@tibet:~# ntpstat
synchronised to NTP server (10.116.11.1) at stratum 4
time correct to within 388 ms
polling server every 1024 s
Re: Scheduling queue freezes
Posted: Tue Dec 22, 2015 4:38 pm
by tmcdonald
aisebouma wrote:First of all a 1500 seconds time deviation is not normal for ntp.
We agree; @jolson mentioned this above.
aisebouma wrote:Second, why would nagios stop processing the scheduled queue after a time change?
Nagios uses in-memory timestamps to determine when the next checks should run, when notifications should be sent, and so on. When these skew from the system clock, you see the "Compensating" message above. All of these timestamps are periodically written to disk in retention.dat, which is used to store state between reboots. This is why deleting that file causes the queue to refresh.
aisebouma wrote:Third, why is it not robust enough to restart processing the queue when nagios is restarted?
As mentioned above, the retention.dat file is what causes state to be restored on a reboot, and if the times saved in that file are off then the time will be off when nagios restarts. This is a drawback to be sure, but the alternative is that nothing gets saved on a restart and everything essentially gets re-checked from square one. This can be configured on or off, but the benefits far outweigh the drawbacks. At any rate, a little skew over time is usually dealt with in stride, but 6 hours suddenly is harder to deal with consistently.
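Before deleting retention.dat, it can be useful to see which saved timestamps look wrong. The sketch below is illustrative only: it assumes the file is made of `host { ... }` and `service { ... }` blocks containing `key=value` lines with `host_name`, `service_description` and `next_check` fields, which is how these files typically look but is not confirmed anywhere in this thread. The function name and the one-day horizon are made up for the example.

```python
import time


def find_far_future_checks(lines, now=None, horizon=86400):
    """Return (block_type, name, next_check) for saved checks scheduled
    more than `horizon` seconds after `now`.

    Assumes retention.dat-style input: 'host {' / 'service {' blocks of
    key=value lines, each closed by a line containing only '}'.
    """
    if now is None:
        now = int(time.time())
    suspects = []
    block_type, fields = None, {}
    for raw in lines:
        line = raw.strip()
        if line.endswith("{"):
            # Start of a new block, e.g. "service {"
            block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            if block_type in ("host", "service"):
                value = fields.get("next_check", "")
                next_check = int(value) if value.isdigit() else 0
                if next_check - now > horizon:
                    name = fields.get("host_name", "?")
                    if block_type == "service":
                        name += "/" + fields.get("service_description", "?")
                    suspects.append((block_type, name, next_check))
            block_type, fields = None, {}
        elif block_type is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value
    return suspects
```

One way to use it, assuming the file lives next to the nagios.log path mentioned earlier, is `find_far_future_checks(open("/usr/local/nagios/var/retention.dat"))`. Anything it returns is a check whose saved schedule sits well beyond the current system time.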
We did see another user recently have a similar issue, but theirs was every weekend and the clock skew was not nearly as consistent (yours seems to be at least in the 6h range). We still haven't found what caused it on their system, and they did a lot of work up-front on their end to rule things out. A couple of things I would like to ask of you:
Are you running mod_gearman?
How precise is the "once a month" estimate? Is it on a particular calendar day? Week day?
Does anything in your environment/network occur around the time that this happens?