Scheduling queue freezes

aisebouma · Post by **aisebouma** » Fri Dec 18, 2015 3:37 am

We are running Nagios Core 4.1.1 on Ubuntu Server 14.04 under VMware. About once a month the scheduling queue freezes, and even restarting Nagios does not resolve it; I also have to delete retention.dat.

The log suddenly shows:
[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...
[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...
[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...
[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...
[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...
[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...
...

The server uses NTP to keep the clock up to date, and the hardware clock shows no large deviation either.
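One check that can be run on an excerpt like this is to compare each reported jump with the time elapsed since the previous warning line. The sketch below does only that arithmetic on the lines quoted above; the function names are made up for illustration, and the result says nothing by itself about the cause.

```python
import re

# Warning lines copied from the nagios.log excerpt in this post.
LOG_EXCERPT = """\
[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...
[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...
[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...
[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...
[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...
[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...
"""

PATTERN = re.compile(r"^\[(\d+)\] Warning: A system time change of (\d+) seconds")


def parse_time_changes(text):
    """Return (log_timestamp, reported_jump_seconds) for each warning line."""
    entries = []
    for line in text.splitlines():
        match = PATTERN.match(line)
        if match:
            entries.append((int(match.group(1)), int(match.group(2))))
    return entries


def gaps_vs_jumps(entries):
    """Pair each reported jump with the gap since the previous warning line."""
    rows = []
    for (prev_ts, _), (ts, jump) in zip(entries, entries[1:]):
        rows.append((ts, jump, ts - prev_ts))
    return rows


if __name__ == "__main__":
    for ts, jump, gap in gaps_vs_jumps(parse_time_changes(LOG_EXCERPT)):
        print(ts, jump, gap, "match" if jump == gap else "differ")
```

For the six lines quoted here, every reported jump after the first equals the gap since the previous warning line.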

rkennedy · Post by **rkennedy** » Fri Dec 18, 2015 10:22 am

What resources do you have allocated to this virtual machine? How many hosts / service checks are running?

aisebouma · Post by **aisebouma** » Fri Dec 18, 2015 11:07 am

rkennedy wrote:What resources do you have allocated to this virtual machine? How many hosts / service checks are running?

3 GB of memory, more than enough disk space, and 1 processor. The average CPU load is 10%.

It checks 66 hosts and 873 services.

hsmith · Post by **hsmith** » Fri Dec 18, 2015 2:52 pm

Which logs are you checking for information when this happens?

jolson · Post by **jolson** » Fri Dec 18, 2015 2:53 pm

Are you running pnp4nagios on this server?

Code:

ps -ef | grep npcd

If not, these time deviations are abnormal.

Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.
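To illustrate what "detecting" a time change without controlling it can look like, here is a minimal sketch that compares the wall clock with a monotonic clock between two samples. This is not Nagios's actual implementation; `clock_jump`, `watch`, and the threshold values are hypothetical names and numbers chosen for the example.

```python
import time


def clock_jump(prev_wall, prev_mono, now_wall, now_mono):
    """Seconds the wall clock moved beyond what the monotonic clock says elapsed."""
    return (now_wall - prev_wall) - (now_mono - prev_mono)


def watch(interval=5.0, threshold=2.0, iterations=3):
    """Sample both clocks every `interval` seconds and report large disagreements."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    for _ in range(iterations):
        time.sleep(interval)
        now_wall, now_mono = time.time(), time.monotonic()
        jump = clock_jump(prev_wall, prev_mono, now_wall, now_mono)
        if abs(jump) > threshold:
            direction = "forwards" if jump > 0 else "backwards"
            print(f"system time change of {abs(jump):.0f}s {direction} detected")
        prev_wall, prev_mono = now_wall, now_mono
```

The observer only reads the clocks; whatever moved the wall clock is external to it.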

aisebouma · Post by **aisebouma** » Mon Dec 21, 2015 7:32 am

hsmith wrote:Which logs are you checking for information when this happens?

/usr/local/nagios/var/nagios.log

aisebouma · Post by **aisebouma** » Mon Dec 21, 2015 7:34 am

jolson wrote:Are you running pnp4nagios on this server?
Code:
ps -ef | grep npcd
If not, these time deviations are abnormal.

Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.

No, I am not running pnp4nagios.

First of all, a 1500-second time deviation is not normal for NTP.

Second, why would nagios stop processing the scheduled queue after a time change?

Third, why is it not robust enough to restart processing the queue when nagios is restarted?

rkennedy · Post by **rkennedy** » Mon Dec 21, 2015 1:25 pm

Can you post the output of the following command for us to look at?

ntpstat

aisebouma · Post by **aisebouma** » Tue Dec 22, 2015 5:23 am

rkennedy wrote:Can you post the result of the following command for us to look at? ntpstat

Sure:

root@tibet:~# ntpstat
synchronised to NTP server (10.116.11.1) at stratum 4
time correct to within 388 ms
polling server every 1024 s

tmcdonald · Post by **tmcdonald** » Tue Dec 22, 2015 4:38 pm

aisebouma wrote:First of all, a 1500-second time deviation is not normal for NTP.

We agree; @jolson mentioned this above.

aisebouma wrote:Second, why would nagios stop processing the scheduled queue after a time change?

Nagios uses in-memory timestamps to determine when the next checks should run, when notifications should be sent, and so on. When these timestamps skew from the system clock, Nagios logs the "Compensating" message shown above. All of these timestamps are periodically written to disk in retention.dat, which is used to store state between restarts. This is why deleting that file causes the queue to refresh.
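Before deleting retention.dat, it can be useful to look at the saved timestamps first. The sketch below assumes the file stores scheduling times as `next_check=<epoch seconds>` lines inside `host {` and `service {` blocks; the sample text, host name, and the one-day horizon are made up for illustration.

```python
import re

NEXT_CHECK = re.compile(r"^\s*next_check=(\d+)\s*$")


def suspicious_next_checks(retention_text, now, horizon=86400):
    """Return next_check values scheduled more than `horizon` seconds after `now`."""
    hits = []
    for line in retention_text.splitlines():
        match = NEXT_CHECK.match(line)
        if match and int(match.group(1)) > now + horizon:
            hits.append(int(match.group(1)))
    return hits


# Hypothetical sample, not taken from a real retention.dat.
SAMPLE = """\
service {
host_name=example-host
service_description=PING
next_check=1449015930
}
service {
host_name=example-host
service_description=HTTP
next_check=1449134987
}
"""

if __name__ == "__main__":
    print(suspicious_next_checks(SAMPLE, now=1449015630))
```

On a real system the text would come from the retention file (for example via `open(path).read()`) and `now` from `time.time()`.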

aisebouma wrote:Third, why is it not robust enough to restart processing the queue when nagios is restarted?

As mentioned above, the retention.dat file is what restores state after a restart, so if the times saved in that file are off, they will still be off when Nagios restarts. This is a drawback, to be sure, but the alternative is that nothing is saved across a restart and everything is essentially re-checked from square one. State retention can be turned on or off in the configuration, but its benefits far outweigh the drawbacks. At any rate, a little skew over time is usually taken in stride, but a sudden 6-hour jump is harder to handle consistently.
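For reference, the retention behaviour is controlled by directives in nagios.cfg such as `retain_state_information`, `state_retention_file`, `retention_update_interval`, and `use_retained_scheduling_info`. The following sketch pulls those settings out of a config file's text so they can be reviewed; the sample values and path are illustrative, not taken from the poster's system.

```python
RETENTION_KEYS = (
    "retain_state_information",
    "state_retention_file",
    "retention_update_interval",
    "use_retained_scheduling_info",
)


def retention_settings(cfg_text):
    """Return the retention-related key=value pairs found in nagios.cfg text."""
    settings = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in RETENTION_KEYS:
            settings[key] = value.strip()
    return settings


# Illustrative sample; real values depend on the installation.
SAMPLE_CFG = """\
# STATE RETENTION
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
use_retained_scheduling_info=1
log_file=/usr/local/nagios/var/nagios.log
"""

if __name__ == "__main__":
    for key, value in retention_settings(SAMPLE_CFG).items():
        print(f"{key} = {value}")
```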

We did recently see another user with a similar issue, but theirs occurred every weekend and the clock skew was not nearly as consistent (yours seems to be at least in the 6h range). We still haven't found what caused it on their system, and they did a lot of work up front to rule things out. A couple of things I would like to ask of you:

Are you running mod_gearman?
How precise is the "once a month" estimate? Is it on a particular calendar day? Week day?
Does anything in your environment/network occur around the time that this happens?
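To help answer the calendar-day and weekday question, the epoch timestamps in the nagios.log warnings can be converted to dates. A minimal sketch, assuming UTC is acceptable for a first look (the server's local time zone is not stated in the thread); `describe` is a hypothetical helper name.

```python
from datetime import datetime, timezone

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def describe(epoch_seconds):
    """Format an epoch timestamp as 'Weekday YYYY-MM-DD HH:MM:SS UTC'."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{DAY_NAMES[moment.weekday()]} {moment:%Y-%m-%d %H:%M:%S} UTC"


if __name__ == "__main__":
    # First warning timestamp from the log excerpt at the top of the thread.
    print(describe(1449015630))
```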

Nagios Support Forum
