Scheduling queue freezes
We are running Nagios Core 4.1.1 on Ubuntu Server 14.04 on VMware. About once a month the scheduling queue freezes, and even restarting Nagios does not resolve it; I also have to delete retention.dat.
The log suddenly shows:
[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...
[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...
[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...
[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...
[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...
[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...
...
The server uses NTP to keep the clock up to date, and the hardware clock also shows no large deviation.
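To look at the warnings more closely, the timestamps in brackets can be compared with the jump each warning reports. The following is a minimal Python sketch, assuming the log format shown in the excerpt above; the function names `parse_time_changes` and `gaps_versus_jumps` are hypothetical helpers, not part of Nagios.

```python
import re

# Sample lines copied from the log excerpt above.
LOG_LINES = [
    "[1449015630] Warning: A system time change of 1507 seconds (0d 0h 25m 7s forwards in time) has been detected. Compensating...",
    "[1449040674] Warning: A system time change of 25044 seconds (0d 6h 57m 24s forwards in time) has been detected. Compensating...",
    "[1449065417] Warning: A system time change of 24743 seconds (0d 6h 52m 23s forwards in time) has been detected. Compensating...",
    "[1449087641] Warning: A system time change of 22224 seconds (0d 6h 10m 24s forwards in time) has been detected. Compensating...",
    "[1449112824] Warning: A system time change of 25183 seconds (0d 6h 59m 43s forwards in time) has been detected. Compensating...",
    "[1449134987] Warning: A system time change of 22163 seconds (0d 6h 9m 23s forwards in time) has been detected. Compensating...",
]

# Matches the epoch timestamp and the reported jump in seconds.
PATTERN = re.compile(r"^\[(\d+)\] Warning: A system time change of (\d+) seconds")


def parse_time_changes(lines):
    """Return (timestamp, reported_jump_seconds) for each time-change warning."""
    events = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            events.append((int(match.group(1)), int(match.group(2))))
    return events


def gaps_versus_jumps(events):
    """Pair the seconds elapsed since the previous warning with the reported jump."""
    return [
        (cur_ts - prev_ts, jump)
        for (prev_ts, _), (cur_ts, jump) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for gap, jump in gaps_versus_jumps(parse_time_changes(LOG_LINES)):
        print(f"gap since previous warning: {gap:>6}s  reported jump: {jump:>6}s")
```

For the six lines quoted above, each gap between consecutive warnings equals the jump reported by the later warning. That may be worth keeping in mind when deciding whether the clock really moved or whether something else delayed the check, but it is only one possible reading of the data.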
Re: Scheduling queue freezes
What resources do you have allocated to this virtual machine? How many hosts / service checks are running?
Former Nagios Employee
Re: Scheduling queue freezes
rkennedy wrote: What resources do you have allocated to this virtual machine? How many hosts / service checks are running?

3 GB memory, more than enough disk space, and 1 processor. The average CPU load is 10%.
It checks 66 hosts and 873 services.
Re: Scheduling queue freezes
Which logs are you checking for information when this happens?
Former Nagios Employee.
Re: Scheduling queue freezes
Are you running pnp4nagios on this server?
If not, these time deviations are abnormal.
Code:
ps -ef | grep npcd
Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.
Re: Scheduling queue freezes
hsmith wrote: Which logs are you checking for information when this happens?

/usr/local/nagios/var/nagios.log
Re: Scheduling queue freezes
jolson wrote: Are you running pnp4nagios on this server?
If not, these time deviations are abnormal.
Code:
ps -ef | grep npcd
Nagios just detects the system time change, but has no control over actually changing it. This is _almost certainly_ NTP changing the time of your system for one reason or another.

No, I am not running pnp4nagios.
First of all, a 1500-second time deviation is not normal for NTP.
Second, why would Nagios stop processing the scheduled queue after a time change?
Third, why is it not robust enough to resume processing the queue when Nagios is restarted?
Re: Scheduling queue freezes
Can you post the result of the following command for us to look at? ntpstat
Former Nagios Employee
Re: Scheduling queue freezes
rkennedy wrote: Can you post the result of the following command for us to look at? ntpstat

Sure:
root@tibet:~# ntpstat
synchronised to NTP server (10.116.11.1) at stratum 4
time correct to within 388 ms
polling server every 1024 s
Re: Scheduling queue freezes
aisebouma wrote: First of all, a 1500-second time deviation is not normal for NTP.

We agree; @jolson mentioned this above.
aisebouma wrote: Second, why would Nagios stop processing the scheduled queue after a time change?

Nagios uses in-memory timestamps to determine when the next checks should be run, notifications should be sent, etc. When these skew from the system clock, you see the "Compensating" message above. All these timestamps are periodically written to disk in retention.dat, which is used to store state between restarts. This is why deleting that file causes the queue to refresh.
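As a rough illustration of the behaviour described above, and explicitly not the actual Nagios source, detection and compensation could be sketched like this in Python. The threshold value and the names `detect_time_change` and `compensate` are assumptions made for the example.

```python
# Assumed tolerance in seconds before a gap between two loop iterations
# is treated as a system time change (illustrative value only).
TIME_CHANGE_THRESHOLD = 900


def detect_time_change(last_time, current_time, threshold=TIME_CHANGE_THRESHOLD):
    """Return the apparent clock jump in seconds, or 0 if it is within tolerance.

    A negative result means the clock appears to have moved backwards.
    """
    delta = current_time - last_time
    if delta < 0 or delta >= threshold:
        return delta
    return 0


def compensate(next_checks, delta):
    """Shift every scheduled timestamp by the detected jump."""
    return {name: ts + delta for name, ts in next_checks.items()}
```

In this sketch a scheduler that is simply not run for several hours looks the same as a clock that jumped forward by several hours, because only the difference between two observed times is available.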
aisebouma wrote: Third, why is it not robust enough to restart processing the queue when Nagios is restarted?

As mentioned above, the retention.dat file is what causes state to be restored on a restart, and if the times saved in that file are off, then the times will be off when Nagios restarts. This is a drawback to be sure, but the alternative is that nothing gets saved on a restart and everything essentially gets re-checked from square one. This can be configured on or off, but the benefits far outweigh the drawbacks. At any rate, a little skew over time is usually dealt with in stride, but a sudden 6 hours is harder to deal with consistently.
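Before deleting retention.dat, it may help to check whether the saved timestamps really are off. The sketch below assumes the file consists of `host { ... }` and `service { ... }` blocks containing `key=value` lines, including `host_name`, `service_description` and `next_check`; the function name `find_skewed_next_checks` and the one-day cutoff are hypothetical choices for illustration.

```python
def find_skewed_next_checks(text, now, max_ahead=86400):
    """Return (block_type, host, service, next_check) for host/service entries
    whose next_check lies more than max_ahead seconds after now."""
    results = []
    block_type = None
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            # Start of a new block such as "host {" or "service {".
            block_type = line[:-1].strip()
            fields = {}
        elif line == "}":
            if block_type in ("host", "service"):
                value = fields.get("next_check", "")
                next_check = int(value) if value.isdigit() else 0
                if next_check - now > max_ahead:
                    results.append((
                        block_type,
                        fields.get("host_name"),
                        fields.get("service_description"),
                        next_check,
                    ))
            block_type = None
        elif block_type and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value
    return results
```

It could be called with the contents of the retention file (the path is site-specific; stop Nagios first so the file is not rewritten mid-read) and `int(time.time())` as `now`. Entries it returns are scheduled suspiciously far in the future.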
We did see another user recently have a similar issue, but his was every weekend and the clock skew was not nearly as consistent (yours seems to be at least in the 6h range). We still haven't found what caused it on his system, and they did a lot of work up-front on their end to rule things out. A couple things I would like to ask of you:
Are you running mod_gearman?
How precise is the "once a month" estimate? Is it on a particular calendar day? Week day?
Does anything in your environment/network occur around the time that this happens?
Former Nagios employee