Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
aisebouma wrote:About once a month the scheduling queue freezes and even restarting nagios does not resolve it, I also have to delete retention.dat.
Send us a copy of the retention.dat. If deleting it resolves it, there might be something in there to point me to where the problem lies. A copy of a normally running retention.dat would also help.
Also, does anything special or unusual happen about once a month? [EDIT] Trevor beat me to this question!
aisebouma wrote:Second, why would nagios stop processing the scheduled queue after a time change?
Nagios uses in-memory timestamps to determine when the next checks should run, when notifications should be sent, etc. When these timestamps skew from the system clock, you see the "Compensating" message above. All these timestamps are periodically written to disk in retention.dat, which is used to store state between restarts. This is why deleting that file causes the queue to refresh.
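To see whether the saved timestamps really are off, one option is to scan retention.dat for `next_check` values that sit far from the current time. The following is only a minimal sketch: it assumes the file consists of `host { ... }` and `service { ... }` blocks of `key=value` lines with epoch-second `next_check` fields, and the helper names, the sample data, and the one-hour threshold are all made up for illustration.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Assumed block header form, e.g. "service {" or "host {".
BLOCK_RE = re.compile(r"^(\w+)\s*\{\s*$")


def parse_blocks(text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type = None
    fields: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            match = BLOCK_RE.match(line)
            if match:
                block_type, fields = match.group(1), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def suspicious_next_checks(
    text: str, now: int, max_skew: int = 3600
) -> List[Tuple[str, str, int]]:
    """Return (host, service, offset_seconds) where next_check is far from now."""
    hits = []
    for block_type, fields in parse_blocks(text):
        if block_type not in ("host", "service"):
            continue
        next_check = int(fields.get("next_check", "0"))
        if next_check == 0:
            continue  # nothing scheduled for this object
        offset = next_check - now
        if abs(offset) > max_skew:  # hypothetical threshold, tune as needed
            hits.append(
                (fields.get("host_name", "?"),
                 fields.get("service_description", ""),
                 offset)
            )
    return hits


# Made-up sample: the service is scheduled about 6 hours ahead of "now".
SAMPLE = """\
service {
host_name=web01
service_description=HTTP
last_check=1449561600
next_check=1449583500
}
host {
host_name=web01
next_check=1449561900
}
"""

if __name__ == "__main__":
    print(suspicious_next_checks(SAMPLE, now=1449561600))
```

Against a real file you would read the path configured as the state retention file and pass `int(time.time())` as `now`; a large, uniform offset across many objects would point at the clock jump rather than at individual checks.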
aisebouma wrote:Third, why is it not robust enough to restart processing the queue when nagios is restarted?
As mentioned above, the retention.dat file is what causes state to be restored on a restart, and if the times saved in that file are off, the scheduling will still be off when nagios restarts. This is a drawback to be sure, but the alternative is that nothing gets saved on a restart and everything essentially gets re-checked from square one. State retention can be turned on or off (retain_state_information in nagios.cfg), but the benefits far outweigh the drawbacks. At any rate, a little skew over time is usually taken in stride, but a sudden jump of 6 hours is harder to deal with consistently.
We did see another user recently with a similar issue, but theirs happened every weekend and the clock skew was not nearly as consistent (yours seems to be at least in the 6h range). We still haven't found what caused it on their system, and they did a lot of work up-front on their end to rule things out. A couple of things I would like to ask of you:
Are you running mod_gearman?
How precise is the "once a month" estimate? Is it on a particular calendar day? Week day?
Does anything in your environment/network occur around the time that this happens?
Those are the technical answers. I do not think it is the right functional solution (at least in my case). I do not run mod_gearman. The "once a month" estimate is not very precise; I will check if I can find a pattern.
jfrickson wrote:When it freezes, is the cpu usage high?
aisebouma wrote:About once a month the scheduling queue freezes and even restarting nagios does not resolve it, I also have to delete retention.dat.
Send us a copy of the retention.dat. If deleting it resolves it, there might be something in there to point me to where the problem lies. A copy of a normally running retention.dat would also help.
Also, does anything special or unusual happen about once a month? [EDIT] Trevor beat me to this question!
Here are the dates it happened:
01-13-2015
01-24-2015
03-05-2015
03-09-2015
03-11-2015
03-17-2015
05-13-2015
08-06-2015
11-19-2015
11-25-2015
12-08-2015
Seems pretty random to me, happens even on Sundays when our business is closed.
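A quick way to check the dates above for a weekday or interval pattern is to tally them. This is just a minimal sketch, assuming the dates are in MM-DD-YYYY format (the 01-13 entry rules out DD-MM); the function names are made up for illustration.

```python
from collections import Counter
from datetime import date, datetime
from typing import List

# Dates copied from the list above, assumed to be MM-DD-YYYY.
FREEZE_DATES = [
    "01-13-2015", "01-24-2015", "03-05-2015", "03-09-2015",
    "03-11-2015", "03-17-2015", "05-13-2015", "08-06-2015",
    "11-19-2015", "11-25-2015", "12-08-2015",
]

# Fixed names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_dates(raw: List[str]) -> List[date]:
    """Parse MM-DD-YYYY strings into sorted date objects."""
    return sorted(datetime.strptime(item, "%m-%d-%Y").date() for item in raw)


def weekday_counts(dates: List[date]) -> Counter:
    """Count how many incidents fell on each weekday."""
    return Counter(WEEKDAYS[d.weekday()] for d in dates)


def gaps_in_days(dates: List[date]) -> List[int]:
    """Number of days between each pair of consecutive incidents."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


if __name__ == "__main__":
    parsed = parse_dates(FREEZE_DATES)
    for name in WEEKDAYS:
        print(f"{name:<9} {weekday_counts(parsed)[name]}")
    print("gaps (days):", gaps_in_days(parsed))
```

The same tally could be compared against cron schedules, backup windows, or NTP sync logs to see whether any of them line up with the incident dates.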
I gave up on Nagios in May, but started again with a new version in August. Unfortunately, the problem was not solved.
I will try to send you a copy of retention.dat via PM.
Haven't gotten an answer from our Dev yet, but with the holidays we haven't been in the office much until this week.
In the meantime though, did you ever find out if anything was happening on the network when this occurs? Backups, scans, updates, reboots, anything that might cause some interruption or slowness?