Scheduling queue freezes

jfrickson · Post by **jfrickson** » Tue Dec 22, 2015 4:40 pm

When it freezes, is the cpu usage high?

aisebouma wrote:About once a month the scheduling queue freezes and even restarting nagios does not resolve it, I also have to delete retention.dat.

Send us a copy of the retention.dat. If deleting it resolves it, there might be something in there to point me to where the problem lies. A copy of a normally running retention.dat would also help.

Also, does anything special or unusual happen about once a month? [EDIT] Trevor beat me to this question!

aisebouma · Post by **aisebouma** » Wed Dec 23, 2015 10:49 am

tmcdonald wrote:
aisebouma wrote:First of all a 1500 seconds time deviation is not normal for ntp.
We agree, @jolson mentioned this above.

aisebouma wrote:Second, why would nagios stop processing the scheduled queue after a time change?
Nagios uses in-memory timestamps to determine when the next checks should run, when notifications should be sent, etc. When these skew from the system clock, you can see the "Compensating" message above. All these timestamps are periodically written to disk in retention.dat, which is used to store state between restarts. This is why deleting that file causes the queue to refresh.
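
To illustrate the idea of compensating for a clock change, here is a minimal Python sketch. It is not Nagios's actual code: the function names, the 60-second threshold, and the sample host/service name are all made up for illustration. It compares elapsed wall-clock time against elapsed monotonic time (for example, readings from `time.time()` and `time.monotonic()`), and if the two disagree by more than the threshold, shifts every scheduled timestamp by the difference.

```python
def detect_clock_jump(prev_wall, prev_mono, now_wall, now_mono, threshold=60.0):
    """Return the wall-clock jump in seconds, or 0.0 if it is within threshold.

    A monotonic clock is not affected by NTP steps or manual date changes,
    so a difference between elapsed wall time and elapsed monotonic time
    is treated here as a clock jump. The threshold is an assumed value.
    """
    wall_elapsed = now_wall - prev_wall
    mono_elapsed = now_mono - prev_mono
    jump = wall_elapsed - mono_elapsed
    return jump if abs(jump) > threshold else 0.0


def shift_schedule(next_checks, jump):
    """Return a copy of the schedule with every timestamp shifted by jump."""
    return {name: ts + jump for name, ts in next_checks.items()}


if __name__ == "__main__":
    # Simulated readings: 10 s really passed, but the wall clock moved back 6 h.
    jump = detect_clock_jump(
        prev_wall=1_450_000_000.0, prev_mono=100.0,
        now_wall=1_450_000_000.0 + 10 - 6 * 3600, now_mono=110.0,
    )
    schedule = {"web01/HTTP": 1_450_000_300.0}  # hypothetical entry
    print(jump, shift_schedule(schedule, jump))
```

If the in-memory schedule is not shifted like this after a large jump backwards, every entry looks like it is due hours in the future, which from the outside would look like a frozen queue.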

aisebouma wrote:Third, why is it not robust enough to restart processing the queue when nagios is restarted?
As mentioned above, the retention.dat file is what causes state to be restored on a reboot, and if the times saved in that file are off then the time will be off when nagios restarts. This is a drawback to be sure, but the alternative is that nothing gets saved on a restart and everything essentially gets re-checked from square one. This can be configured on or off, but the benefits far outweigh the drawbacks. At any rate, a little skew over time is usually dealt with in stride, but 6 hours suddenly is harder to deal with consistently.
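
One way to check whether a saved retention.dat carries stale times is to scan it for entries scheduled implausibly far ahead. The Python sketch below is only an illustration and rests on assumptions: that the file consists of `type { key=value ... }` blocks, that blocks carry `host_name`, `service_description`, and `next_check` (epoch seconds) fields, and that one hour ahead is a sensible cut-off. Adjust `max_ahead` so it exceeds your longest check interval.

```python
import re
import time

# Assumed block header format, e.g. "service {"
BLOCK_START = re.compile(r"^\s*(\w+)\s*\{\s*$")


def parse_blocks(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for line in text.splitlines():
        stripped = line.strip()
        if block_type is None:
            m = BLOCK_START.match(line)
            if m:
                block_type, fields = m.group(1), {}
        elif stripped == "}":
            yield block_type, fields
            block_type = None
        elif "=" in stripped:
            key, _, value = stripped.partition("=")
            fields[key] = value


def suspicious_next_checks(text, now=None, max_ahead=3600):
    """Return entries whose next_check lies more than max_ahead seconds after now."""
    now = time.time() if now is None else now
    hits = []
    for block_type, fields in parse_blocks(text):
        raw = fields.get("next_check", "")
        if raw.isdigit() and int(raw) - now > max_ahead:
            name = fields.get("host_name", "?")
            svc = fields.get("service_description")
            label = name if svc is None else f"{name}/{svc}"
            hits.append((block_type, label, int(raw)))
    return hits


if __name__ == "__main__":
    # Made-up sample data, not taken from a real retention.dat.
    sample = """
host {
host_name=web01
next_check=1300
}
service {
host_name=web01
service_description=HTTP
next_check=100000
}
"""
    print(suspicious_next_checks(sample, now=1000))
```

Running it against both a "frozen" retention.dat and a normal one, with `now` set to the time each file was saved, would show whether the frozen file has its checks scheduled hours ahead.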

We did see another user recently have a similar issue, but his was every weekend and the clock skew was not nearly as consistent (yours seems to be at least in the 6h range). We still haven't found what caused it on his system, and they did a lot of work up-front on their end to rule things out. A couple things I would like to ask of you:

Are you running mod_gearman?
How precise is the "once a month" estimate? Is it on a particular calendar day? Week day?
Does anything in your environment/network occur around the time that this happens?

Those are the technical answers. I do not think it is the right functional solution (at least in my case). I do not run mod_gearman. The "once a month" estimate is not very precise; I will check whether I can find a pattern.

rkennedy · Post by **rkennedy** » Wed Dec 23, 2015 11:57 am

Let us know if you need any further assistance from our team.

aisebouma · Post by **aisebouma** » Thu Dec 24, 2015 4:07 am

jfrickson wrote:When it freezes, is the cpu usage high?

aisebouma wrote:About once a month the scheduling queue freezes and even restarting nagios does not resolve it, I also have to delete retention.dat.
Send us a copy of the retention.dat. If deleting it resolves it, there might be something in there to point me to where the problem lies. A copy of a normally running retention.dat would also help.

Also, does anything special or unusual happen about once a month? [EDIT] Trevor beat me to this question!

Here are the dates it happened:
01-13-2015
01-24-2015
03-05-2015
03-09-2015
03-11-2015
03-17-2015
05-13-2015
08-06-2015
11-19-2015
11-25-2015
12-08-2015
Seems pretty random to me; it happens even on Sundays, when our business is closed.
I gave up on Nagios in May, but started again with a new version in August. Unfortunately, the problem was not solved.

I will try to send you a copy of retention.dat via pm
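
For anyone who wants to check the dates listed above for a pattern, a short Python sketch follows. It assumes the dates are in MM-DD-YYYY format (the first entry cannot be read as DD-MM) and simply tallies weekdays and the number of days between consecutive incidents.

```python
from collections import Counter
from datetime import datetime

# Incident dates as posted, assumed to be MM-DD-YYYY.
INCIDENTS = [
    "01-13-2015", "01-24-2015", "03-05-2015", "03-09-2015",
    "03-11-2015", "03-17-2015", "05-13-2015", "08-06-2015",
    "11-19-2015", "11-25-2015", "12-08-2015",
]

# Fixed English names so the output does not depend on the locale.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

dates = [datetime.strptime(d, "%m-%d-%Y").date() for d in INCIDENTS]
weekdays = [WEEKDAY_NAMES[d.weekday()] for d in dates]
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

if __name__ == "__main__":
    for d, name in zip(dates, weekdays):
        print(d.isoformat(), name)
    print("Incidents per weekday:", dict(Counter(weekdays)))
    print("Days between incidents:", gaps)
```

The weekday tally and the gaps can then be compared against backup windows, patch schedules, or NTP logs for the same days.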

hsmith · Post by **hsmith** » Mon Dec 28, 2015 11:16 am

Let us know when you have sent the file. Thanks.

jfrickson · Post by **jfrickson** » Mon Dec 28, 2015 12:17 pm

@hsmith I got the files in email. Checking them out.

aisebouma · Post by **aisebouma** » Wed Jan 06, 2016 5:54 am

Did the retention.dat files uncover a possible cause of the problem?

tmcdonald · Post by **tmcdonald** » Wed Jan 06, 2016 6:13 pm

Haven't gotten an answer from our Dev yet, but with the holidays we haven't been in the office much until this week.

In the meantime though, did you ever find out if anything was happening on the network when this occurs? Backups, scans, updates, reboots, anything that might cause some interruption or slowness?

aisebouma · Post by **aisebouma** » Thu Jan 07, 2016 5:36 am

tmcdonald wrote:Haven't gotten an answer from our Dev yet, but with the holidays we haven't been in the office much until this week.

In the meantime though, did you ever find out if anything was happening on the network when this occurs? Backups, scans, updates, reboots, anything that might cause some interruption or slowness?

No, it seems to occur completely at random.

rkennedy · Post by **rkennedy** » Thu Jan 07, 2016 5:45 pm

I found a post here that may correlate to the issue you are experiencing - https://support.nagios.com/forum/viewto ... 10#p112580

Can you post the contents of the file below for us to review?

/etc/sysctl.conf
