Fixing Retention.dat when Last State Change is in the Future

Posted: Tue Apr 23, 2019 5:35 pm
by ooglek
I had a situation where the Last State Change value was something silly like 1/18/2039. Maybe I was mucking with the internal clock, who knows.

Whatever I did, Nagios did not like it.

Hosts were still being checked, but their up or down state changes would not trigger the correct actions. I believe this was because retention.dat was holding a last state change time that was incorrect (far in the future).

I found the archived log file from when I had mucked with the clock:

Code:

[916955650] SERVICE NOTIFICATION: admin-email;web-server;Web;OK;notify-by-email;HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.251 second response time
[1548107486] Warning: A system time change of 631151828 seconds (7304d 23h 57m 8s forwards in time) has been detected.  Compensating...

So I went through the file and added exactly 631151828 seconds to each timestamp that started with "[91", using a quick vim macro.
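
For anyone who would rather script this than use a vim macro, here is a rough Python sketch of the same idea. It is not what I actually ran; the function name `shift_line` and the "91" prefix check are just illustrative assumptions based on the log lines above, so adjust them to your own archive.

```python
import re

# Offset in seconds, taken from the "system time change" warning in the log.
OFFSET = 631151828

# Matches a leading "[<epoch>]" timestamp at the start of a log line.
STAMP = re.compile(r"^\[(\d+)\]")

def shift_line(line, offset=OFFSET, prefix="91"):
    """Add offset to the leading timestamp, but only if it starts with prefix.

    shift_line and prefix are hypothetical names for this sketch; "91" simply
    mirrors the bad timestamps seen in the archive above.
    """
    m = STAMP.match(line)
    if m is None or not m.group(1).startswith(prefix):
        return line  # leave already-correct or unstamped lines alone
    return "[%d]%s" % (int(m.group(1)) + offset, line[m.end():])

sample = "[916955650] SERVICE NOTIFICATION: admin-email;web-server;Web;OK"
print(shift_line(sample))
```

Run it over a copy of the archive file line by line and compare the result with the original before replacing anything.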

I saved the file, put it back in place, and restarted Nagios. But alas, all the same things were still broken. So I looked in retention.dat and found entries like:

Code:

last_check=2179253410

Well, I didn't want to lose all my historical data, and I couldn't find an easy, documented way to regenerate retention.dat from the data in the archives.

So, using a bit of math, I found that if I subtracted 631151828 (the number of seconds my server had jumped ahead) from the "last_check", "last_state_change" and "last_hard_state_change" values in retention.dat, they came out much more sane. For example, 2179253410 - 631151828 = 1548101582, which is back in January 2019.
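
The subtraction could be scripted along these lines. This is only a sketch under some assumptions: the name `fix_line` is made up for illustration, and the "only touch values later than now" guard is an extra safety check that was not part of the original manual fix (it keeps zero or already-correct values from being shifted).

```python
import re

# Seconds the clock jumped ahead, from the Nagios warning in the log.
OFFSET = 631151828

# The three fields that were adjusted by hand in retention.dat.
KEYS = ("last_check", "last_state_change", "last_hard_state_change")
FIELD = re.compile(r"^(\s*)(%s)=(\d+)(\s*)$" % "|".join(KEYS))

def fix_line(line, now, offset=OFFSET):
    """Subtract offset from a listed field if its value lies after now.

    fix_line is a hypothetical helper; the future-only guard is an assumption
    added for safety, not something the original fix did.
    """
    m = FIELD.match(line)
    if m is None:
        return line  # not one of the fields we care about
    value = int(m.group(3))
    if value <= now:
        return line  # 0 or a sane timestamp, leave it alone
    return "%s%s=%d%s" % (m.group(1), m.group(2), value - offset, m.group(4))

print(fix_line("last_check=2179253410\n", now=1556000000), end="")
```

Apply it to every line of a copy of retention.dat, passing the current epoch time as `now`. Nagios normally rewrites its retention file when it shuts down, so it is probably safest to stop the daemon before swapping the fixed file in.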

I made backups of both the original retention.dat and the newly generated one, put the modified file in place, et voilà, things are working again!

Hope this will help someone else with a similar issue.

Re: Fixing Retention.dat when Last State Change is in the Fu

Posted: Wed Apr 24, 2019 1:34 pm
by benjaminsmith
Hi @ooglek
ooglek wrote: "Hope this will help someone else with a similar issue."
Thanks for posting your solution to the forum!