Fixing Retention.dat when Last State Change is in the Future
Posted: Tue Apr 23, 2019 5:35 pm
I had a situation where the Last State Change value was something silly like 1/18/2039. Maybe I was mucking with the internal clock, who knows.
Whatever I did, Nagios did not like it.
Hosts were being checked, but their up or down state changes would not trigger the correct actions. I believe this was because retention.dat held an incorrect last state change.
I found the file in the archives from when I had mucked with the clock:
Code:
[916955650] SERVICE NOTIFICATION: admin-email;web-server;Web;OK;notify-by-email;HTTP OK: HTTP/1.1 200 OK - 218 bytes in 0.251 second response time
[1548107486] Warning: A system time change of 631151828 seconds (7304d 23h 57m 8s forwards in time) has been detected. Compensating...
So I went through the file and added exactly 631151828 seconds to each timestamp that started with "[91", using a quick vim macro.
I saved the file, put it back in place, and restarted Nagios. But alas, everything was still broken. So I looked in retention.dat and found stuff like:
Code:
last_check=2179253410
Well, I didn't want to lose all my historical data, and I couldn't find an easy, documented way to regenerate retention.dat from the data in the archives.
So using a bit of math, I found that if I subtracted 631151828 (the number of seconds my server had jumped ahead) from the "last_check", "last_state_change", and "last_hard_state_change" values in retention.dat, they came out more sane. For example, 2179253410 - 631151828 = 1548101582, which lands back in January 2019.
I made a backup of both the original and the newly generated retention.dat, put the modified one in place, et voilà, things are working again!
Hope this will help someone else with a similar issue.
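If you'd rather script the subtraction than do it by hand, here is a rough Python sketch of the same idea. It is not what I actually ran, and a few things in it are assumptions: the function names `shift_line` and `fix_retention` are made up for illustration, the `.bak` backup name is arbitrary, and the rule "only touch values that are still in the future" is a guard added here so that timestamps written after the clock was fixed are left alone. Swap in the offset from your own "system time change" warning.

```python
import re
import shutil
import time
from pathlib import Path

# Seconds the clock jumped, copied from the Nagios "system time change" warning.
OFFSET = 631151828

# Matches lines such as "last_check=2179253410", with optional leading whitespace.
LINE_RE = re.compile(
    r"^(\s*)(last_check|last_state_change|last_hard_state_change)=(\d+)(\s*)$"
)


def shift_line(line, offset=OFFSET, cutoff=None):
    """Subtract offset from one of the three timestamp fields.

    Assumption: only values later than cutoff (default: now) are bogus,
    so zeros and timestamps that are already sane stay untouched.
    """
    if cutoff is None:
        cutoff = int(time.time())
    match = LINE_RE.match(line)
    if match is None:
        return line
    indent, key, value, tail = match.groups()
    if int(value) <= cutoff:
        return line
    return f"{indent}{key}={int(value) - offset}{tail}"


def fix_retention(path, offset=OFFSET, cutoff=None):
    """Rewrite the file in place, keeping a .bak copy; return lines changed."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    # latin-1 plus newline="" round-trips arbitrary bytes and line endings.
    with open(path, encoding="latin-1", newline="") as handle:
        lines = handle.readlines()
    fixed = [shift_line(line, offset, cutoff) for line in lines]
    with open(path, "w", encoding="latin-1", newline="") as handle:
        handle.writelines(fixed)
    return sum(1 for old, new in zip(lines, fixed) if old != new)
```

Call it as `fix_retention("retention.dat")` on a copy first and diff the result before putting it in place. As far as I know, Nagios rewrites retention.dat when it shuts down, so stop the service before swapping the file in, or your edits may get overwritten.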