retention.dat corruption

morabanc · Post by **morabanc** » Fri Nov 29, 2013 3:27 am

Hello all, in one of our deployments of Nagios Core we have problems caused by corruption of the retention.dat file, due to electrical failures and sudden power-off of the server.

The main symptoms are :

- Check results not being updated and not corresponding to reality (false alarms)
- Notifications not corresponding to reality

From my self-taught experience, the solution is just to delete the file and wait for all the checks to start again from scratch.

My questions are:

Can you confirm that these symptoms are general for Nagios and not just for my installation?

Can you confirm that I am proceeding adequately by just deleting the file?

Are there more files (status.dat, etc.) prone to corruption that I might have to check?

Is there a way of making retention.dat more robust in the face of power failures?

Thanks a lot

Regards.

sreinhardt · Post by **sreinhardt** » Sat Nov 30, 2013 11:05 am

can you confirm that these symptoms are general for nagios and not just for my installation?

Yes, this has been seen before. Much like the XI-related database issues, if retention.dat was being written to when the power went off, the file may end up corrupted.

can you confirm that I am proceeding adequately by just deleting the file?

Unfortunately yes, this is one of the few ways to resolve it.

are there more files (status.dat , etc ...) prone to get corrupted that I might have to check?

If you are using just Nagios Core with no database backends, then yes, that is all that should get corrupted. You might have a few performance data files with issues that have yet to be reaped, but otherwise nothing else that I know of. Checking that /usr/local/nagios/var/spool/xidpe, /usr/local/nagios/var/spool/perfdata, and /usr/local/nagios/var/spool/checkresults are being reaped correctly and emptying out from time to time should be a pretty clear indicator that everything is working as expected.
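One way to watch those spool directories is a small script that lists files which have been sitting there too long. This is only a sketch: the function name `stale_files` and the 15-minute threshold are assumptions for illustration, not Nagios settings, and a directory that does not exist on a given install is simply treated as empty.

```python
import os
import time

# Spool directories named in the post above; adjust for your install.
SPOOL_DIRS = [
    "/usr/local/nagios/var/spool/xidpe",
    "/usr/local/nagios/var/spool/perfdata",
    "/usr/local/nagios/var/spool/checkresults",
]


def stale_files(directory, max_age_seconds=900, now=None):
    """Return sorted names of regular files older than max_age_seconds.

    The 900-second default is an arbitrary illustrative threshold.
    """
    now = time.time() if now is None else now
    try:
        entries = list(os.scandir(directory))
    except FileNotFoundError:
        return []  # directory not present on this install
    stale = []
    for entry in entries:
        # Only regular files count; subdirectories are ignored.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.name)
    return sorted(stale)


def report(directories=SPOOL_DIRS):
    """Print how many stale files each spool directory holds."""
    for directory in directories:
        old = stale_files(directory)
        print(f"{directory}: {len(old)} file(s) older than 15 minutes")
```

Calling `report()` from cron every few minutes and alerting when a count keeps growing would show that files are no longer being reaped.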

is there a way of making retention.dat more robust in the face of power failures?

Offhand, the only thought I have is to cron a backup of it to another location on your filesystem or to a remote device. However, this may present similar issues with the backup file, so it would need to be tested, or the nagios service would need to be stopped prior to the backup.
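The cron backup idea could look roughly like the sketch below. The function name `backup_retention`, the timestamped naming scheme, and the `keep` count are assumptions made for illustration. The copy is written to a temporary name and renamed only once it is fully on disk, so a power cut during the backup should not leave a half-written backup under its final name. It does not solve the caveat above: if Nagios is writing retention.dat at the moment of the copy, the backup may still be inconsistent.

```python
import os
import shutil
import time


def backup_retention(src, backup_dir, keep=5):
    """Copy src into backup_dir under a timestamped name.

    Keeps only the newest `keep` backups and returns the new backup's path.
    """
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    final = os.path.join(backup_dir, f"retention.dat.{stamp}")
    tmp = final + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # push the copy to disk before renaming
    os.replace(tmp, final)  # atomic rename within the same filesystem

    # Timestamped names sort chronologically, so the oldest come first.
    backups = sorted(
        name for name in os.listdir(backup_dir)
        if name.startswith("retention.dat.") and not name.endswith(".tmp")
    )
    for old in backups[:max(len(backups) - keep, 0)]:
        os.remove(os.path.join(backup_dir, old))
    return final
```

A cron entry would then call something like `backup_retention("/usr/local/nagios/var/retention.dat", "/var/backups/nagios")`. Both paths are assumptions here; use whatever `state_retention_file` points to in your nagios.cfg.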

morabanc · Post by **morabanc** » Tue Dec 03, 2013 5:25 am

Thanks a lot.

Then I would just suggest an improvement to the daemon: something at startup that detects a corrupt retention.dat file and clears it if necessary.

As the first reaction of a Nagios admin in case of problems is normally to restart the daemon, the problem would then be fixed.

We have had hours of "nightmare" because of this issue.

I take the opportunity to congratulate the Nagios team on such a great product and community.

Thanks again

Regards

slansing · Post by **slansing** » Tue Dec 03, 2013 11:36 am

I'd recommend opening a Nagios Core feature request for this if you'd like to see it included. Please visit:

tracker.nagios.org

sreinhardt · Post by **sreinhardt** » Tue Dec 03, 2013 11:39 am

I can't disagree with adding some form of check on the retention.dat file. However, completely removing the file whenever any form of corruption is detected is probably not the correct answer. I am sure some would prefer to keep a small amount of past results despite other issues that might persist. I would strongly suggest posting a bug to tracker.nagios.org about this and seeing what comes of it.
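Until something like that exists in the daemon, a check could be run by hand or from an init script before starting Nagios. The sketch below only reports problems and never deletes anything, in line with the concern above. It rests on an assumption about the file layout: that retention.dat is a series of blocks such as `host {` ... `}` containing `key=value` lines. The function name `check_retention_structure` is hypothetical and this is not an official validator.

```python
def check_retention_structure(text):
    """Return a list of structural problems; an empty list means none found.

    ASSUMPTION: the file is made of blocks like 'host {' ... '}' that hold
    key=value lines. Adjust if your retention.dat looks different.
    """
    problems = []
    open_block = None  # (name, line number) of the block currently open
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if "\x00" in line:
            # NUL bytes often show up in files cut short by a power loss.
            problems.append(f"line {lineno}: NUL byte")
        elif line.endswith("{") and "=" not in line:
            if open_block:
                problems.append(
                    f"line {lineno}: block opened inside '{open_block[0]}'"
                )
            open_block = (line[:-1].strip(), lineno)
        elif line == "}":
            if not open_block:
                problems.append(f"line {lineno}: '}}' without an open block")
            open_block = None
        elif open_block is None:
            problems.append(f"line {lineno}: content outside any block")
        elif "=" not in line:
            problems.append(f"line {lineno}: expected key=value")
    if open_block:
        problems.append(
            f"block '{open_block[0]}' opened at line {open_block[1]} never closed"
        )
    return problems
```

Reading the file with `open(path, errors="replace").read()` and passing the text in would let an admin see what is wrong and decide whether to restore a backup, trim the damaged block, or delete the file as described earlier in the thread.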

Nagios Support Forum
