retention.dat corruption
Posted: Fri Nov 29, 2013 3:27 am
by morabanc
Hello all, in one of our deployments of Nagios Core, we have problems because the retention.dat file gets corrupted by electrical failures and sudden power-offs of the server.
The main symptoms are:
- Checks not being updated and not matching reality (false alarms)
- Notifications not matching reality
From my self-taught experience, the solution is to just delete the file and wait for all the checks to start again from scratch.
My questions are:
Can you confirm that these symptoms are general for Nagios and not just for my installation?
Can you confirm that I am proceeding adequately by just deleting the file?
Are there more files (status.dat, etc.) prone to corruption that I might have to check?
Is there a way of making retention.dat more robust in the face of power failures?
Thanks a lot
Regards.
Re: retention.dat corruption
Posted: Sat Nov 30, 2013 11:05 am
by sreinhardt
can you confirm that these symptoms are general for nagios and not just for my installation?
Yes, this has been seen before. Much like XI-related database issues, if retention.dat was being written when the power went off, the file may end up corrupted.
can you confirm that I am proceeding adequately by just deleting the file?
Unfortunately, yes; this is one of the few ways to resolve it.
are there more files (status.dat , etc ...) prone to get corrupted that I might have to check?
If you are using just Nagios Core with no database backends, yes, that is all that should get corrupted. A few performance data files that have not yet been reaped might also have issues, but otherwise nothing else that I know of. If /usr/local/nagios/var/spool/xidpe, /usr/local/nagios/var/spool/perfdata, and /usr/local/nagios/var/spool/checkresults are being reaped correctly and emptying out from time to time, that should be a pretty clear indicator that everything is working as expected.
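As a rough way to watch those spool directories, here is a minimal Python sketch that lists files older than a given age. The function name `stale_files` and the 10-minute threshold are assumptions made for illustration, not anything Nagios ships; files that linger well past your check and perfdata intervals would suggest that reaping has stalled.

```python
import os
import time

# Directories named in the reply above; adjust them for your install.
SPOOL_DIRS = [
    "/usr/local/nagios/var/spool/xidpe",
    "/usr/local/nagios/var/spool/perfdata",
    "/usr/local/nagios/var/spool/checkresults",
]


def stale_files(directory, max_age_seconds, now=None):
    """Return the sorted names of regular files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.name)
    return sorted(stale)


# Example use (600 seconds is an arbitrary, hypothetical threshold):
# for d in SPOOL_DIRS:
#     if os.path.isdir(d):
#         print(d, stale_files(d, 600))
```

An empty list for each directory on most runs is what you would hope to see; the same few names showing up run after run would be worth a closer look.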
is there a way of making retention.dat more robust in the face of power failures?
Offhand, the only thought I have is to cron a backup of it to another location on your filesystem or to a remote device. However, the backup file may suffer from similar issues, so this would need to be tested, or the nagios service would need to be stopped prior to the backup.
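One possible shape for such a cron job, sketched in Python under a few assumptions: the helper name `backup_file`, the timestamped naming scheme, and the number of copies kept are all made up for illustration. The copy is written to a temporary file, flushed to disk, and only then renamed, so a power cut during the backup should not leave a half-written backup under a final name. It cannot tell whether the source was mid-write when it was read, which is why several generations are kept rather than one.

```python
import os
import shutil
import tempfile
import time


def backup_file(source, backup_dir, keep=5):
    """Copy source into backup_dir under a timestamped name, keeping the newest `keep` copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    os.makedirs(backup_dir, exist_ok=True)
    base = os.path.basename(source)
    # Write to a hidden temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=backup_dir, prefix=".tmp-" + base)
    try:
        with os.fdopen(fd, "wb") as tmp, open(source, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        final = os.path.join(backup_dir, f"{base}.{time.time_ns()}")
        os.replace(tmp_path, final)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Prune old generations; the names sort chronologically by their timestamp suffix.
    copies = sorted(n for n in os.listdir(backup_dir) if n.startswith(base + "."))
    for name in copies[:-keep]:
        os.unlink(os.path.join(backup_dir, name))
    return final


# Example call; both paths are assumptions, adjust them to your install:
# backup_file("/usr/local/nagios/var/retention.dat", "/var/backups/nagios")
```

Restoring would then mean stopping nagios, copying the newest backup that looks sane over retention.dat, and starting the service again.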
Re: retention.dat corruption
Posted: Tue Dec 03, 2013 5:25 am
by morabanc
Thanks a lot.
Then I would just suggest an improvement to the daemon: something at startup that detects a corrupt retention.dat file and clears it if necessary.
As the first reaction of a Nagios admin in case of problems is normally to restart the daemon, the problem would be fixed.
We have had hours of "nightmare" because of this issue.
I take the opportunity to congratulate the Nagios team for such a great product and community.
Thanks again
Regards
Re: retention.dat corruption
Posted: Tue Dec 03, 2013 11:36 am
by slansing
I'd recommend opening a Nagios Core feature request for this if you'd like to see it included. Please visit:
tracker.nagios.org
Re: retention.dat corruption
Posted: Tue Dec 03, 2013 11:39 am
by sreinhardt
I can't disagree with adding some form of check on the retention.dat file. However, completely removing it when any form of corruption is detected is not likely the correct answer. I am sure some would prefer to keep minor amounts of past results despite other issues that might persist. I would highly suggest posting a bug to tracker.nagios.org about this and seeing what may come of it.
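Until something like that exists in the daemon, a pre-start check could be sketched along these lines in Python. It is only a heuristic and rests on an assumption about the file layout: blocks that open with a line such as `host {` and close with a line containing only `}`. The names `looks_intact` and `quarantine_if_damaged` are hypothetical. In line with the point above, a damaged file is renamed aside rather than deleted, so whatever past results it still holds can be inspected or salvaged later.

```python
import os
import re
import time

# Assumed layout: a block opens with "name {" on its own line and closes with "}".
BLOCK_OPEN = re.compile(r"^\w+ \{$")


def looks_intact(path):
    """Heuristic: file is non-empty, has no NUL bytes, and every block is closed."""
    with open(path, "rb") as handle:
        data = handle.read()
    if not data.strip() or b"\x00" in data:
        return False
    in_block = False
    blocks = 0
    for line in data.decode("utf-8", errors="replace").split("\n"):
        line = line.strip()
        if BLOCK_OPEN.match(line):
            if in_block:  # a new block started before the previous one closed
                return False
            in_block = True
        elif line == "}":
            if not in_block:  # stray closing brace
                return False
            in_block = False
            blocks += 1
    return blocks > 0 and not in_block


def quarantine_if_damaged(path):
    """Rename a damaged file aside (never delete it); return the new path, or None if intact."""
    if looks_intact(path):
        return None
    aside = f"{path}.corrupt.{int(time.time())}"
    os.replace(path, aside)
    return aside
```

This would be run while nagios is stopped, just before starting it. A file that passes can still hold wrong values, so treat a pass as "not obviously truncated" rather than "verified".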