Hello all, in one of our deployments of Nagios Core, we have problems because the retention.dat file gets corrupted by electrical failures and sudden power-offs of the server.
The main symptoms are:
- Checks not being updated and not corresponding to reality (false alarms)
- Notifications not corresponding to reality
From my self-taught experience, the solution is just to delete the file and wait for all the checks to start again from scratch.
My questions are:
Can you confirm that these symptoms are general for Nagios and not just for my installation?
Can you confirm that I am proceeding adequately by just deleting the file?
Are there more files (status.dat, etc.) prone to get corrupted that I might have to check?
Is there a way of making retention.dat more robust in the face of power failures?
Thanks a lot
Regards.
retention.dat corruption
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: retention.dat corruption
"Can you confirm that these symptoms are general for Nagios and not just for my installation?"
Yes, this has been seen before. Much like XI-related database issues, if retention.dat was being written to when the power went off, it may be corrupted.
"Can you confirm that I am proceeding adequately by just deleting the file?"
Unfortunately yes, this is one of the few ways to resolve it.
"Are there more files (status.dat, etc.) prone to get corrupted that I might have to check?"
If you are using just Nagios Core with no database backends, yes, that is all that should get corrupted. You might have a few performance data files that have yet to be reaped with issues, but otherwise nothing else that I know of. Making sure that /usr/local/nagios/var/spool/xidpe, /usr/local/nagios/var/spool/perfdata, and /usr/local/nagios/var/spool/checkresults are getting reaped correctly and emptying out from time to time should be a pretty clear indicator that it is working as expected.
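One rough way to watch whether those spool directories are emptying out is to count files that have been sitting there longer than some threshold. The sketch below is only an illustration: the helper name `stale_files` and the 15-minute threshold are assumptions, not anything Nagios provides, and a sensible threshold depends on your check and reaper intervals.

```python
import os
import time

# Spool directories named in the post above.
SPOOL_DIRS = [
    "/usr/local/nagios/var/spool/xidpe",
    "/usr/local/nagios/var/spool/perfdata",
    "/usr/local/nagios/var/spool/checkresults",
]


def stale_files(directory, max_age_seconds=900, now=None):
    """Return sorted names of regular files older than max_age_seconds.

    The 900-second default is an arbitrary assumption for illustration.
    """
    now = time.time() if now is None else now
    stale = []
    try:
        entries = list(os.scandir(directory))
    except FileNotFoundError:
        return stale  # directory absent, e.g. xidpe on a Core-only install
    for entry in entries:
        if entry.is_file(follow_symlinks=False):
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.name)
    return sorted(stale)


if __name__ == "__main__":
    for spool_dir in SPOOL_DIRS:
        leftovers = stale_files(spool_dir)
        print(f"{spool_dir}: {len(leftovers)} file(s) older than 15 minutes")
```

A steadily growing count would suggest files are not being reaped; an occasional leftover right after a crash is probably just one of the unreaped files mentioned above.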
"Is there a way of making retention.dat more robust in the face of power failures?"
Offhand, the only thought I have is to cron a backup of it to another location on your filesystem or a remote device. This may, however, present similar issues on the backup file and would need to be tested, or the nagios service would need to be stopped prior to backup.
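As a minimal sketch of that backup idea, the script below copies the file under a temporary name first and only renames it into place once the copy is flushed, so the backup itself should never exist in a half-written state. The function name `backup_file`, the backup directory, the `keep` count and the retention.dat path are all assumptions for illustration (check `state_retention_file` in your nagios.cfg). It does not solve the other caveat: if Nagios is rewriting retention.dat at the moment of the copy, the copy may still be incomplete.

```python
import os
import shutil
import time


def backup_file(source, backup_dir, keep=5, stamp=None):
    """Copy source into backup_dir under a timestamped name; keep the newest `keep` (>= 1) copies."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    base = os.path.basename(source)
    final = os.path.join(backup_dir, f"{base}.{stamp}")
    partial = final + ".partial"

    # Copy under a temporary name so a crash mid-copy never leaves a
    # truncated file under a valid backup name.
    shutil.copy2(source, partial)
    with open(partial, "r+b") as fh:
        os.fsync(fh.fileno())  # push the copied data to disk before renaming
    os.replace(partial, final)

    # Timestamped names sort chronologically, so the oldest come first.
    backups = sorted(
        name for name in os.listdir(backup_dir)
        if name.startswith(base + ".") and not name.endswith(".partial")
    )
    for name in backups[:-keep]:
        os.remove(os.path.join(backup_dir, name))
    return final


if __name__ == "__main__":
    # Both paths are assumptions; adjust them to your installation.
    backup_file("/usr/local/nagios/var/retention.dat",
                "/var/backups/nagios-retention")
```

Run from cron, this would leave a handful of recent copies to fall back on instead of starting every check from scratch.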
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
Re: retention.dat corruption
Thanks a lot.
Then I would just suggest an improvement to the daemon: something at startup that detects a corrupt retention.dat file and clears it if necessary.
As the first reaction of a Nagios admin in case of problems is normally to restart the daemon, the problem would be fixed.
We have had hours of "nightmare" because of this issue.
I take the opportunity to congratulate the Nagios team for such a great product and community.
Thanks again
Regards
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: retention.dat corruption
I'd recommend opening a Nagios Core feature request for this if you'd like to see it included. Please visit:
tracker.nagios.org
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: retention.dat corruption
I can't disagree with adding some form of check to the retention.dat file. However, completely removing it if it detects any form of corruption is not likely the correct answer. I am sure some would prefer to keep minor amounts of past results despite other issues that might persist. I would highly suggest posting a bug to tracker.nagios.org in reference to this and seeing what may come about.
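Until something like that exists in the daemon, an external pre-start check could approximate it. The sketch below is a heuristic only and not a Nagios feature: it assumes retention.dat is made of `name {` ... `}` blocks containing key=value lines, treats an empty file, NUL bytes, or an unclosed block as suspicious, and renames the file aside instead of deleting it so past results can still be inspected. The function names are hypothetical, and it should only be run while Nagios is stopped.

```python
import os
import time


def looks_intact(path):
    """Heuristic: non-empty, no NUL bytes, every 'name {' block closed by '}'."""
    try:
        with open(path, "rb") as fh:
            data = fh.read()
    except OSError:
        return False
    if not data or b"\x00" in data:
        return False
    inside_block = False
    for raw in data.splitlines():
        line = raw.strip()
        # Lines containing '=' are treated as key=value data, never as block markers.
        if b"=" not in line and line.endswith(b"{"):
            if inside_block:
                return False  # new block opened before the previous one closed
            inside_block = True
        elif line == b"}":
            if not inside_block:
                return False  # closing brace without an open block
            inside_block = False
    return not inside_block  # file must not end in the middle of a block


def quarantine_if_corrupt(path, stamp=None):
    """Rename a suspicious file aside; return the new path, or None if nothing was done."""
    if not os.path.exists(path) or looks_intact(path):
        return None
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    aside = f"{path}.corrupt-{stamp}"
    os.replace(path, aside)
    return aside


if __name__ == "__main__":
    # Path is an assumption; check state_retention_file in nagios.cfg.
    moved = quarantine_if_corrupt("/usr/local/nagios/var/retention.dat")
    print(f"moved aside to {moved}" if moved else "nothing to do")
```

A file that passes this check can still contain damaged values, so it is a first filter rather than a guarantee.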