retention.dat corruption

morabanc
Posts: 199
Joined: Tue Jul 10, 2012 8:14 am

retention.dat corruption

Post by morabanc »

Hello all. In one of our Nagios Core deployments, we have problems caused by corruption of the retention.dat file after electrical failures and sudden power-offs of the server.

The main symptoms are :

- Checks not being updated and not matching reality (false alarms)
- Notifications not matching reality

From my self-taught experience, the solution is simply to delete the file and wait for all the checks to start again from scratch.

My questions are:

Can you confirm that these symptoms are general for Nagios and not just for my installation?

Can you confirm that I am proceeding adequately by just deleting the file?

Are there more files (status.dat, etc.) prone to corruption that I might have to check?

Is there a way of making retention.dat more robust in the face of power failures?

Thanks a lot

Regards.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: retention.dat corruption

Post by sreinhardt »

can you confirm that these symptoms are general for nagios and not just for my installation?
Yes, this has been seen before. Much like the database issues seen with XI, if retention.dat was being written when the power went off, it may end up corrupted.
can you confirm that I am proceeding adequately by just deleting the file?
Unfortunately, yes; this is one of the few ways to resolve it.
are there more files (status.dat, etc.) prone to get corrupted that I might have to check?
If you are using just Nagios Core with no database backends, yes, that is all that should get corrupted. A few performance data files that have not yet been reaped might also have issues, but otherwise nothing else that I know of. If /usr/local/nagios/var/spool/xidpe, /usr/local/nagios/var/spool/perfdata, and /usr/local/nagios/var/spool/checkresults are being reaped correctly and emptying out from time to time, that should be a pretty clear indicator that things are working as expected.
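One way to watch those spool directories is a small script that counts the files in each and flags the ones that have been sitting around too long. This is only a rough sketch: the function names and the 10-minute threshold are my own assumptions, not Nagios settings, and the paths are simply the ones listed above.

```python
import os
import time

# Spool directories named in the reply above; adjust to your installation.
SPOOL_DIRS = [
    "/usr/local/nagios/var/spool/xidpe",
    "/usr/local/nagios/var/spool/perfdata",
    "/usr/local/nagios/var/spool/checkresults",
]


def spool_backlog(directory, max_age_seconds=600, now=None):
    """Return (file_count, stale_count) for one spool directory.

    A file counts as stale when its modification time is older than
    max_age_seconds (a hypothetical threshold, not a Nagios option).
    """
    now = time.time() if now is None else now
    total = 0
    stale = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            total += 1
            mtime = entry.stat(follow_symlinks=False).st_mtime
            if now - mtime > max_age_seconds:
                stale += 1
    return total, stale


def report(dirs=SPOOL_DIRS):
    """Print one summary line per spool directory."""
    for d in dirs:
        if not os.path.isdir(d):
            print(f"{d}: not present")
            continue
        total, stale = spool_backlog(d)
        print(f"{d}: {total} file(s), {stale} older than the threshold")


if __name__ == "__main__":
    report()
```

If the stale count keeps growing between runs, the files are probably not being reaped.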
is there a way of making retention.dat more robust in the face of power failures?
Offhand, the only thought I have is to cron a backup of it to another location on your filesystem or to a remote device. However, the backup file may suffer from similar issues, so this would need to be tested, or the nagios service would need to be stopped prior to the backup.
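A backup job along those lines could look roughly like the sketch below, which could be run from cron. The function name, the timestamped naming scheme, and the number of copies kept are all my own assumptions. The copy is written to a temporary name, flushed to disk, and only then renamed, so a power cut during the backup should not leave a half-written file under a valid backup name. It does not address the other concern, that the source may be copied while Nagios is in the middle of writing it; stopping the service first remains the safe option.

```python
import os
import shutil
import time


def backup_file(source, backup_dir, keep=5, stamp=None):
    """Copy source into backup_dir under a timestamped name.

    Keeps only the newest `keep` copies (keep should be at least 1)
    and returns the path of the copy just made.
    """
    os.makedirs(backup_dir, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    prefix = os.path.basename(source) + "."
    final_path = os.path.join(backup_dir, prefix + stamp)
    tmp_path = final_path + ".tmp"

    # Write under a temporary name and force the data to disk first.
    with open(source, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())
    # The rename makes the finished copy visible in one step.
    os.replace(tmp_path, final_path)

    # Timestamped names sort oldest first, so drop from the front.
    backups = sorted(
        name for name in os.listdir(backup_dir)
        if name.startswith(prefix) and not name.endswith(".tmp")
    )
    excess = max(len(backups) - keep, 0)
    for name in backups[:excess]:
        os.remove(os.path.join(backup_dir, name))
    return final_path
```

For example, `backup_file("/usr/local/nagios/var/retention.dat", "/var/backups/nagios")` would be a typical call, with both paths depending on your installation.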
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
morabanc
Posts: 199
Joined: Tue Jul 10, 2012 8:14 am

Re: retention.dat corruption

Post by morabanc »

Thanks a lot.

Then I would just suggest an improvement to the daemon: something at startup that detects a corrupt retention.dat file and clears it if necessary.

As the first reaction of a Nagios admin when there are problems is normally to restart the daemon, the problem would be fixed.

We have had hours of "nightmare" because of this issue.

I take the opportunity to congratulate the Nagios team for such a great product and community.

Thanks again

Regards


slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: retention.dat corruption

Post by slansing »

I'd recommend opening a Nagios Core feature request for this if you'd like to see it included. Please visit:

tracker.nagios.org
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: retention.dat corruption

Post by sreinhardt »

I can't disagree with adding some form of check on the retention.dat file. However, completely removing it whenever any form of corruption is detected is not likely the correct answer. I am sure some would prefer to keep minor amounts of past results despite other issues that might persist. I would highly suggest posting a bug to tracker.nagios.org in reference to this and seeing what comes of it.
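To give an idea of what such a check might look like outside the daemon, here is a minimal sketch that reports problems without deleting anything. It rests on an assumption about the file layout that should be verified against your own retention.dat: blocks opened by a line such as `host {`, closed by a line containing only `}`, with `key=value` lines in between and `#` comment lines. The function name is hypothetical, and this is not how Nagios itself validates the file.

```python
def check_retention_structure(text):
    """Return a list of structural problems; an empty list means none found.

    Assumes the layout described above (blocks of key=value lines
    between 'name {' and '}'), which is an assumption, not a spec.
    """
    problems = []
    open_block = None  # (block name, line number) while inside a block
    for lineno, raw in enumerate(text.split("\n"), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "\x00" in line:
            # Runs of NUL bytes are a common sign of an interrupted write.
            problems.append(f"line {lineno}: NUL byte found")
        elif line.endswith("{") and "=" not in line:
            if open_block is not None:
                problems.append(
                    f"line {lineno}: new block inside '{open_block[0]}' "
                    f"opened at line {open_block[1]}"
                )
            open_block = (line[:-1].strip(), lineno)
        elif line == "}":
            if open_block is None:
                problems.append(f"line {lineno}: '}}' without an open block")
            open_block = None
        elif open_block is None:
            problems.append(f"line {lineno}: data outside any block")
        elif "=" not in line:
            problems.append(f"line {lineno}: expected key=value")
    if open_block is not None:
        problems.append(
            f"end of file: block '{open_block[0]}' opened at line "
            f"{open_block[1]} is never closed"
        )
    return problems


def check_retention_file(path):
    """Read a file leniently (bad bytes are replaced) and check it."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return check_retention_structure(f.read())
```

A wrapper script could run this before starting Nagios and, if the list is not empty, move the file aside for inspection rather than removing it outright.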