Hello,
This morning I attempted to restore from a backup. It turns out that the backup was corrupt. Despite this, the restore script kept running and totally destroyed our Nagios install.
Here is the output from the restore script:
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
TS=1581959036
Extracting backup to /store/backups/nagiosxi/1581959036-restore...
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
In /store/backups/nagiosxi/1581959036-restore/nagiosxi.1581926402...
Backup files look okay. Preparing to restore...
Shutting down services...
Stopping nagios: done.
Stopping ndo2db: done.
NPCD Stopped.
Restoring directories to /...
Restoring Nagios Core...
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
Restoring Nagios XI...
Restoring NagiosQL...
Restoring NagiosQL backups...
Restoring NRDP backups...
Restoring MRTG...
Restoring SNMP configuration files...
Restoring SNMP MIBs...
Restoring nagios home dir...
Restoring MySQL databases...
./restore_xi.sh: line 261: /store/backups/nagiosxi/1581959036-restore/nagiosxi.1581926402/mysql/nagios.sql: No such file or directory
Error restoring MySQL database 'nagios' - check the password in this script!
Huge chunks of Nagios and NagiosXI are now missing, the database is gone, and we're totally down.
I will leave aside the obvious questions of why the restore script performs no error checking and why our backups are silently corrupt.
Is there ANY way to recover from this?
-- Mike Beebe
Restore from a corrupted backup has killed Nagios
Re: Restore from a corrupted backup has killed Nagios
A couple of questions.
Do you have any other backup tar files? Or is this a virtual machine with a snapshot you can restore to, or do you have a VM-level backup?
Also, can you create a ticket for this issue?
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Restore from a corrupted backup has killed Nagios
Hi Mbellerue,
Before I begin, I want you to know I'm not mad at you or any of the other helpful Nagios employees who've offered invaluable information and advice to me and my company. I appreciate this more than you can know. I know you guys are doing your best, and your best routinely goes beyond excellent. For this, I thank you and the others.
On the issue at hand, we couldn't wait any longer for a response. We were forced to rebuild the entire Nagios instance from scratch. We lost almost everything.
We discovered the root cause was two-fold:
1. The Nagios backup script performs no basic error checking and when it fails, it does so silently. In our case, we were running out of disk space during the backup. Instead of aborting the process, the script writes a corrupted backup.
2. The restore script performs destructive actions without providing a fallback in case of failure. In our case, the error about the corrupted backup was simply ignored and the script continued to remove large portions of the Nagios/Nagiosxi file structure with no hope of recovery.
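The two failure modes above can be guarded against with a few checks. The following is only a minimal sketch in Python, not the actual Nagios XI backup or restore scripts: the function names (`archive_is_readable`, `write_backup`, `require_valid_archive`) are hypothetical, and it assumes the backup is a single gzip-compressed tar file. It checks free space before writing, reads the finished archive back before keeping it, and refuses to start a restore from an archive that cannot be read to the end.

```python
import shutil
import tarfile
import zlib
from pathlib import Path


def archive_is_readable(archive: Path) -> bool:
    """Return True only if every file in the .tar.gz can be read to the end."""
    try:
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar:
                if member.isfile():
                    stream = tar.extractfile(member)
                    # Read in chunks; a truncated archive raises here.
                    while stream.read(1024 * 1024):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False


def write_backup(source_dir: Path, archive: Path) -> None:
    """Write source_dir to archive, aborting instead of leaving a corrupt file."""
    # Uncompressed size is a conservative estimate of the space required.
    needed = sum(p.stat().st_size for p in source_dir.rglob("*") if p.is_file())
    free = shutil.disk_usage(archive.parent).free
    if free < needed:
        raise RuntimeError(f"not enough space: need {needed} bytes, have {free}")

    # Hypothetical naming: write to a temporary file first, rename on success.
    partial = archive.with_name(archive.name + ".partial")
    try:
        with tarfile.open(partial, "w:gz") as tar:
            tar.add(source_dir, arcname=source_dir.name)
        if not archive_is_readable(partial):
            raise RuntimeError(f"verification failed for {partial}")
        partial.replace(archive)
    finally:
        # Never leave a half-written archive behind.
        partial.unlink(missing_ok=True)


def require_valid_archive(archive: Path) -> None:
    """Call this before stopping services or deleting any live files."""
    if not archive_is_readable(archive):
        raise RuntimeError(f"{archive} is corrupt; restore aborted, nothing changed")
```

The point of the design is ordering: every check that can fail runs before any destructive step, so a bad archive stops the restore while the existing install is still intact.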
We rebuilt the system yesterday and were able to perform a snapshot restore from the previous day; however, this did not restore the users or their data. They will have to be re-added by hand.
In the future, we will file tickets or use your phone support system.
Might as well close this ticket as there's nothing else that really needs to be said.
-- Mike Beebe
benjaminsmith
Re: Restore from a corrupted backup has killed Nagios
Hi Mike,
I really appreciate your detailed feedback on what happened and I've passed this information along to the development team.
Going forward, if you need any assistance getting things back online, just let us know. Regarding opening a ticket versus posting in the forum: if the system is down, we often escalate it to a ticket for faster resolution and a possible remote session.