Nagios XI database failure - restore not working

roycroft
Posts: 6
Joined: Fri Sep 23, 2011 11:17 am

Post by roycroft »

Hello,

We're running Nagios XI version 2011R3.3.

At some point in the past week it failed. I've been ill with pneumonia, and can't tell exactly when it failed, but the symptoms are that Nagios is running, all my hosts/services are defined in Nagios Core, but XI doesn't know of anything to monitor. I figured it was a database problem.

I attempted a restore from December 20, a date when I know it was working, using the following command:

# /usr/local/nagiosxi/scripts/restore_xi.sh /store/backups/nagiosxi/1356009682.tar.gz

The restore is hanging attempting to restore the MySQL databases:

TS=1357248362
Extracting backup to /store/backups/nagiosxi/1357248362-restore...
In /store/backups/nagiosxi/1357248362-restore/1356009682...
Backup files look okay. Preparing to restore...
Shutting down services...
Stopping nagios: ..........
Warning - nagios did not exit in a timely manner
Stopping ndo2db: done.
NPCD Stopped.
Restoring directories to /...
Restoring Nagios Core...
rm: cannot remove `/usr/local/nagios': Device or resource busy
Restoring Nagios XI...
Restoring NagiosQL...
Restoring NagiosQL backups...
Restoring MySQL databases...

Further, I've found multiple instances of the dbmaint utility running (not sure if this is relevant or not):

nagios 2693 0.0 0.0 2944 956 ? Ss 09:05 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 2699 0.0 0.3 34824 14908 ? S 09:05 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 3228 0.0 0.0 2944 952 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 3230 0.0 0.3 34824 14804 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 4550 0.0 0.0 2944 948 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 4558 0.0 0.3 34824 14896 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 5493 0.0 0.0 2944 952 ? Ss 01:35 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 5502 0.0 0.3 34824 14800 ? S 01:35 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 6204 0.0 0.0 2944 952 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 6207 0.0 0.3 34824 14804 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 6577 0.0 0.0 2944 948 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 6584 0.0 0.3 34824 14800 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 6894 0.0 0.0 2944 1028 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 6899 0.0 0.3 34824 14896 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 7329 0.0 0.0 2944 948 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 7336 0.0 0.3 34824 14804 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 9046 0.0 0.0 2944 948 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 9049 0.0 0.3 34824 14928 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 10313 0.0 0.0 2944 1032 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 10317 0.0 0.3 34824 14800 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 11515 0.0 0.0 2944 952 ? Ss 02:10 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 11523 0.0 0.3 34824 14800 ? S 02:10 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 13248 0.0 0.0 2944 944 ? Ss 10:25 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 13253 0.0 0.3 34824 14784 ? S 10:25 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 14769 0.0 0.0 2944 956 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 14778 0.0 0.3 34824 14924 ? S Jan02 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
nagios 15071 0.0 0.0 2944 948 ? Ss Jan02 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
...
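
To see whether cron jobs like these are piling up, the listing can be reduced to a single count. This is a minimal sketch, assuming `ps aux`-style input; the helper name `count_dbmaint` is made up for illustration, and the two sample lines are copied from the listing above.

```shell
# count_dbmaint (hypothetical helper): count dbmaint.php PHP processes in
# `ps aux`-style text read from stdin. The "/bin/sh -c" cron wrapper lines
# are skipped so that each job is counted once.
count_dbmaint() {
    awk '/cron\/dbmaint\.php/ && !/\/bin\/sh -c/ { n++ } END { print n + 0 }'
}

# Example with two lines taken from the listing (one wrapper, one worker).
count_dbmaint <<'EOF'
nagios 2693 0.0 0.0 2944 956 ? Ss 09:05 0:00 /bin/sh -c /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php > /usr/local/nagiosxi/var/dbmaint.log 2>&1
nagios 2699 0.0 0.3 34824 14908 ? S 09:05 0:00 /usr/bin/php -q /usr/local/nagiosxi/cron/dbmaint.php
EOF
```

On a live system, `ps aux | count_dbmaint` gives the same kind of count. A number above one suggests that earlier runs have not exited.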

The machine is running CentOS 6.0 (Linux version 2.6.32-71.el6.i686 ([email protected]) (gcc version 4.4.4 20100726 (Red Hat 4.4.4-13) (GCC) ) #1 SMP Fri Nov 12 04:17:17 GMT 2010), running in 32-bit mode.

It is a stand-alone Nagios installation.

Although a long-time Unix user, I have virtually no Linux experience and need help debugging this.

Thanks.
nscott
Posts: 1040
Joined: Wed May 11, 2011 8:54 am

Re: Nagios XI database failure - restore not working

Post by nscott »

Looks like you might be out of disk space. How much disk space do you have left on it?

df -h
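
If the `df` output is long, it can be filtered down to the filesystems that are nearly full. A rough sketch, assuming POSIX-format output from `df -P`; the helper name `full_filesystems` and the 90% threshold are illustrative choices, not anything Nagios XI provides.

```shell
# full_filesystems (hypothetical helper): read `df -P` output on stdin and
# print the mount point and usage of every filesystem at or above the given
# percentage (default 90). Mount points containing spaces are not handled.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Prints nothing when every filesystem is below the threshold.
df -P | full_filesystems 90
```

On a GNU system such as CentOS, running the same filter over `df -P -i` checks inode usage as well, which `df -h` alone does not show.
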
Nicholas Scott
Former Nagios employee
roycroft
Posts: 6
Joined: Fri Sep 23, 2011 11:17 am

Re: Nagios XI database failure - restore not working

Post by roycroft »

Most filesystems are at 1% usage or less. /usr is at 29%. There is no issue with disk space.

Thanks.
nscott
Posts: 1040
Joined: Wed May 11, 2011 8:54 am

Re: Nagios XI database failure - restore not working

Post by nscott »

The log file it's pointing at may yield some information of interest: /usr/local/nagiosxi/var/dbmaint.log. Can you look at that file and send us the last 200 or so lines, if there are that many?

Also, hope the pneumonia is gone for good; I hear that's nasty.
Nicholas Scott
Former Nagios employee
roycroft
Posts: 6
Joined: Fri Sep 23, 2011 11:17 am

Re: Nagios XI database failure - restore not working

Post by roycroft »

Not that interesting, but somewhat telling, I think:

# tail -200 /usr/local/nagiosxi/var/dbmaint.log
LOCKFILE '/usr/local/nagiosxi/var/dbmaint.lock' EXISTS - EXITING!

I can remove that lock file if it's appropriate. If so, should I also kill all the dbmaint processes?
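
Before deleting a lock file like this, it can help to confirm that it really is left over from an old run. The following is a sketch only: `lock_is_stale` is a made-up helper name, the 60-minute cutoff is an arbitrary assumption, and the lock path is the one printed in the log above.

```shell
# lock_is_stale (hypothetical helper): succeed if FILE exists and was last
# modified more than MINUTES minutes ago. Relies on find's -mmin test, which
# GNU and BSD find provide but POSIX does not require.
lock_is_stale() {
    file=$1
    minutes=${2:-60}
    [ -e "$file" ] && [ -n "$(find "$file" -prune -mmin "+$minutes" 2>/dev/null)" ]
}

LOCK=/usr/local/nagiosxi/var/dbmaint.lock   # path from the log message above
if lock_is_stale "$LOCK" 60; then
    echo "$LOCK is more than an hour old"
    ls -l "$LOCK"
else
    echo "$LOCK is missing or recent"
fi
```

An old lock combined with dbmaint processes dating from previous days points to a run that hung while holding the lock, so every later run exits immediately.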

I'm still recovering from the pneumonia - it's gone, but the doctor says it will take about a month to get my strength back. I don't wish it on anyone.

Thanks.
roycroft
Posts: 6
Joined: Fri Sep 23, 2011 11:17 am

Re: Nagios XI database failure - restore not working

Post by roycroft »

Hello,

I've not heard back for a while, and assume you've left for the day.

Since things were fairly broken anyway, and I need to get Nagios XI working again asap, I thought I'd experiment a bit.

I killed all the dbmaint processes, removed the lockfile, and then ran the repairmysql.sh script.
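
For anyone following along, those three steps look roughly like the sketch below. It only prints the commands unless `DRY_RUN=0` is set. The `run` and `recover_dbmaint` names are made up, repairmysql.sh is assumed to live in the same scripts directory as restore_xi.sh, and no arguments are passed to it because none are mentioned here.

```shell
# Sketch of the manual recovery described above. It only prints the commands
# unless DRY_RUN=0 is set; run it as root in that case.
# Assumption: repairmysql.sh sits next to restore_xi.sh in the scripts
# directory and is run without arguments.
DRY_RUN=${DRY_RUN:-1}
XI=/usr/local/nagiosxi

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_dbmaint() {
    run pkill -f "$XI/cron/dbmaint.php"      # stop the stuck dbmaint jobs
    run rm -f "$XI/var/dbmaint.lock"         # remove the stale lock file
    run "$XI/scripts/repairmysql.sh"         # repair the MySQL tables
}

recover_dbmaint
```

Printing the commands first makes it easy to check the paths against the local install before anything is killed or deleted.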

The hosts/services now appear again in the Host Detail/Service Detail, respectively, but they all show the last check on 25 December.

I'm going to give it a bit to see if the checks start up again, but would appreciate further advice on ensuring that everything is OK.

Should I go ahead and do another attempt at a restore, as I attempted earlier, now that the database has been repaired?

Thanks.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios XI database failure - restore not working

Post by scottwilkerson »

You are probably alright now.

One thing to realize is that it is possible there was already a database problem when you made the backup, in which case you would have restored the problem along with it.

You likely just needed the database repaired. The dates should come back now that they can be updated in the DB.
Former Nagios employee
roycroft
Posts: 6
Joined: Fri Sep 23, 2011 11:17 am

Re: Nagios XI database failure - restore not working

Post by roycroft »

Hi.

When I attempted the database restore I used a backup from a date when I know things were working.

FYI, what I ended up doing was what I described above, then I shut down and restarted nagios, and restarted the database engine. I'm getting notifications just fine now, so I think everything is good. Thanks for your help. While I ended up figuring out the final solution myself, you folks nudged me in the right direction, and that saved a lot of time.