
Database Crash and Corrupt

Posted: Wed Dec 18, 2013 3:42 pm
by mikew
I am working on a Nagios XI 2012R2.5 server which was running fine until last night, when the / partition filled up and the database crashed. I do not know where several GB of data came from; I still have to figure that out. But no one can log in.

I checked the MySQL nagios database and it was corrupt, but I fixed that with mysqlcheck and the --repair option. The nagiosql database came out OK. (I used the script in XI to fix the MySQL databases.) However, I suspect the postgresql database is the problem, since no one can log in. What is the best way to fix that database?

I do have backups; I just wanted to try fixing it first.

I am now seeing other issues:
* the permissions on /usr/local/nagios/var/rw/nagios.cmd are:
-rw-rw-r root:nagcmd

The nagios.log shows Nagios is bailing out because the permissions on this file are wrong, so the named pipe is not working.

So I know the permissions are wrong. I have deleted all lock files in the /usr/local/nagios/var/ directory and restarted Nagios, but the permissions revert to the incorrect values. I have also rebooted the server, which did not help.

The system is CentOS 6.x 64-bit
32 vCPU and 12 GB of RAM

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 4:57 pm
by abrist
Are you still out of space, or did you increase the provisioned disk size? If it is still full, we need to clear up space before we do anything else.
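
Two things are worth checking at this point: how full the / partition still is, and which directories hold the space. A minimal sketch, assuming a POSIX shell with the standard df, du, sort, head, and awk tools; the function names percent_used and largest_dirs are made up for illustration and are not part of Nagios XI:

```shell
#!/bin/sh

# Print the percentage of space used on the filesystem that holds DIR.
percent_used() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the COUNT largest directories under DIR, sizes in KB.
# -x keeps du on a single filesystem so other mounts are not counted.
largest_dirs() {
    dir=${1:-/}
    count=${2:-10}
    du -x -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Run as root, something like `percent_used /` followed by `largest_dirs / 15` should point at whatever grew overnight. Log and spool directories are a common place to look first, but that is a guess, not something established in this thread.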

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 4:58 pm
by mikew
It has space now.

grep nag /etc/group
nagios:x:501:nagios,apache
nagcmd:x:502:nagios,apache

Each time nagios is started the permissions on nagios.cmd go back to

-rw-rw-r root:nagcmd

Which of course means it will bail and not work.
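
The leading `-` in that listing means nagios.cmd is currently a regular file; a named pipe would show a leading `p`. A quick sketch for checking what actually sits at that path (check_cmd_pipe is a hypothetical helper name, not a Nagios tool):

```shell
#!/bin/sh

# Report what kind of object exists at PATH.
check_cmd_pipe() {
    path=$1
    if [ ! -e "$path" ]; then
        echo "missing"
    elif [ -p "$path" ]; then
        echo "fifo"            # what Nagios expects for its command file
    elif [ -f "$path" ]; then
        echo "regular file"    # something wrote a plain file here
    else
        echo "other"
    fi
}
```

For example, `check_cmd_pipe /usr/local/nagios/var/rw/nagios.cmd` can be run right after a restart to see whether the pipe was replaced again.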

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 5:05 pm
by abrist
The command pipe permissions should be:

Code:

prw-rw---- 1 nagios nagcmd 0 Dec 18 13:10 /usr/local/nagios/var/rw/nagios.cmd

To change the permissions:

Code:

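cd /usr/local/nagios/var/rw/
# mknod will not overwrite an existing file, so remove the stale one first
rm -f nagios.cmd
# recreate the command file as a named pipe (the trailing "p")
mknod /usr/local/nagios/var/rw/nagios.cmd p
chown nagios:nagcmd nagios.cmd
chmod 660 nagios.cmd
ls -la
service nagios restart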
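
The same steps can be wrapped in a small function that is safe to rerun. This is only a sketch under assumptions: recreate_fifo is a hypothetical name, it uses mkfifo rather than mknod (both create a named pipe), and the ownership change needs root.

```shell
#!/bin/sh

# recreate_fifo PATH [OWNER:GROUP]
# Replace whatever is at PATH with a named pipe of mode 660.
recreate_fifo() {
    path=$1
    owner=${2:-}
    # A stale regular file or old pipe would make mkfifo fail, so drop it.
    rm -f "$path" || return 1
    mkfifo "$path" || return 1
    # The default mode is filtered by umask, so set it explicitly.
    chmod 660 "$path" || return 1
    # Changing ownership is optional here because it requires root.
    if [ -n "$owner" ]; then
        chown "$owner" "$path" || return 1
    fi
}
```

Assuming Nagios is stopped first so that nothing writes to the path in the meantime, usage would look like `service nagios stop`, then `recreate_fifo /usr/local/nagios/var/rw/nagios.cmd nagios:nagcmd`, then `service nagios start`.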

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 5:13 pm
by mikew
As soon as I remove the file, it is recreated by some process. I have done a killall -9 nagios.


So I now have nagios running with the right permissions.

I had to write a script to remove the file and then set the permissions.

But still no one can log in.

I am now able to log in to Nagios Core but not XI.

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 5:45 pm
by abrist
Do you have a non-default umask set on the system? Or SELinux running?

Code:

umask
getenforce

Additionally, did you reboot the server after clearing space? I ask because that is often the easiest way to make sure your services come up. I am available tomorrow to elevate this to a remote session if need be.

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 5:49 pm
by mikew
No SELinux. This system has been running for 6 weeks with no problems. I think the crash has damaged the postgresql database.

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 5:51 pm
by abrist
You can attempt to run a vacuum on it:
http://support.nagios.com/wiki/index.ph ... .22_in_log

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 6:02 pm
by mikew
So I ran those commands... no joy.

Re: Database Crash and Corrupt

Posted: Wed Dec 18, 2013 7:12 pm
by mikew
I was able to get back in by using this script:

/usr/local/nagiosxi/scripts/reset_nagiosadmin_password.php --password=xxxxxx

However, my fear is that there are more permission issues here, so I will do some checking and report back if I find anything more.

Note: When this happened no account would work.