Nagios DB Crash Help

jbennett · Post by **jbennett** » Mon Sep 23, 2013 8:48 am

I came in this AM to a notification that the DB couldn't be reached by Nagios.

I checked disk space and saw that my RAM disk had run out of space. In the past, I've run into issues where the logentries and lognotifications tables are very large (perhaps due to the number of checks we are running: 19,000+).

I have resolved this in the past by repairing the database and by running the command to truncate those two tables.

Upon doing that this time around though, I am getting the following errors:

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_contactnotificationmethods.MYI'
Data records: 116497
- Fixing index 1
/usr/bin/myisamchk: Can't create/write to file '/tmp/STk6hcJB' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_contactnotificationmethods.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 14461
- Fixing index 1
- Fixing index 2
/usr/bin/myisamchk: Can't create/write to file '/tmp/ST2Y2yuf' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_objects.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_statehistory.MYI'
Data records: 116310
- Fixing index 1
/usr/bin/myisamchk: Can't create/write to file '/tmp/ST5EMbsL' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_statehistory.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

If I attempt to run the myisamchk for one of the above mentioned tables, I see the following:

Code: Select all

[nagios]# cd /var/lib/mysql/nagios
[nagios]# myisamchk -r -f nagios_objects.MYI
- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 14461
- Fixing index 1
- Fixing index 2
myisamchk: Can't create/write to file '/tmp/STqGoX5y' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_objects.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

When I check the log, I see the following as the latest entry:

Code: Select all

130923  8:27:07 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_objects' is marked as crashed and last (automatic?) repair failed

I have tried restarting the service, and also stopping nagios and mysqld immediately upon reboot and then running the db repair, only to get the same error (28). When I check disk space at this point, it appears that I still have space:

Code: Select all

[nagios]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                       48G   38G  7.5G  84% /
/dev/mapper/VolGroup00-LogVol00
                      3.0G  1.6G  1.2G  58% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                      5.7G  3.6G  1.9G  66% /var
/dev/hda1             190M   47M  134M  26% /boot
tmpfs                 5.9G     0  5.9G   0% /dev/shm
tmpfs                 125M   64M   62M  52% /var/nagiosramdisk

I tried running the commands outlined later on in the repair DB document for the above mentioned tables.

Code: Select all

service mysqld stop
cd /var/lib/mysql/nagios
myisamchk -r -f nagios_<corrupted_table>
service mysqld start
rm -f /usr/local/nagiosxi/var/dbmaint.lock
/usr/local/nagiosxi/cron/dbmaint.php

I was not able to use the --force option successfully. When I used the --safe-recover (-o) option, it appeared to go through OK, but I'm still getting errors when I try to run the repair tool again.
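
Editor's note: Errcode 28 in the output above is the operating system's "No space left on device" error, raised while myisamchk writes its sort files under /tmp. myisamchk accepts a `--tmpdir` option (and honours the `TMPDIR` environment variable) to put those files elsewhere. The following is a minimal sketch, not a tested procedure for this system: `pick_tmpdir` is a hypothetical helper, and the candidate directories and size threshold in the usage line are assumptions.

```shell
#!/bin/sh
# pick_tmpdir MIN_KB DIR...
# Prints the first DIR that reports at least MIN_KB kilobytes available
# and accepts a small test write; returns 1 if none qualifies.
pick_tmpdir() {
    min_kb="$1"; shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # POSIX-format df: column 4 of the second line is available KB.
        avail=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
        [ "$avail" -ge "$min_kb" ] 2>/dev/null || continue
        # df can look healthy while writes still fail, so probe with a real file.
        probe=$(mktemp "$dir/probe.XXXXXX" 2>/dev/null) || continue
        rm -f "$probe"
        echo "$dir"
        return 0
    done
    return 1
}
```

With mysqld stopped as in the steps above, it might be used like this (the 1 GB threshold is an arbitrary assumption): `myisamchk -r -f --tmpdir="$(pick_tmpdir 1048576 /var/tmp /dev/shm)" nagios_objects.MYI`. A small probe write succeeding does not guarantee that a large sort file will fit.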

Post by **lmiltchev** » Mon Sep 23, 2013 10:14 am

What are the permissions on the /tmp?

Code: Select all

ll -d /tmp

jbennett · Post by **jbennett** » Mon Sep 23, 2013 10:30 am

Code: Select all

# ll -d /tmp
drwxrwxrwt 6 root root 20676608 Sep 23 10:00 /tmp

EDIT:

When I tried to copy the NagiosXI-FixPerms.sh script to the /tmp directory, I got an error that there is no space left on the device.

However, per the following, I still have plenty of space?

Code: Select all

]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                       48G   36G  9.5G  79% /
/dev/mapper/VolGroup00-LogVol00
                      3.0G  1.6G  1.2G  58% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                      5.7G  3.7G  1.8G  68% /var
/dev/hda1             190M   47M  134M  26% /boot
tmpfs                 5.9G     0  5.9G   0% /dev/shm
tmpfs                 125M   51M   75M  41% /var/nagiosramdisk
10.100.3.220:/kickstart
                      190G  143G   38G  80% /kickstart

EDIT 2: This led me to take a look at the /tmp folder. Turns out it was full, but it wasn't showing as full (?). I removed all of the orphaned checks and was able to successfully repair the database. Nagios is back up and running now.

However, I would like to know what caused this to happen in the first place as it's not the first time.

Any ideas based upon what I've already come across?

Post by **lmiltchev** » Mon Sep 23, 2013 11:19 am

Can you show us the command you ran to copy the script to the /tmp, and the output of it? Also, show the output of:

Code: Select all

df -i
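
Editor's note: `df -h` reports block usage only. A filesystem can also return "No space left on device" when it runs out of inodes, which is what `df -i` shows. A minimal sketch that prints both figures for one mount point; `fs_usage` is a hypothetical helper name, and the column position assumes POSIX-format `df -P` output as produced by GNU df.

```shell
#!/bin/sh
# fs_usage MOUNTPOINT
# Prints the block and inode use percentages for one filesystem.
# Assumes df -P output, where column 5 of the second line is the use percentage.
fs_usage() {
    mp="${1:-/tmp}"
    blocks=$(df -P "$mp" | awk 'NR==2 {print $5}')
    inodes=$(df -Pi "$mp" | awk 'NR==2 {print $5}')
    echo "$mp blocks=$blocks inodes=$inodes"
}
```

Calling `fs_usage /tmp` prints a single line, which makes it easy to log both values side by side over time.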

jbennett · Post by **jbennett** » Mon Sep 23, 2013 11:28 am

lmiltchev wrote: Can you show us the command you ran to copy the script to the /tmp, and the output of it? Also, show the output of:
Code: Select all
df -i

Just moving it from my folder on the server to the /tmp folder:

Code: Select all

# mv NagiosXI-FixPerms.sh /tmp

It is far easier for me to download the script, scp it to my folder on the server (I cannot directly copy to /tmp) then change permissions than it is for me to open a ticket for the server guys to configure the proxy and download it directly. It is generally far quicker (by days) as well.

Code: Select all

# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                     12799776  188490 12611286    2% /
/dev/mapper/VolGroup00-LogVol00
                      793600      25  793575    1% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                     1540096   90411 1449685    6% /var
/dev/hda1              50200      52   50148    1% /boot
tmpfs                1538707       1 1538706    1% /dev/shm
tmpfs                1538707      27 1538680    1% /var/nagiosramdisk
10.100.3.220:/kickstart
                     51216384  270850 50945534    1% /kickstart

abrist · Post by **abrist** » Mon Sep 23, 2013 3:33 pm

When /tmp is full, it is usually due to check* files. This means that at some point, checks were getting saved to disk, but not reaped. The most common cause is passive checks that continue to get written to /tmp while the nagios process is stopped. This can also happen if you had multiple nagios parent processes running concurrently.

jbennett · Post by **jbennett** » Mon Sep 23, 2013 4:08 pm

And what is the best way to clear these? Just going through the /tmp directory and manually deleting them?

Being that this happened over the weekend, when no one else was in the system, I'm wondering if this might have been caused by the Nagios process getting stuck.

It seems that I'm still running into issues with incredibly high CPU usage since this all happened. I'll routinely be over 90% now.

abrist · Post by **abrist** » Mon Sep 23, 2013 4:13 pm

The best way to remove the checks is to use a wildcard:

Code: Select all

rm -f /tmp/check*
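
Editor's note: when the directory holds a very large number of files, the shell may fail to expand `/tmp/check*` and `rm` reports "Argument list too long". A hedged alternative is to let `find` do the matching. This sketch assumes a find that supports `-maxdepth` and `-delete` (GNU and BSD find do) and that the leftover files all match `check*` directly under the directory; `purge_checks` is a hypothetical helper name.

```shell
#!/bin/sh
# purge_checks DIR
# Deletes regular files named check* directly under DIR (default /tmp)
# and prints how many were removed. find matches the names itself, so the
# shell never has to expand the wildcard.
purge_checks() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type f -name 'check*' -print -delete | wc -l
}
```

Running `purge_checks /tmp` leaves subdirectories and files with other names untouched.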

jbennett · Post by **jbennett** » Tue Sep 24, 2013 7:13 am

I would still like to know why this is happening in the first place.

I came in this morning and the system is completely locked up. I can't even access it via console.

From yesterday, I noticed that /dev/shm was full. My understanding is that this is swap memory?

I was thinking that Nagios would finish processing and swap usage would drop back down.

What can I check to see what is causing everything to start building up and not process?

abrist · Post by **abrist** » Tue Sep 24, 2013 9:30 am

Do you use a ramdisk? (This can fill up a small /tmp partition quickly if npcd has an issue.)

Checks get reaped by nagios, so unreaped "check*" files are usually due to nagios not running. You may get some hints as to what caused this by looking at your logs for the evening when things failed. Additionally, check the Audit Log in XI for any changes made late that night.
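
Editor's note: one way to catch the buildup before /tmp fills is to count check* files that have sat unreaped for a while. A minimal sketch, assuming a find that supports `-maxdepth` and `-mmin`; `stale_checks` is a hypothetical helper, and the 10-minute default is an arbitrary assumption rather than a Nagios setting.

```shell
#!/bin/sh
# stale_checks DIR MINUTES
# Counts check* files directly under DIR (default /tmp) that were last
# modified more than MINUTES minutes ago (default 10), i.e. results that
# were written but have not been reaped.
stale_checks() {
    dir="${1:-/tmp}"
    minutes="${2:-10}"
    find "$dir" -maxdepth 1 -type f -name 'check*' -mmin "+$minutes" | wc -l
}
```

A cron job could call `stale_checks /tmp 10` and alert when the count stays above zero, which would also give a timestamp for when reaping stopped.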

Nagios Support Forum
