Nagios DB Crash Help

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Post by jbennett »

I came in this AM to a notification that the DB couldn't be reached by Nagios.

I checked disk space and saw that my RAM disk had run out of space. In the past, I've run into issues where the logentries and lognotifications tables grow very large (perhaps due to the number of checks we are running: 19,000+).

I have resolved this in the past by repairing the database and by running the command to truncate those two tables.

Upon doing that this time around though, I am getting the following errors:

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_contactnotificationmethods.MYI'
Data records: 116497
- Fixing index 1
/usr/bin/myisamchk: Can't create/write to file '/tmp/STk6hcJB' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_contactnotificationmethods.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 14461
- Fixing index 1
- Fixing index 2
/usr/bin/myisamchk: Can't create/write to file '/tmp/ST2Y2yuf' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_objects.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag

Code: Select all

- recovering (with sort) MyISAM-table 'nagios_statehistory.MYI'
Data records: 116310
- Fixing index 1
/usr/bin/myisamchk: Can't create/write to file '/tmp/ST5EMbsL' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_statehistory.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag
If I attempt to run myisamchk on one of the above-mentioned tables, I see the following:

Code: Select all

[nagios]# cd /var/lib/mysql/nagios
[nagios]# myisamchk -r -f nagios_objects.MYI
- recovering (with sort) MyISAM-table 'nagios_objects.MYI'
Data records: 14461
- Fixing index 1
- Fixing index 2
myisamchk: Can't create/write to file '/tmp/STqGoX5y' (Errcode: 28)
myisamchk: error: 28 when fixing table
MyISAM-table 'nagios_objects.MYI' is not fixed because of errors
Try fixing it by using the --safe-recover (-o), the --force (-f) option or by not using the --quick (-q) flag
When I check the log, I see the following as the latest entry:

Code: Select all

130923  8:27:07 [ERROR] /usr/libexec/mysqld: Table './nagios/nagios_objects' is marked as crashed and last (automatic?) repair failed
I have tried rebooting the service, stopping nagios and mysqld immediately upon reboot, and running the DB repair, only to get the same error (28). When I check disk space at this point, it appears that I still have space:

Code: Select all

[nagios]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                       48G   38G  7.5G  84% /
/dev/mapper/VolGroup00-LogVol00
                      3.0G  1.6G  1.2G  58% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                      5.7G  3.6G  1.9G  66% /var
/dev/hda1             190M   47M  134M  26% /boot
tmpfs                 5.9G     0  5.9G   0% /dev/shm
tmpfs                 125M   64M   62M  52% /var/nagiosramdisk
I tried running the commands outlined later in the repair DB document for the above-mentioned tables.

Code: Select all

service mysqld stop
cd /var/lib/mysql/nagios
myisamchk -r -f nagios_<corrupted_table>
service mysqld start
rm -f /usr/local/nagiosxi/var/dbmaint.lock
/usr/local/nagiosxi/cron/dbmaint.php
I was not able to use the --force option successfully. However, when I used the --safe-recover (-o) option, the repair appeared to go through OK, but I'm still getting errors when I try to run the repair tool again.
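Errcode 28 in the myisamchk output above is the operating system's ENOSPC, "No space left on device" (MySQL's `perror 28` utility prints the same text), and the failing paths show that the sort files are written to /tmp. myisamchk accepts a `--tmpdir` option, so one possible workaround is to point it at a directory that really accepts writes. The following is only a minimal sketch: the helper names and the candidate directories are illustrative, not part of Nagios XI or MySQL. It probes with a real write instead of trusting df.

```shell
# Sketch: find a scratch directory that really accepts writes before running
# myisamchk. Helper names and candidate directories are illustrative only.

# can_write DIR KBYTES: succeed only if KBYTES KiB can be written inside DIR.
can_write() {
    cw_dir=$1
    cw_kb=$2
    [ -d "$cw_dir" ] || return 1
    cw_probe=$(mktemp "$cw_dir/probe.XXXXXX" 2>/dev/null) || return 1
    if dd if=/dev/zero of="$cw_probe" bs=1024 count="$cw_kb" 2>/dev/null; then
        cw_rc=0
    else
        cw_rc=1
    fi
    rm -f "$cw_probe"
    return $cw_rc
}

# pick_tmpdir KBYTES DIR...: print the first DIR that passes can_write.
pick_tmpdir() {
    pt_kb=$1
    shift
    for pt_dir in "$@"; do
        if can_write "$pt_dir" "$pt_kb"; then
            printf '%s\n' "$pt_dir"
            return 0
        fi
    done
    return 1
}

# Example (not executed here; run with mysqld stopped, as in the steps above):
#   scratch=$(pick_tmpdir 10240 /tmp /var/tmp) &&
#       myisamchk -r -f --tmpdir="$scratch" nagios_objects.MYI
```

The probe size (10240 KiB in the example) is a guess; the space a repair needs depends on the size of the table being rebuilt.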
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios DB Crash Help

Post by lmiltchev »

What are the permissions on /tmp?

Code: Select all

ll -d /tmp
Be sure to check out our Knowledgebase for helpful articles and solutions!
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Nagios DB Crash Help

Post by jbennett »

Code: Select all

# ll -d /tmp
drwxrwxrwt 6 root root 20676608 Sep 23 10:00 /tmp
EDIT:

When I tried to copy the NagiosXI-FixPerms.sh script to the /tmp directory, I got an error that there is no space left on the device.

However, per the following, I still have plenty of space?

Code: Select all

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                       48G   36G  9.5G  79% /
/dev/mapper/VolGroup00-LogVol00
                      3.0G  1.6G  1.2G  58% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                      5.7G  3.7G  1.8G  68% /var
/dev/hda1             190M   47M  134M  26% /boot
tmpfs                 5.9G     0  5.9G   0% /dev/shm
tmpfs                 125M   51M   75M  41% /var/nagiosramdisk
10.100.3.220:/kickstart
                      190G  143G   38G  80% /kickstart
EDIT 2: This led me to take a look at the /tmp folder. It turns out it was full, even though it wasn't showing as full (?). I removed all of the orphaned checks and was able to successfully repair the database. Nagios is back up and running now.

However, I would like to know what caused this to happen in the first place as it's not the first time.

Any ideas based upon what I've already come across?
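One hint in the output above is the size of the /tmp directory entry itself: `ll -d /tmp` reports 20676608 bytes (roughly 20 MB). On ext3/ext4-style filesystems a directory grows with the number of entries it holds and does not shrink when files are deleted, so this suggests /tmp held a very large number of files at some point (the filesystem type is an assumption here; the thread does not state it). A small sketch for seeing what is piling up, with hypothetical helper names:

```shell
# Sketch: see what is accumulating in a directory. Helper names are
# illustrative. File names containing newlines would be miscounted.

# count_files DIR PATTERN: count regular files directly inside DIR whose
# names match the shell glob PATTERN.
count_files() {
    find "$1" -maxdepth 1 -type f -name "$2" | wc -l | tr -d ' '
}

# top_prefixes DIR: group regular files in DIR by the first five characters
# of their names and list the five most common groups.
top_prefixes() {
    find "$1" -maxdepth 1 -type f | sed 's|.*/||' | cut -c1-5 |
        sort | uniq -c | sort -rn | head -n 5
}

# Example (not executed here):
#   count_files /tmp 'check*'
#   top_prefixes /tmp
```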
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: Nagios DB Crash Help

Post by lmiltchev »

Can you show us the command you ran to copy the script to /tmp, and its output? Also, show the output of:

Code: Select all

df -i
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Nagios DB Crash Help

Post by jbennett »

lmiltchev wrote:Can you show us the command you ran to copy the script to the /tmp, and the output of it? Also, show the output of:

Code: Select all

df -i
Just moving it from my folder on the server to the /tmp folder:

Code: Select all

# mv NagiosXI-FixPerms.sh /tmp
It is far easier for me to download the script, scp it to my folder on the server (I cannot copy directly to /tmp), and then change permissions than it is to open a ticket for the server guys to configure the proxy so I can download it directly. It is generally far quicker (by days) as well.

Code: Select all

# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/VolGroup00-LogVol00_ROOT
                     12799776  188490 12611286    2% /
/dev/mapper/VolGroup00-LogVol00
                      793600      25  793575    1% /tmp
/dev/mapper/VolGroup00-LogVol00_VAR
                     1540096   90411 1449685    6% /var
/dev/hda1              50200      52   50148    1% /boot
tmpfs                1538707       1 1538706    1% /dev/shm
tmpfs                1538707      27 1538680    1% /var/nagiosramdisk
10.100.3.220:/kickstart
                     51216384  270850 50945534    1% /kickstart
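For context on the two df requests: `df -h` reports block usage and `df -i` reports inode usage, and running out of either produces the same "No space left on device" error. A minimal sketch that prints both figures for one path on a single line is below; the function name is hypothetical, and `-i` is a GNU/BSD extension to df, so treat this as Linux-oriented. The `-P` flag keeps each filesystem on one line, avoiding the wrapped device names seen in the output above.

```shell
# Sketch: report block and inode usage for the filesystem holding a path.
# fs_usage is an illustrative name, not an existing tool.
fs_usage() {
    # Column 5 of the second line is Use% (blocks) or IUse% (inodes).
    fu_blocks=$(df -Pk "$1" | awk 'NR==2 {print $5}')
    fu_inodes=$(df -Pi "$1" | awk 'NR==2 {print $5}')
    printf '%s blocks=%s inodes=%s\n' "$1" "$fu_blocks" "$fu_inodes"
}

# Example (not executed here):
#   for p in / /tmp /var /var/nagiosramdisk; do fs_usage "$p"; done
```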
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios DB Crash Help

Post by abrist »

When /tmp is full, it is usually due to check* files. This means that at some point, checks were getting saved to disk but not reaped. The most common cause is passive checks that continue to get written to /tmp while the nagios process is stopped. This can also happen if you had multiple nagios parent processes running concurrently.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
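A quick way to look for the "multiple nagios parent processes" case mentioned above might be to count nagios processes whose parent is PID 1. This is only a sketch under an assumption: that a daemonized Nagios is reparented to init (PID 1) while its check children have the daemon as parent. The function name is hypothetical.

```shell
# Sketch: count top-level nagios daemons from a process listing.
# count_nagios_parents reads "PID PPID COMMAND" lines on stdin and counts
# processes named "nagios" whose parent is PID 1 (assumed to be init).
count_nagios_parents() {
    awk '$3 == "nagios" && $2 == 1 { n++ } END { print n + 0 }'
}

# Example (not executed here):
#   ps -eo pid=,ppid=,comm= | count_nagios_parents
```

Under that assumption, a result above 1 would point to more than one daemon running at once, and 0 would mean nothing is reaping the check result files.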
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Nagios DB Crash Help

Post by jbennett »

And what is the best way to clear these? Just going through the /tmp directory and manually deleting them?

Since this happened over the weekend, when no one else was in the system, I'm wondering if this might have been caused by the Nagios process getting stuck.

It seems that I'm still running into issues with incredibly high CPU usage since this all happened. I'm routinely over 90% now.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios DB Crash Help

Post by abrist »

The best way to remove the checks is to use a wildcard:

Code: Select all

rm -f /tmp/check*
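If the directory holds a very large number of files, the shell may expand check* into more arguments than the kernel allows and rm fails with "Argument list too long". A sketch of an alternative using find is below, assuming GNU or BSD find (for -maxdepth and -delete) and assuming the files are known to be orphaned; the function name is hypothetical.

```shell
# Sketch: remove orphaned check result files without relying on glob
# expansion. purge_check_files is an illustrative name.

# purge_check_files DIR: delete regular files named check* directly inside
# DIR (subdirectories are left alone) and print how many were removed.
purge_check_files() {
    find "$1" -maxdepth 1 -type f -name 'check*' -print -delete |
        wc -l | tr -d ' '
}

# Example (not executed here):
#   purge_check_files /tmp
```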
jbennett
Posts: 522
Joined: Mon Apr 16, 2012 3:00 pm

Re: Nagios DB Crash Help

Post by jbennett »

I would still like to know why this is happening in the first place.

I came in this morning and the system is completely locked up. I can't even access it via console.

From yesterday, I noticed that /dev/shm was full. My understanding is that this is swap memory?

I was thinking that Nagios would finish processing and swap usage would drop back down.

What can I check to see what is causing everything to start building up and not process?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Nagios DB Crash Help

Post by abrist »

Do you use a ramdisk? (This can fill up a small /tmp partition quickly if npcd has an issue.)

Checks get reaped by nagios, so unreaped "check*" files are usually due to nagios not running. You may get some hints as to what caused this by looking at your logs for the evening when things failed. Additionally, check the Audit Log in XI for any changes made late that night.
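Nagios log lines start with a bracketed Unix timestamp, e.g. "[1379939227] ...", so the evening in question can be cut out of the log with a small filter. The sketch below is illustrative only: the function name is hypothetical, and the log path in the example is an assumption about a default install, not something stated in this thread.

```shell
# Sketch: extract a time window from a Nagios-style log.
# log_window START END: print stdin lines whose leading [epoch] timestamp
# falls within START..END (inclusive, seconds since the Unix epoch).
log_window() {
    awk -F'[][]' -v s="$1" -v e="$2" '$2 + 0 >= s + 0 && $2 + 0 <= e + 0'
}

# Example (not executed here; path and times are placeholders, and
# "date -d" is GNU date syntax):
#   log_window "$(date -d '2013-09-22 18:00' +%s)" \
#              "$(date -d '2013-09-23 08:30' +%s)" \
#              < /usr/local/nagios/var/nagios.log
```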