General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 12:00 pm
by sav2880
This one requires a little bit of a backstory ...

* We're using CentOS 6.0 to host our Nagios instance, and about a week ago the drive filled up ... not on space, but on inodes, through a large influx of files in the /tmp folder. I don't know whether this is a CentOS or a Nagios issue; I only found one other report of it happening. Anyway, I got the /tmp directory cleaned up, and after that the messages log caused the disk to actually fill up! I deleted that messages file, so we're finally back at a place where we have plenty of inodes and plenty of space.

However, upon getting Nagios back (and Nagios clearing out some old emails that it hadn't been able to send), initially every host/service detail was zeroed out, even though there was a full configuration in place. Additionally, when I attempted to back up Nagios via script (xi_backup.sh), it said the MySQL database didn't have the correct password. I ran the script to fix the MySQL database, which fixed it to the point that I could make a solid backup. It also brought the host/service checks back to normal, but there are some remaining glitches that I cannot explain and would like to fix:

* Our host/service checks for dashboards (based on service groups) are all still showing zeroes.
* In a list view, service checks properly show their state and description, but if you drill down into a service check, it says the check is pending, and all of its statuses are marked red as if they're inactive.

Now, I think Nagios is still doing its thing, or at least trying to. Active service checks under the monitoring engine still show plenty of activity in the 1/5/15 minute intervals, but I'm curious what I might need to do to fix the visibility issues that remain. I am willing to upgrade to the 2012 versions if that will resolve the issue, as I'd like to get there anyway, but I want to find the cause before I upgrade just for the sake of upgrading.

Also, the /tmp directory has a large number of check* files in it. I assume this is normal behavior, but can you confirm?

Current Version: 2011R3.3

Any other information you might need, I'll be happy to provide.

Re: General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 12:04 pm
by scottwilkerson
this is very likely db corruption.

run the following

/usr/local/nagiosxi/scripts/repairmysql.sh nagios

Re: General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 12:09 pm
by sav2880
scottwilkerson wrote: this is very likely db corruption.

run the following

/usr/local/nagiosxi/scripts/repairmysql.sh nagios

Re-running this now ... this is the same command I ran when all of the service/host checks showed zero, but this time I'm taking extra steps to make sure all of the active services are stopped before the repair. Leaving them running might be why it didn't fully work the first time.
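
For anyone following along, the "stop everything, repair, start everything" sequence might be sketched as below. This is only a sketch under assumptions: the service names (nagios, ndo2db) are guesses at what writes to the database on a typical XI host and are not confirmed in this thread, and the `run` and `repair_with_services_stopped` helpers are made up for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only. The service names below are ASSUMPTIONS, not taken from this
# thread; adjust them to whatever actually writes to the database on your host.
SERVICES="${SERVICES:-nagios ndo2db}"

# Repair command exactly as posted above.
REPAIR="/usr/local/nagiosxi/scripts/repairmysql.sh nagios"

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop services, run the repair, start them again.
repair_with_services_stopped() {
    # Stop the services in the listed order.
    for s in $SERVICES; do
        run service "$s" stop
    done

    # REPAIR is left unquoted on purpose so it splits into script + argument.
    run $REPAIR

    # Start the services again in reverse order.
    rev=""
    for s in $SERVICES; do
        rev="$s $rev"
    done
    for s in $rev; do
        run service "$s" start
    done
}
```

Calling `repair_with_services_stopped` with the defaults only prints the planned commands; setting `DRY_RUN=0` would actually run them. Whether mysqld itself needs to be stopped beforehand depends on what the repair script does internally, so it is worth reading the script before adding it to the list.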

Re: General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 12:15 pm
by abrist
sav2880 wrote: Also, the /tmp directory has a large number of check* files in it. I assume this is normal behavior, but can you confirm?

Go ahead and delete them; they are most likely the cause of the inode problem.
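
A minimal sketch of that cleanup, assuming GNU find (as shipped with CentOS). The function names are hypothetical, and the age cutoff is an added precaution that is not part of the advice above: it is meant to leave alone any files that may still be waiting to be processed.

```shell
#!/bin/sh
# Hypothetical helper: count regular files named check* directly under a
# directory (default /tmp), without descending into subdirectories.
count_check_files() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type f -name 'check*' | wc -l | tr -d ' '
}

# Hypothetical helper: delete check* files last modified more than N minutes
# ago (default 60). The cutoff is an assumption, not something from this thread.
purge_stale_check_files() {
    dir="${1:-/tmp}"
    age_min="${2:-60}"
    find "$dir" -maxdepth 1 -type f -name 'check*' -mmin +"$age_min" -delete
}
```

For example, `count_check_files /tmp` reports how many such files exist, and `purge_stale_check_files /tmp 60` removes only those older than an hour. Running the count before and after shows how much was cleared.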

Re: General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 12:51 pm
by sav2880
Consider them deleted. I still see check* files being generated, but they're also being processed properly, which is better than before.

Additionally, I think I've solved the hostgroup/servicegroup display issue post-repair ... I had to make a trivial change to the service group and host group configurations (in my case, I just changed a description), and now they're showing up okay again. I assume the minor configuration change triggered a rebuild of some of the corrupted data?

Re: General Host/Service Counts OK, Dashboards Zeroed

Posted: Wed Jan 02, 2013 2:38 pm
by abrist
Changing a description should not have caused the problem. You may have had a combination of crashed tables and services unable to come up when the disk filled. You may want to verify your XI config files just to be sure. If everything checks out, go ahead and try making your description changes again, and let us know if the problems recur.

I have seen this happen one other time. The disk filled and space was cleared up, but passive check results were using up the inodes before nagios could reap them. As a result, neither nagios nor mysqld was able to create a lock.
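
Since inode exhaustion does not show up in a plain free-space check, a small watch on inode usage can catch this earlier. The sketch below assumes GNU df, where `df -P -i` prints the IUse% figure in the fifth column. The function names and the 90 percent threshold are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: print inode usage, as an integer percent, for the
# filesystem that holds the given path (default /tmp).
# Note: some filesystems report "-" instead of a number in this column.
inode_use_pct() {
    df -P -i "${1:-/tmp}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed (exit status 0) when usage is at or above
# the threshold, fail otherwise.
inode_over_threshold() {
    [ "$1" -ge "$2" ]
}

# Hypothetical helper: print a one-line status for a path and threshold.
report_inodes() {
    path="${1:-/tmp}"
    limit="${2:-90}"
    pct=$(inode_use_pct "$path")
    if inode_over_threshold "$pct" "$limit"; then
        echo "WARNING: inode usage on $path is ${pct}%"
    else
        echo "OK: inode usage on $path is ${pct}%"
    fi
}
```

Running `report_inodes /tmp 90` from cron, or wrapping it as a check, would flag the filesystem before lock files can no longer be created.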