General Host/Service Counts OK, Dashboards Zeroed
Posted: Wed Jan 02, 2013 12:00 pm
This one requires a little bit of a backstory ...
We're using CentOS 6.0 to host our Nagios instance, and about a week ago the drive filled up ... not on space, but on inodes, through a large influx of files in the /tmp folder. I don't know whether this is a CentOS or a Nagios issue; I only found one other report of it happening to anyone. Anyway, I got the /tmp directory cleaned up, and after that the messages log caused the disk to actually fill up on space! I deleted that messages file, so we're finally back at a place where we have plenty of inodes and plenty of space.
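For anyone who wants to check for the same condition, here is a minimal sketch of how inode usage can be read separately from disk space. It uses Python's standard `os.statvfs`; the function name `inode_usage` and the mount points checked are only my own choices for illustration, not anything Nagios provides.

```python
import os


def inode_usage(path="/"):
    """Return (used, total, percent_used) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_ffree    # free inodes
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = (used / total * 100.0) if total else 0.0
    return used, total, percent


if __name__ == "__main__":
    # Mount points here are assumptions; adjust to the local layout.
    for mount in ("/", "/tmp"):
        if os.path.exists(mount):
            used, total, pct = inode_usage(mount)
            print(f"{mount}: {used}/{total} inodes used ({pct:.1f}%)")
```

A filesystem can show plenty of free space while this percentage sits at 100, which is the situation described above.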
However, upon getting Nagios back (and Nagios clearing out some old emails that it hadn't been able to send), initially every host/service detail was zeroed out, even though there was a full configuration in place. Additionally, when I attempted to back up Nagios via script (xi_backup.sh), it said the MySQL database didn't have the correct password. I ran the script to fix the MySQL database, which fixed it to the point that I could make a solid backup. It also brought the host/service checks back to normal, but there are some glitches I'd like to fix that I cannot explain:
* Our host/service checks for dashboards (based on service groups) are all still showing zeroes.
* In a list view, service checks properly show their state and description, but if you drill down into a service check, it says the check is pending, and all of its statuses are marked red as if they're inactive.
Now, I think Nagios is still doing its thing, or it's trying to. Active service checks under the monitoring engine still show plenty of activity in the 1/5/15 minute intervals, but I'm curious what I might need to do to fix the visibility issues that remain. I am willing to upgrade to the 2012 versions if that will resolve the issue, as I'd like to get there anyway, but I want to find the cause before I upgrade just for the sake of upgrading.
Also, the /tmp directory has a large number of check* files in it. I assume this is normal behavior, but can you confirm?
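In case numbers would help, this is roughly how I could count those files and see how many have been sitting around for a while. It is only a sketch: the helper name `stale_check_files` and the one-hour age threshold are my own assumptions, and I'm not claiming anything about how long Nagios is supposed to keep these files.

```python
import glob
import os
import time


def stale_check_files(directory="/tmp", pattern="check*", max_age_seconds=3600):
    """Return (total_matches, stale_paths) for files matching pattern in directory.

    A file counts as stale when its modification time is more than
    max_age_seconds in the past. The threshold is an arbitrary assumption.
    """
    now = time.time()
    matches = glob.glob(os.path.join(directory, pattern))
    stale = []
    for path in matches:
        try:
            if now - os.path.getmtime(path) > max_age_seconds:
                stale.append(path)
        except OSError:
            # The file vanished between listing and stat; skip it.
            continue
    return len(matches), sorted(stale)


if __name__ == "__main__":
    total, stale = stale_check_files()
    print(f"{total} check* files in /tmp, {len(stale)} older than one hour")
```

If the stale count keeps growing between runs, that would suggest the files are not being cleaned up, which is what I'd want to rule out given the earlier inode problem.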
Current Version: 2011R3.3
Any other information you might need, I'll be happy to provide.