Nagios stuck, won't check devices in queue
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Nagios stuck, won't check devices in queue
Hi all.
My XI server has within the past week developed some additional odd behavior. I have two devices that are pending and have been as such for about a week. Also, I deleted several hundred devices yesterday through the CCM and those changes aren't showing up either. Oddly enough the Apply Configuration is working fine. No timeouts or anything. Any thoughts on what would cause a seemingly arbitrary stop to device processing?
P.S. It's still checking the devices that are already accounted for. It's just the new changes that haven't been assimilated yet.
Last edited by cwscribner on Mon Feb 20, 2012 4:55 pm, edited 1 time in total.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Nagios stuck, won't check devices in queue
Can we check to see if the configuration files for these hosts/services actually got deleted from
/usr/local/nagios/etc/ ( hosts or services directory)
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Re: Nagios stuck, won't check devices in queue
According to my co-worker who ran the analysis, there are ~1150 host config files for devices that are not in the database...
How in the heck can there be that much of a disconnect between the interface and actual config files?!
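For anyone wanting to repeat that kind of comparison, here is a minimal sketch of one way to do it. It assumes each host config file is named `<hostname>.cfg`, and that you have already exported the host names the database knows about to a plain text file, one name per line. The function name `list_orphan_cfgs` and the export file are hypothetical, not part of Nagios XI.

```shell
#!/bin/sh
# list_orphan_cfgs CFG_DIR KNOWN_HOSTS_FILE
# Prints the name (without .cfg) of every config file in CFG_DIR that has
# no matching line in KNOWN_HOSTS_FILE.
# Assumption: files are named <hostname>.cfg; KNOWN_HOSTS_FILE is a
# hand-made export with one host name per line.
list_orphan_cfgs() {
    cfg_dir=$1
    known_hosts=$2
    for cfg in "$cfg_dir"/*.cfg; do
        [ -f "$cfg" ] || continue              # glob matched nothing
        name=$(basename "$cfg" .cfg)
        # -x: whole-line match, -F: literal string, -q: no output
        grep -qxF -- "$name" "$known_hosts" || printf '%s\n' "$name"
    done
}

# Example (the export file path is hypothetical):
#   list_orphan_cfgs /usr/local/nagios/etc/hosts /tmp/hosts_in_db.txt | wc -l
```

It only reports the names and does not delete anything, so it is safe to run against a live config directory.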
Re: Nagios stuck, won't check devices in queue
That is a substantial disconnect. It would be worth verifying the permissions on the files under the nagios/etc directory. You can reset these by running:
Code: Select all
/usr/local/nagiosxi/scripts/reset_config_perms
You might also try test-deleting a few hosts, and make sure no PHP timeouts or memory limits are being hit when attempting this. You can tail the Apache log and see if anything shows up:
Code: Select all
tail -f /var/log/httpd/error_log
Is it common in your environment for large numbers of hosts to be deleted at once?
Do you guys use the "active/inactive" functionality of the Core Config Manager much in your environment for your configs?
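If watching the log live is awkward, a rough alternative is to search it after the fact for the two standard PHP fatal errors that indicate a time or memory limit was hit. This is only a sketch: the function name `php_limit_errors` is made up, and the log path in the usage line is the one from the post above and may differ on your system.

```shell
#!/bin/sh
# php_limit_errors LOGFILE
# Prints log lines reporting that PHP hit max_execution_time or memory_limit.
# The patterns are PHP's standard fatal error messages for those two limits.
php_limit_errors() {
    grep -E 'Maximum execution time of [0-9]+ seconds? exceeded|Allowed memory size of [0-9]+ bytes exhausted' "$1"
}

# Example:
#   php_limit_errors /var/log/httpd/error_log | tail -n 20
```

No output means neither message appears in that log file, which would point away from a PHP limit as the cause.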
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Re: Nagios stuck, won't check devices in queue
We often do delete hosts in bulk through the CCM. I have nagiosql set to produce ~200 lines to make it easier. When doing an apply config, it doesn't ever time out, but it takes several minutes to complete, probably due to the number of devices.
Re: Nagios stuck, won't check devices in queue
Unfortunately I don't have any obvious ideas that come to mind as to how so many of those could have gotten deleted by the CCM, but not in the files. It does happen periodically where a host deletion will fail to delete a file correctly, so it's gone from the CCM, but not from the XI interface. However, I've never seen it fail on that scale before.
The host files are safe to physically delete from the XI server. When you attempt to delete a host from the Core Config Manager, do you get any error messages at the bottom of the page showing that the deletion failed?
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Re: Nagios stuck, won't check devices in queue
Nope. I just removed ~400 devices via CCM and not one gave an error. The majority of the devices either have no services associated with them, or they use group associated config files.
Re: Nagios stuck, won't check devices in queue
Now just to clarify, these devices that you just removed, were the config files for them deleted correctly or do they still appear to be there? (/usr/local/nagios/etc/hosts)
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Re: Nagios stuck, won't check devices in queue
I haven't done a 1:1 comparison but some are still there. There were two devices that got added last week and they never got past pending. They hung there for ~5 days before I deleted them. I think that's the point when this problem started.
-
cwscribner
- Posts: 316
- Joined: Thu Mar 31, 2011 9:54 am
- Location: Patten, ME
- Contact:
Re: Nagios stuck, won't check devices in queue
In theory, if I deleted ALL of the host configuration files and then did an apply configuration, the database would propagate all of the proper config files, right?
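Whether or not Apply Configuration rewrites everything as hoped, a cautious way to try this is to move the files aside rather than delete them, so they can be put back if something is missing afterwards. A minimal sketch, assuming the host files are the `*.cfg` files directly inside the hosts directory; the function name `stash_cfgs` and the backup location are hypothetical.

```shell
#!/bin/sh
# stash_cfgs SRC_DIR BACKUP_ROOT
# Moves every *.cfg file in SRC_DIR into a new timestamped directory under
# BACKUP_ROOT and prints that directory's path. Other files are left alone.
stash_cfgs() {
    src=$1
    dest="$2/hosts-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    for cfg in "$src"/*.cfg; do
        [ -f "$cfg" ] || continue              # glob matched nothing
        mv -- "$cfg" "$dest/" || return 1
    done
    printf '%s\n' "$dest"
}

# Example (the backup root is a hypothetical path):
#   stash_cfgs /usr/local/nagios/etc/hosts /root/nagios-cfg-backup
```

To undo it, move the files from the printed directory back into the hosts directory before applying the configuration again.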