Nagios stuck, won't check devices in queue
Posted: Fri Feb 10, 2012 9:34 am
by cwscribner
Hi all.
Within the past week, my XI server has developed some additional odd behavior. I have two devices that have been pending for about a week. Also, I deleted several hundred devices yesterday through the CCM, and those changes aren't showing up either. Oddly enough, Apply Configuration is working fine, with no timeouts or anything. Any thoughts on what would cause a seemingly arbitrary stop to device processing?
P.S. It's still checking the devices that are already accounted for. It's just the new changes that haven't been assimilated yet.
Re: Nagios stuck, won't check devices in queue
Posted: Fri Feb 10, 2012 2:21 pm
by scottwilkerson
Can we check whether the configuration files for these hosts/services actually got deleted from /usr/local/nagios/etc/ (the hosts or services directory)?
Re: Nagios stuck, won't check devices in queue
Posted: Fri Feb 10, 2012 9:24 pm
by cwscribner
According to my co-worker who ran the analysis, there are ~1150 host config files for devices that are not in the database...
How in the heck can there be that much of a disconnect between the interface and actual config files?!
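One way to quantify a mismatch like this is to compare the file names in the hosts directory against a list of host names exported from the CCM. The sketch below is only an illustration: it assumes each host is written to its own `<host_name>.cfg` file, and `find_orphan_configs` and the example host names are made up for this post, not part of Nagios XI.

```python
from pathlib import Path


def find_orphan_configs(config_dir, known_hosts):
    """Return sorted names of .cfg files whose host is not in known_hosts.

    Assumption: each host lives in a file named <host_name>.cfg.
    """
    known = set(known_hosts)
    return sorted(
        p.name
        for p in Path(config_dir).glob("*.cfg")
        if p.stem not in known
    )


if __name__ == "__main__":
    # Hypothetical host names; in practice, export the list from the CCM database.
    hosts_in_ccm = ["router01", "switch02"]
    for name in find_orphan_configs("/usr/local/nagios/etc/hosts", hosts_in_ccm):
        print(name)
```

Files that are not `.cfg` are ignored, and only the final `.cfg` suffix is stripped, so a host named `c.example.com` still matches `c.example.com.cfg`.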
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 10:35 am
by mguthrie
That is a substantial disconnect. It would be worth verifying the permissions on the files under the nagios/etc directory. You can reset these by running:
/usr/local/nagiosxi/scripts/reset_config_perms
You might also try test-deleting a few hosts, and make sure no PHP timeouts or memory limits are being hit when you do. You can tail the Apache log and see if anything shows up.
Is it common in your environment for large numbers of hosts to be deleted at once?
Do you guys use the "active/inactive" functionality of the Core Config Manager much in your environment for your configs?
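For the timeout and memory-limit check, something like the following can filter an Apache error log for the two fatal-error messages PHP normally writes in those cases. This is a rough sketch: the log path is an assumption (it varies by distribution), and `find_php_limit_errors` is just a helper name for this example.

```python
import os

# Substrings of PHP's standard fatal-error messages for the two limits.
PHP_LIMIT_MARKERS = (
    "Maximum execution time of",  # max_execution_time exceeded
    "Allowed memory size of",     # memory_limit exhausted
)


def find_php_limit_errors(lines):
    """Return the log lines that mention a PHP timeout or memory-limit error."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in PHP_LIMIT_MARKERS)
    ]


if __name__ == "__main__":
    # Assumed path; adjust for your system.
    log_path = "/var/log/httpd/error_log"
    if os.path.isfile(log_path):
        with open(log_path, errors="replace") as log:
            for hit in find_php_limit_errors(log):
                print(hit)
```

If nothing is printed after a bulk delete, PHP limits are probably not the cause, though the errors could also be going to a different log.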
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 10:48 am
by cwscribner
We often do delete hosts in bulk through the CCM. I have nagiosql set to produce ~200 lines to make it easier. When doing an apply config, it doesn't ever time out, but it takes several minutes to complete, probably due to the number of devices.
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 11:25 am
by mguthrie
Unfortunately I don't have any obvious ideas that come to mind as to how so many of those could have gotten deleted by the CCM, but not in the files. It does happen periodically where a host deletion will fail to delete a file correctly, so it's gone from the CCM, but not from the XI interface. However, I've never seen it fail on that scale before.
The host files are safe to physically delete from the XI server. When you attempt to delete a host from the Core Config Manager, do you get any error messages at the bottom of the page showing that the deletion failed?
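If you do clear out stale host files by hand, moving them aside first is a bit safer than deleting them outright. A minimal sketch, assuming you already have a list of file names to remove; `quarantine_configs` and the backup directory are hypothetical, not an XI tool.

```python
import shutil
from pathlib import Path


def quarantine_configs(config_dir, file_names, backup_dir):
    """Move the named files from config_dir into backup_dir.

    Returns the names that were actually moved; missing files are skipped.
    """
    src = Path(config_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in file_names:
        source = src / name
        if source.is_file():
            shutil.move(str(source), str(dst / name))
            moved.append(name)
    return moved
```

Nothing is deleted, so a file can be copied back if it turns out to still be needed.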
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 11:34 am
by cwscribner
Nope. I just removed ~400 devices via the CCM and not one gave an error. The majority of the devices either have no services associated with them, or they use group-associated config files.
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 11:40 am
by mguthrie
Now just to clarify, these devices that you just removed, were the config files for them deleted correctly or do they still appear to be there? (/usr/local/nagios/etc/hosts)
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 11:43 am
by cwscribner
I haven't done a 1:1 comparison but some are still there. There were two devices that got added last week and they never got past pending. They hung there for ~5 days before I deleted them. I think that's the point when this problem started.
Re: Nagios stuck, won't check devices in queue
Posted: Mon Feb 13, 2012 12:31 pm
by cwscribner
In theory, if I deleted ALL of the host configuration files and then did an Apply Configuration, the database would propagate all of the proper config files, right?