Chef client service showing intermittently down
Posted: Wed Mar 04, 2015 11:06 am
Hi folks,
So I'm pretty new to Nagios (version 4.0.7, specifically) and have only been working with it for about 3 weeks. I have an issue that I've been trying to troubleshoot without success, and with my limited knowledge I haven't been able to find much about it. I'll do my best to explain, as this is something I've been working on since I started with Nagios and NSCA, so forgive me for length or lack of clarity...
I inherited an environment that, among other active and passive checks, has a passive check that verifies every 90 minutes whether a host's Chef client is running. Before the issue began, I was modifying commands.cfg and services.cfg to add an active check for whether Apache was running on our hosts. After I made the change and restarted Nagios and NSCA, the following issue started (I'd like to note that once it started, I reverted my changes to the prior configuration):
On our Nagios site, if I refresh the Host Group page or any page showing the services of a host that has a Chef client running, the Critical alarm for the Chef Client Status (the alarm's status information reads "CRITICAL: No report within specified interval") will disappear and reappear (see here for a visual). The interval for this visual flapping (for want of a better word) is less than one minute. However, if I repeatedly run the check manually from the Nagios server CLI (check_nrpe -c check_chef_client -H <insert host IP>) for the same host, I don't see any state changes: it comes back with an OK status every time. Monitoring the Chef client directly on the servers shows no change in its running state either.
I figured there might be something stale in retention.dat and status.dat, which I gather (please correct me if I'm wrong) are the files where Nagios stores the state of each host along with the configuration of notifications, active/passive checks, etc. So I stopped Nagios, removed those two files, started Nagios back up, and restarted NSCA once Nagios was running. No joy though. I've also gone to the hosts in the web UI and re-scheduled the service check to run immediately as a forced check, also to no avail.
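For anyone wanting to look at what Nagios itself has recorded for the service rather than what the web UI shows, here is a minimal sketch of reading status.dat. It assumes the usual block layout of that file (a `servicestatus {` header, tab-indented `key=value` lines, and a closing `}`); the helper names `parse_status_blocks` and `service_entries` are made up for illustration, and the file path in the final comment is only the common source-install default, not necessarily where it lives on a given server.

```python
def parse_status_blocks(text):
    """Parse status.dat-style text into a list of dicts, one per block.

    Each dict holds the block's key=value pairs plus a "_type" entry
    with the block name (for example "servicestatus").
    """
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current is None:
            if line.endswith("{"):
                # Start of a new block, e.g. "servicestatus {"
                current = {"_type": line[:-1].strip()}
        elif line == "}":
            blocks.append(current)
            current = None
        elif "=" in line:
            # Split on the first "=" only, since values may contain "="
            key, value = line.split("=", 1)
            current[key] = value
    return blocks


def service_entries(blocks, description):
    """Return the servicestatus blocks whose service_description matches."""
    return [
        b for b in blocks
        if b["_type"] == "servicestatus"
        and b.get("service_description") == description
    ]


# Example use (path is an assumption; adjust to the local install):
# with open("/usr/local/nagios/var/status.dat") as fh:
#     for svc in service_entries(parse_status_blocks(fh.read()), "Chef Client Status"):
#         print(svc.get("host_name"), svc.get("current_state"),
#               svc.get("check_type"), svc.get("last_check"), svc.get("plugin_output"))
```

Running this a few times in a row and comparing `current_state`, `last_check`, and `plugin_output` for one host should show whether the recorded state is really changing between refreshes or only the display is.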
If there's any additional info that you might need to assist (logs, etc.), I'll give what I can. In advance, thank you for any help that you can provide!