Chef client service showing intermittently down

TrezOne
Posts: 5
Joined: Thu Feb 26, 2015 10:10 am

Chef client service showing intermittently down

Post by TrezOne »

Hi folks,

So I'm pretty new to Nagios (version 4.0.7, specifically) and have only been working with it for about 3 weeks. I have an issue that I've been trying to troubleshoot without success, and with my limited knowledge I haven't been able to find much about it. I'll do my best to explain, as this is something I've been working on since I started with Nagios and NSCA, so forgive the length or any lack of clarity...

I inherited an environment that, among other active and passive checks, has a passive check that reports every 90 minutes on whether a host's Chef client is running. Before the issue began, I was modifying commands.cfg and services.cfg to add an active check for whether Apache was running on our hosts. After I made the change and restarted Nagios and NSCA, the following issue started (note that once it started, I reverted my changes to the prior configuration):

On our Nagios site, if I refresh the Host Group page or any page showing the services of a host that runs a Chef client, the Critical alarm for the Chef Client Status (the alarm information states "CRITICAL: No report within specified interval") will appear and disappear (see here for a visual). The visual flapping (for want of a better word) happens at intervals of less than one minute. However, if I repeatedly run a manual check from the Nagios server CLI (check_nrpe -c check_chef_client -H <insert host IP>) against the same host, I don't see any state changes; it always comes back with an OK status. Even when I go to the servers and monitor the Chef client directly, there is no change in its running state.

I figured there might be something stale in retention.dat and status.dat, which I gather (please correct me if I'm wrong) are the files where Nagios stores the state of each host along with the configuration of notifications, active/passive checks, etc. So I stopped Nagios, removed those two files, started Nagios back up, and then restarted NSCA. No joy, though. I've also gone to the hosts in the web UI and rescheduled the service check to run immediately with the force check option, also to no avail.

If there's any additional info that you might need to assist (logs, etc.), I'll give what I can. In advance, thank you for any help that you can provide!
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Chef client service showing intermittently down

Post by abrist »

I would assume one of two things is happening:
1) Nagios is not receiving the check results, so the service is ending up stale.
2) The freshness check interval is too low. As the passive service only reports every 90 minutes, the freshness interval should probably be larger than that.
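
To illustrate the second point, here is a minimal Python sketch of a staleness test. It is a simplified model, not Nagios's actual freshness algorithm: the function name `is_stale`, the timestamps, and the 6300-second alternative threshold are all assumptions chosen for illustration. It shows that when the threshold equals the 90-minute reporting interval, a report arriving only two minutes late is already treated as stale.

```python
from datetime import datetime, timedelta


def is_stale(last_result: datetime, now: datetime, freshness_threshold_s: int) -> bool:
    """Return True when the newest passive result is older than the threshold.

    Simplified model for illustration; not Nagios's internal logic.
    """
    return (now - last_result) > timedelta(seconds=freshness_threshold_s)


REPORT_INTERVAL_S = 90 * 60  # the Chef check reports every 90 minutes

# Hypothetical timestamps: the next report is two minutes late.
last = datetime(2015, 3, 1, 12, 0, 0)
late_check = last + timedelta(seconds=REPORT_INTERVAL_S + 120)

# Threshold equal to the reporting interval leaves no slack.
print(is_stale(last, late_check, 5400))  # True

# A threshold with 15 minutes of headroom tolerates the delay.
print(is_stale(last, late_check, 6300))  # False
```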
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
TrezOne
Posts: 5
Joined: Thu Feb 26, 2015 10:10 am

Re: Chef client service showing intermittently down

Post by TrezOne »

I'm leaning more towards the results being stale. Another thing I should mention (sorry that I didn't before) is that when a host's Chef Client Status service shows as CRITICAL, the duration on the alarm is 8 days; when it shows as OK, the duration is 2 hours (at the time of posting). Assuming the results are stale, where are they actually stored? I was under the impression that they were stored in retention.dat and status.dat, no?
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Chef client service showing intermittently down

Post by abrist »

If the results are stale, Nagios most likely has not received a check result recently (for 8 days, in your example).
Can you verify that the remote host is still sending passive results?
If it is, check the core event log to see if a passive for the service in question was received recently . . . .
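
One way to do that log check is sketched below in Python. It assumes passive check logging is enabled (`log_passive_checks=1` in nagios.cfg), so that the log contains `PASSIVE SERVICE CHECK` entries; the helper name `last_passive_results`, the host names, and the sample lines are hypothetical.

```python
import re

# Matches entries such as:
# [1425000000] PASSIVE SERVICE CHECK: host;service;state;output
PASSIVE_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] PASSIVE SERVICE CHECK: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>\d+);(?P<output>.*)$"
)


def last_passive_results(lines, service):
    """Map each host to the epoch of its newest passive result for `service`."""
    latest = {}
    for line in lines:
        m = PASSIVE_RE.match(line.strip())
        if m and m.group("service") == service:
            host = m.group("host")
            latest[host] = max(int(m.group("epoch")), latest.get(host, 0))
    return latest


# Hypothetical log excerpt; on a real server, iterate over the nagios.log file instead.
sample = [
    "[1425000000] PASSIVE SERVICE CHECK: web01;Chef Client Run Status;0;OK: run finished",
    "[1425000100] PASSIVE SERVICE CHECK: web02;Chef Client Run Status;0;OK: run finished",
    "[1425005400] PASSIVE SERVICE CHECK: web01;Chef Client Run Status;0;OK: run finished",
    "[1425005500] SERVICE ALERT: web02;Chef Client Run Status;CRITICAL;HARD;1;CRITICAL: No report within specified interval",
]

print(last_passive_results(sample, "Chef Client Run Status"))
```

A host whose newest timestamp is much older than the reporting interval, or that is missing from the result entirely, is a candidate for "the remote host is no longer sending results".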
TrezOne
Posts: 5
Joined: Thu Feb 26, 2015 10:10 am

Re: Chef client service showing intermittently down

Post by TrezOne »

It's conflicting, at best. I mean, within the nagios log, it shows:

Code:

[timestamp] CURRENT SERVICE STATE: [hostname];Chef Client Run Status;CRITICAL;HARD;1;CRITICAL: No report within specified interval
And yet, if you look at the screen shots here, it looks as though the remote host(s) is/are reporting back. Mind you, both those screenshots were taken within 2 minutes of each other :shock: And the service on every single host just flickers back and forth between those states shown in the screenshots.

EDIT: Oh, and active checks are disabled for the host; wasn't sure if this was the proper config, but it looks like the last guy before me only had passive checks configured for this.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Chef client service showing intermittently down

Post by scottwilkerson »

Do you have freshness enabled on the object config?

Can you post the relevant nagios configs and templates related to these checks?
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
TrezOne
Posts: 5
Joined: Thu Feb 26, 2015 10:10 am

Re: Chef client service showing intermittently down

Post by TrezOne »

Hi scott,

Sorry about the delayed reply, been busy. I do have freshness check enabled; below is a snippet from my services.cfg for the passive checks.

EDIT: I also want to reiterate that this behavior is intermittent. One moment the Nagios Web UI shows all hosts with the Chef Client as "CRITICAL: No report within specified interval", and the next moment, after a refresh, they revert to an OK state. Rinse and repeat.

Code:

define service {
  use                     generic-service
  name                    passive-service-base
  active_checks_enabled   0             ; results arrive passively
  passive_checks_enabled  1
  flap_detection_enabled  0
  register                0             ; template only
  max_check_attempts      1
  retry_check_interval    1
  check_freshness         1
  check_command           no-report     ; run when the last result is stale
}
....
define service {
  use                     passive-service-base
  name                    passive-service-90min
  freshness_threshold     5400          ; 90 mins
  register                0             ; template only
}
....
define service {
  use                     passive-service-90min
  hostgroup_name          chef-client-enabled
  service_description     Chef Client Run Status
}
TrezOne
Posts: 5
Joined: Thu Feb 26, 2015 10:10 am

Re: Chef client service showing intermittently down

Post by TrezOne »

Never mind, it ended up being something extremely simple: there were two Nagios processes running. Sorry for the hullabaloo, folks :) Like I said, I'm still new to Nagios and learning the ins and outs of it.
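
For anyone checking for the same condition, the Python sketch below is one possible heuristic, not an official Nagios tool. Nagios Core 4 normally shows several processes (a forked child plus workers), so it counts only "root" nagios processes, meaning those whose parent is not itself a nagios process; more than one root suggests duplicate daemons. The function name `nagios_roots`, the PIDs, and the install paths in the sample are illustrative assumptions.

```python
import os


def nagios_roots(ps_lines):
    """Return PIDs of nagios processes whose parent is not another nagios process.

    Expects lines shaped like the output of `ps -eo pid,ppid,args`.
    """
    procs = {}
    for line in ps_lines:
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header row and malformed lines
        pid, ppid, args = int(parts[0]), int(parts[1]), parts[2]
        if os.path.basename(args.split()[0]) == "nagios":
            procs[pid] = ppid
    return sorted(pid for pid, ppid in procs.items() if ppid not in procs)


# Hypothetical `ps -eo pid,ppid,args` output with two independent daemons.
sample = [
    "  PID  PPID COMMAND",
    " 1201     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg",
    " 1203  1201 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh",
    " 1207  1201 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg",
    " 8842     1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg",
    " 8850  8842 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh",
    " 9001     1 /usr/sbin/nsca -c /etc/nagios/nsca.cfg",
]

roots = nagios_roots(sample)
print(roots)  # [1201, 8842]
if len(roots) > 1:
    print("More than one Nagios daemon appears to be running")
```

On a live server, the input lines could come from running `ps -eo pid,ppid,args` and splitting its output by line.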
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Chef client service showing intermittently down

Post by cmerchant »

Glad it was something easy to fix. We'll go ahead and close the thread. Thanks.