
Strange state change issue

Posted: Wed Mar 25, 2015 11:49 pm
by sspaise
Hi everyone,

I have a Nagios server using a mix of gearman and NSCA checking. I'm seeing a strange problem and wondering if anyone has an idea as to why it happens. Here's the issue:

1. A service turns into a WARNING or CRITICAL state
2. A Host check is received, OK (for gearman check only)
3. Ten seconds after the service failure, host reports CRITICAL (Down) - (host never actually goes down)
4. 50 seconds after host going critical, it checks in and reports OK again

Log (gearman passive check):
[1427343690] PASSIVE SERVICE CHECK: vm1-testvm;Service-Asterisk;2;NOK - Asterisk Service Down!!
[1427343690] SERVICE ALERT: vm1-testvm;Service-Asterisk;CRITICAL;HARD;1;NOK - Asterisk Service Down!!
[1427343690] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343700] HOST ALERT: vm1-testvm;DOWN;HARD;1;CRITICAL: Host not reported in - probably down
[1427343750] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343750] HOST ALERT: vm1-testvm;UP;HARD;1;OK

Log (NSCA check):
[1427344850] SERVICE ALERT: vm2-testvm;Memory;WARNING;HARD;1;WARNING: There have been no recent passive updates!
[1427344860] HOST ALERT: vm2-testvm;DOWN;HARD;1;CRITICAL: Host not reported in - probably down
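
For anyone wanting to check the timing in the first log (vm1-testvm), here is a minimal Python sketch that parses those lines and prints each event's offset from the first one. The regex and the helper names (`parse`, `deltas`) are just illustrative assumptions about the log layout shown above, not part of any Nagios tool.

```python
import re

# Log lines copied from the gearman example above.
LOG = """\
[1427343690] PASSIVE SERVICE CHECK: vm1-testvm;Service-Asterisk;2;NOK - Asterisk Service Down!!
[1427343690] SERVICE ALERT: vm1-testvm;Service-Asterisk;CRITICAL;HARD;1;NOK - Asterisk Service Down!!
[1427343690] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343700] HOST ALERT: vm1-testvm;DOWN;HARD;1;CRITICAL: Host not reported in - probably down
[1427343750] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343750] HOST ALERT: vm1-testvm;UP;HARD;1;OK
"""

# Assumed layout: "[epoch] EVENT TYPE: semicolon-separated details"
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def parse(log_text):
    """Return a list of (timestamp, event_type, details) tuples."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append((int(match.group(1)), match.group(2), match.group(3)))
    return events


def deltas(events):
    """Return (seconds since first event, event_type, host name) per event."""
    start = events[0][0]
    return [(ts - start, kind, details.split(";")[0]) for ts, kind, details in events]


for offset, kind, host in deltas(parse(LOG)):
    print(f"+{offset:>3}s  {kind:<22} {host}")
```

The offsets come out as 0, 10 and 60 seconds, which matches the sequence described in the numbered list: the host goes DOWN 10 seconds after the service alert and recovers 50 seconds after that.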


This is happening for all hosts, and it is becoming a pain, with 4 emails for every host whenever a service changes state. Any clues would be greatly appreciated.

Regards,
sspaise

Re: Strange state change issue

Posted: Thu Mar 26, 2015 2:49 pm
by lmiltchev
Do you have freshness enabled? Can you show us the "vm1-testvm" and "vm2-testvm" (and any other related) configs?

Re: Strange state change issue

Posted: Thu Mar 26, 2015 6:20 pm
by sspaise
Hi lmiltchev,

We do indeed have freshness checking enabled, both for the host and some of the services. Below I have included the configuration for "vm1-testvm" and its related configuration:

================
passive-host-invade
================

define host{
    name                 passive-host-invade
    use                  passive-host
    freshness_threshold  600
    contact_groups       +invade
    register             0
}

=====================
passive-service-invade.cfg
=====================

define service{
    name            passive-service-invade
    use             passive-service
    contact_groups  +invade
    register        0
}


I've uploaded the "vm1-testvm.cfg" and "nagios.cfg" to this thread. If I can provide anything further, just let me know!
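
As a quick sanity check on the freshness theory, here is a rough Python sketch comparing the timestamps from the first post against the 600 second `freshness_threshold` in the template above. It assumes the simplified rule "a result is stale once it is older than the threshold" and assumes the host does not override the template value in vm1-testvm.cfg (not shown here). Nagios itself takes more into account, so treat this as an approximation only.

```python
# Value taken from the passive-host-invade template above.
FRESHNESS_THRESHOLD = 600  # seconds


def is_stale(last_result_ts, now_ts, threshold=FRESHNESS_THRESHOLD):
    """Simplified assumption: stale means older than the threshold."""
    return (now_ts - last_result_ts) > threshold


# Timestamps from the vm1-testvm log in the first post.
last_passive_ok = 1427343690  # PASSIVE HOST CHECK ... OK
host_down_alert = 1427343700  # HOST ALERT ... DOWN

age = host_down_alert - last_passive_ok
print("age of last passive result:", age, "seconds")
print("stale under the simplified rule:", is_stale(last_passive_ok, host_down_alert))
```

Under these assumptions the last passive host result was only 10 seconds old when the DOWN alert was logged, well inside the threshold, so staleness alone would not seem to explain that alert.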

Re: Strange state change issue

Posted: Fri Mar 27, 2015 1:39 pm
by tmcdonald
I am curious to see what you have in your passive-host and passive-service templates. Can you post them?

Re: Strange state change issue

Posted: Fri Mar 27, 2015 1:44 pm
by tgriep
Edit your nagios.cfg file and change

check_host_freshness=0
to

check_host_freshness=1
Then restart Nagios

Re: Strange state change issue

Posted: Fri Mar 27, 2015 2:22 pm
by sspaise
Hi all,

@tmcdonald, passive host and service templates were posted in my reply to lmiltchev =)

@tgriep, thank you, it looks like this has resolved the issue. Going to do some further testing just to be sure!

Re: Strange state change issue

Posted: Fri Mar 27, 2015 2:25 pm
by tmcdonald
sspaise wrote: @tmcdonald, passive host and service templates were posted in my reply to lmiltchev =)
*cough* I knew that :) Was just testing you.

I'll go ahead and close this up now.

Edit: Unlocking per OP request.

Re: Strange state change issue

Posted: Tue Mar 31, 2015 11:01 am
by sspaise
Hi All,

@tmcdonald, thanks for the unlock sir!

So unfortunately after further testing it was apparent that changing the "check_host_freshness" value in the nagios config made no difference.

I was reading some of the documentation the other day and came across "on-demand checks", which got me thinking.

Currently we are seeing a service check go into a failed state, and then 10 seconds later the host reports DOWN. Based on what I read about on-demand checks, I believe this is caused by the Nagios logic that says: if a service on a host goes into a failed state, force a host check to see whether the host has gone down. That active check fails and reports the host as DOWN.

Does anyone know much about the on-demand checks, and can the host check be disabled if a service changes state? (we only want to use passive checks, no active)

Re: Strange state change issue

Posted: Tue Mar 31, 2015 5:19 pm
by ssax
The performance of on-demand host checks can be significantly improved by implementing the use of cached checks, which allow Nagios to forgo executing a host check if it determines a relatively recent check result will do instead.
Have you looked into cached checks? http://nagios.sourceforge.net/docs/nagi ... hecks.html
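
To illustrate the idea, here is a toy Python model of an on-demand host check with a cache horizon. It is not Nagios code. `stale_dummy_check` is a hypothetical stand-in for a passive host's check command (the real templates were attached rather than shown), the horizon value of 15 is only an example, and it is an assumption that a passive result would count as a recent enough result for caching purposes.

```python
OK, CRITICAL = 0, 2


def stale_dummy_check():
    """Hypothetical check command that always reports the host as down."""
    return CRITICAL, "CRITICAL: Host not reported in - probably down"


def on_demand_host_check(last_state, last_check_ts, now_ts, horizon, run_check):
    """Toy model: reuse the last result if it is within the horizon.

    Returns (state, used_cache). A horizon of 0 means caching is off.
    """
    if horizon > 0 and (now_ts - last_check_ts) <= horizon:
        return last_state, True
    state, _output = run_check()
    return state, False


# Timestamps from the first post: passive OK at ...690, DOWN alert at ...700.
# Assumption: the on-demand check ran at the time the DOWN alert was logged.
for horizon in (0, 15):
    state, used_cache = on_demand_host_check(
        OK, 1427343690, 1427343700, horizon, stale_dummy_check
    )
    print(f"horizon={horizon:>2}  state={state}  used_cache={used_cache}")
```

In this model, with caching off the dummy command runs and the host comes back CRITICAL, while with a horizon longer than the 10 second gap the earlier OK result is reused. Whether real Nagios behaves this way for passively checked hosts is exactly what would need testing.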