Hi everyone,
I have a Nagios server using a mix of Gearman and NSCA checks. I'm seeing a strange problem and wondering if anyone has an idea as to why. Here's the issue:
1. A service turns into a WARNING or CRITICAL state
2. A Host check is received, OK (for gearman check only)
3. Ten seconds after the service failure, host reports CRITICAL (Down) - (host never actually goes down)
4. 50 seconds after host going critical, it checks in and reports OK again
Log (gearman passive check):
[1427343690] PASSIVE SERVICE CHECK: vm1-testvm;Service-Asterisk;2;NOK - Asterisk Service Down!!
[1427343690] SERVICE ALERT: vm1-testvm;Service-Asterisk;CRITICAL;HARD;1;NOK - Asterisk Service Down!!
[1427343690] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343700] HOST ALERT: vm1-testvm;DOWN;HARD;1;CRITICAL: Host not reported in - probably down
[1427343750] PASSIVE HOST CHECK: vm1-testvm;0;OK
[1427343750] HOST ALERT: vm1-testvm;UP;HARD;1;OK
Log (NSCA check):
[1427344850] SERVICE ALERT: vm2-testvm;Memory;WARNING;HARD;1;WARNING: There have been no recent passive updates!
[1427344860] HOST ALERT: vm2-testvm;DOWN;HARD;1;CRITICAL: Host not reported in - probably down
This is happening for all hosts, and is becoming a pain what with 4 emails for every host when a service changes state. Any clues would be greatly appreciated.
Regards,
sspaise
Strange state change issue
Re: Strange state change issue
Do you have freshness enabled? Can you show us the "vm1-testvm" and "vm2-testvm" (and any other related) configs?
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Strange state change issue
Hi lmiltchev,
We do indeed have freshness checking enabled, both for the host and some of the services. Below I have included the configuration for "vm1-testvm" and its related configuration:
================
passive-host-invade
================
define host{
    name                    passive-host-invade
    use                     passive-host
    freshness_threshold     600
    contact_groups          +invade
    register                0
}
=====================
passive-service-invade.cfg
=====================
define service{
    name                    passive-service-invade
    use                     passive-service
    contact_groups          +invade
    register                0
}
I've uploaded the "vm1-testvm.cfg" and "nagios.cfg" to this thread, if I can provide anything further just let me know!
Attachments:
- nagios.cfg (43.74 KiB)
- vm1-testvm.cfg (8.46 KiB)
Re: Strange state change issue
I am curious to see what you have in your passive-host and passive-service templates. Can you post them?
Former Nagios employee
Re: Strange state change issue
Edit your nagios.cfg file and change
    check_host_freshness=0
to
    check_host_freshness=1
Then restart Nagios.
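For context, host freshness is controlled in two places: a global switch in nagios.cfg and per-host directives in the object configuration. Below is a minimal sketch of how the two fit together. The template name (passive-host-example) and the command name (host-is-stale) are hypothetical, and the command is assumed to be defined elsewhere and to return CRITICAL. The 600-second threshold comes from the template posted earlier in this thread; the 60-second interval is only illustrative.

```cfg
# --- nagios.cfg (main configuration file) ---
# Global switch: allow Nagios to run freshness checks on hosts at all.
check_host_freshness=1
# How often (in seconds) Nagios looks for stale host results. Illustrative value.
host_freshness_check_interval=60

# --- object configuration (hypothetical template, for illustration only) ---
define host{
    name                    passive-host-example   ; hypothetical name
    passive_checks_enabled  1                      ; accept passive results
    check_freshness         1                      ; per-host freshness switch
    freshness_threshold     600                    ; seconds before a result counts as stale
    check_command           host-is-stale          ; hypothetical command, run when the result is stale
    register                0                      ; template only, not a real host
}
```

As far as the freshness documentation describes it, when a host's last result is older than freshness_threshold, Nagios runs the host's check_command as an active check, which is one way a message such as "Host not reported in" can appear in the log.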
Re: Strange state change issue
Hi all,
@tmcdonald, passive host and service templates were posted in my reply to lmiltchev =)
@tgriep, thank you, it looks like this has resolved the issue. Going to do some further testing just to be sure!
Re: Strange state change issue
sspaise wrote: @tmcdonald, passive host and service templates were posted in my reply to lmiltchev =)
*cough* I knew that :) Was just testing you.
I'll go ahead and close this up now.
Edit: Unlocking per OP request.
Re: Strange state change issue
Hi All,
@tmcdonald, thanks for the unlock sir!
So unfortunately after further testing it was apparent that changing the "check_host_freshness" value in the nagios config made no difference.
I was reading some of the documentation the other day and came across "on-demand checks", which got me thinking.
Currently, a service check goes into a failed state, and 10 seconds later the host is reported DOWN. Based on what I read about on-demand checks, I believe this is caused by Nagios logic along these lines: if a service on a host goes into a failed state, Nagios forces a host check to see whether the host has gone down; that active check fails, and the host is reported DOWN.
Does anyone know much about the on-demand checks, and can the host check be disabled if a service changes state? (we only want to use passive checks, no active)
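To make the question concrete, here is a sketch of what a passive-only host template could look like. The template name is hypothetical, and the contents of the real passive-host template are not shown in this thread, so this is an assumption about the general shape rather than the actual configuration. The object definition documentation describes active_checks_enabled as covering both regularly scheduled and on-demand checks of a host; whether setting it to 0 stops the behaviour described here has not been verified.

```cfg
define host{
    name                    passive-only-host-example  ; hypothetical name
    active_checks_enabled   0     ; no scheduled or on-demand active checks
    passive_checks_enabled  1     ; accept results submitted via Gearman or NSCA
    check_freshness         1     ; still flag hosts that stop reporting
    freshness_threshold     600   ; seconds, as in passive-host-invade
    register                0     ; template only, not a real host
}
```

One caveat, as far as the freshness documentation goes: a freshness check is executed as an active check even when active checks are disabled for the host, so check_command still matters for a template like this.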
Re: Strange state change issue
Have you looked into cached checks? http://nagios.sourceforge.net/docs/nagi ... hecks.html
"The performance of on-demand host checks can be significantly improved by implementing the use of cached checks, which allow Nagios to forgo executing a host check if it determines a relatively recent check result will do instead."
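Cached checks are tuned in nagios.cfg through the check horizon options. A minimal sketch, with illustrative values rather than a recommendation for this setup:

```cfg
# nagios.cfg (main configuration file)
# Maximum age, in seconds, of a previous host check result that Nagios may
# reuse instead of running a new on-demand host check.
cached_host_check_horizon=15
# Same idea for on-demand service checks.
cached_service_check_horizon=15
```

A larger horizon means more on-demand checks are answered from the cache, at the cost of relying on older results.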