Host check retries

Posted: Tue Jun 17, 2008 2:47 am
by Guest

Hi all,

I sent a related email to nagios-users last week but got no answer,
and since I believe I've found a bug, I'm posting to devel this time.
Sorry if this is not the right place to report the problem (I believe
it is).
The related email is attached, and I've been running some tests since
sending it.

First of all, let me show part of my log file:

[1213641058] SERVICE ALERT: test;ping;CRITICAL;SOFT;1;CRITICAL - x.x.x.x: rta nan, lost 100%
[1213641077] Warning: The check of host 'test' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host...
[1213641098] HOST ALERT: test;DOWN;SOFT;1;CRITICAL - x.x.x.x: rta nan, lost 100%
[1213641098] GLOBAL HOST EVENT HANDLER: test;(null);(null);(null);sd_host_incident
[1213641118] SERVICE ALERT: test;ping;CRITICAL;HARD;1;CRITICAL - x.x.x.x: rta nan, lost 100%
[1213641118] HOST ALERT: test;DOWN;HARD;1;CRITICAL - x.x.x.x: rta nan, lost 100%
[1213641118] HOST NOTIFICATION: emanuel;test;DOWN;host-notify-by-email;CRITICAL - x.x.x.x: rta nan, lost 100%
[1213641118] HOST NOTIFICATION: helpdesk;test;DOWN;fccn_HostNotify;CRITICAL - x.x.x.x: rta nan, lost 100%


Some considerations: I am not using regularly scheduled host checks;
my host checks always take 40s because that is the timeout set on
check_icmp; max_check_attempts is 10; and my general timeouts are:
service_check_timeout=90
host_check_timeout=120
event_handler_timeout=30
notification_timeout=30

The expected behavior is:
- when a host goes down, the first message is a service problem
- when a service problem is found, a host check is executed immediately
- the host check runs for 40s (the check_icmp timeout); if it does not
finish, it is stopped after 120s (host_check_timeout)
- there will be 9 soft down states
- the 10th attempt will be a hard down state and notifications will be sent
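
To make the expectation above concrete, here is a minimal Python sketch
of the attempt counting as I understand it. It is a simplified model
for illustration only, not the actual Nagios logic; the function name
expected_states is made up for this example.

```python
def expected_states(max_check_attempts, failed_checks):
    """Model the expected (attempt, state type) for consecutive failed host checks.

    Simplified assumption: every failed check increments the attempt
    counter, and the state becomes HARD once max_check_attempts is reached.
    """
    states = []
    for n in range(1, failed_checks + 1):
        attempt = min(n, max_check_attempts)
        state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        states.append((attempt, state_type))
    return states


# With max_check_attempts=10 I expect 9 SOFT states, then HARD on attempt 10.
timeline = expected_states(max_check_attempts=10, failed_checks=10)
for attempt, state_type in timeline:
    print(f"HOST ALERT: test;DOWN;{state_type};{attempt}")
```

In the log above, by contrast, the host goes from SOFT;1 straight to
HARD;1.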

The strange behavior I've found is:
- the nagios process waits only 20 seconds for the first host check, not
120 as expected, and then logs the orphaned-check warning shown in the
log. Then nagios executes an immediate host check
- the first check result arrives as normal after 40s, but nagios is
already executing another check
- the second check takes the normal ~40s, but the host immediately goes
into a HARD state
- there aren't 10 attempts as expected from max_check_attempts, because
of the strange behavior shown here.
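
The timing can be read directly off the log timestamps. A small Python
sketch of the arithmetic follows; the labels are my shorthand for the
log lines, and I am assuming the first host check started at about the
same time as the first service alert.

```python
# Epoch timestamps copied from the log excerpt; labels are shorthand summaries.
events = [
    (1213641058, "SERVICE ALERT ping CRITICAL SOFT 1"),
    (1213641077, "Warning: host check looks orphaned, immediate check scheduled"),
    (1213641098, "HOST ALERT DOWN SOFT 1"),
    (1213641118, "HOST ALERT DOWN HARD 1"),
]

# Offsets in seconds relative to the first service alert.
start = events[0][0]
offsets = [(ts - start, label) for ts, label in events]

for offset, label in offsets:
    print(f"+{offset:3d}s  {label}")
```

The orphan warning comes 19s after the service alert, the first host
result at 40s, and the HARD state at 60s, which is 41s after the
immediate check was scheduled.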


So my question is, is this a bug, or a configuration problem?

Thank you very much for any help, since this problem is driving my
helpdesk team nuts :O because of the false alarms.


Best regards,
Emanuel Massano


Attached message:

Date: Thu, 12 Jun 2008 14:21:55 +0100
From: Emanuel Massano
Organization: FCCN
To: [email protected]
Subject: Host check retries

Hi all,

I'm seeing strange nagios behavior in my host checks. The problem is
the number of retries before a hard state is reached. I can't figure
out what's wrong, so I'm asking you for some help.

My nagios setup is 3.02, with flap detection disabled. I think the rest
is a standard standalone setup.

For my host configuration I don't want regularly scheduled host checks,
so I have the check_interval set to 0. The configuration is below:

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]