Re: [Nagios-devel] Passive host down result is interpreted as up on

Guest · Post by **Guest** » Mon Mar 19, 2007 7:45 am


Hi!

On 16 Mar 2007, at 18:02, Ton Voon wrote:

> I was wondering if anyone has seen this before. On a slave, we have
> a host that is marked as DOWN with a plugin output of "CRITICAL -
> Plugin timed out after 10 seconds", as expected. However, on the
> master, that host is marked as UP with the same text.
>
>
> The logs on the master server, show:
>
> [1174045717] EXTERNAL COMMAND:
> PROCESS_HOST_CHECK_RESULT;host1;0;PING OK - Packet loss = 0%, RTA =
> 0.37 ms|
>
> Host is marked as UP. Later on:
>
> [1174045949] EXTERNAL COMMAND:
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after
> 10 seconds|
>
> Failure arrives.
>
> [1174045949] HOST ALERT: host1;DOWN;HARD;1;CRITICAL - Plugin timed
> out after 10 seconds
>
> Marked it as DOWN with alert. As expected.
>
> [1174045951] Warning: The results of service '/ - partition' on
> host 'host1' are stale by 24 seconds (threshold=82 seconds). I'm
> forcing an immediate check of the service.
> [1174045953] SERVICE ALERT: host1;/ - partition;UNKNOWN;HARD;
> 1;UNKNOWN: Service results are stale
> [1174045959] EXTERNAL COMMAND:
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after
> 10 seconds|
>
> More passive results
>
> [1174045971] EXTERNAL COMMAND:
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after
> 10 seconds|
>
> And again, but this time...
>
> [1174045973] HOST ALERT: host1;UP;HARD;1;CRITICAL - Plugin timed
> out after 10 seconds
>
> Nagios has marked the host as UP, even though the
> PROCESS_HOST_CHECK_RESULT reported status 1 (DOWN).
>
>
> The complete nagios.log around this period is attached. I'm at a
> loss to understand why this has happened. Has anyone got any clues,
> or seen something similar?
>
> We haven't been able to reproduce this consistently yet.
>
> This is on Nagios 2.5 (with some local patches).

We think we've found the root problem.

In checks.c, if a host does not have a check_command, there is a
debug line that says: "No host check command specified, so no check
will be done (host state assumed to be unchanged)". However, it then
returns HOST_UP. We have amended this to return hst->current_state
instead.

In our distributed setup, we define a host without a check_command,
instead relying on the passive host results sent by the slave.
However, on the master, if a service on this host passes its
freshness threshold, a host check is scheduled, with the force flag.
This then gets to this portion of the code and returns a HOST_UP
state rather than the current state, thus showing an incorrect state
for the host.
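
To make the sequence concrete, here is a minimal sketch of the effect, not the actual Nagios code. The struct `sketch_host` and the functions `run_host_check` and `state_after_forced_check` are hypothetical names invented for this illustration; only the `HOST_UP`/`HOST_DOWN` values and the `current_state` field name follow the real source.

```c
#include <assert.h>
#include <stddef.h>

/* Host state codes, matching the 0/1 values seen in the passive results. */
#define HOST_UP   0
#define HOST_DOWN 1

/* Hypothetical, stripped-down stand-in for the Nagios host structure. */
typedef struct {
    const char *check_command; /* NULL: host relies on passive results only */
    int current_state;         /* last state recorded for the host */
} sketch_host;

/* Models only the "no check command" branch discussed in this thread.
 * patched == 0: old behaviour, always report HOST_UP.
 * patched != 0: new behaviour, report the state the host already has. */
static int run_host_check(const sketch_host *hst, int patched)
{
    assert(hst != NULL);
    if (hst->check_command == NULL)
        return patched ? hst->current_state : HOST_UP;

    /* A real active check would run here; it is not modelled. */
    return hst->current_state;
}

/* Replays the sequence from the log: a passive DOWN result arrives,
 * then a stale service forces a host check on the master. */
static int state_after_forced_check(int patched)
{
    sketch_host host = { NULL, HOST_UP };

    host.current_state = HOST_DOWN;                      /* passive result from the slave */
    host.current_state = run_host_check(&host, patched); /* forced host check */
    return host.current_state;
}
```

In this sketch `state_after_forced_check(0)` ends with `HOST_UP`, which is the wrong state we saw on the master, while `state_after_forced_check(1)` keeps `HOST_DOWN`.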

Our patch is below, made against Nagios 2.8.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon

Attachment: nagios_no_host_check_command_returns_current_state.patch
diff -ur nagios-2.8.original/base/checks.c nagios-2.8/base/checks.c
--- nagios-2.8.original/base/checks.c 2007-03-19 15:16:38.375621511 +0000
+++ nagios-2.8/base/checks.c 2007-03-19 15:19:31.983526254 +0000
@@ -2427,7 +2427,9 @@
 	printf("\tNo host check command specified, so no check will be done (host state assumed to be unchanged)!\n");
 #endif
 
-	return HOST_UP;
+	/* Altinity patch: This should return the current state, rather than assume server is up. Incorrect in a distributed setup */
+	/* return HOST_UP; */
+	return hst->current_state;
 	}
 
 	/* grab the host macros */


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]