Re: [Nagios-devel] Passive host down result is interpreted as up

Ton Voon wrote:
> Hi!
>
> On 16 Mar 2007, at 18:02, Ton Voon wrote:
>
>> I was wondering if anyone has seen this before. On a slave, we have a
>> host that is marked as DOWN with a plugin output of "CRITICAL - Plugin
>> timed out after 10 seconds", as expected. However, on the master, that
>> host is marked as UP with the same text.
>>
>>
>> The logs on the master server show:
>>
>> [1174045717] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;host1;0;PING
>> OK - Packet loss = 0%, RTA = 0.37 ms|
>>
>> Host is marked as UP. Later on:
>>
>> [1174045949] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> Failure arrives.
>>
>> [1174045949] HOST ALERT: host1;DOWN;HARD;1;CRITICAL - Plugin timed out
>> after 10 seconds
>>
>> Marked it as DOWN with alert. As expected.
>>
>> [1174045951] Warning: The results of service '/ - partition' on host
>> 'host1' are stale by 24 seconds (threshold=82 seconds). I'm forcing
>> an immediate check of the service.
>> [1174045953] SERVICE ALERT: host1;/ -
>> partition;UNKNOWN;HARD;1;UNKNOWN: Service results are stale
>> [1174045959] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> More passive results
>>
>> [1174045971] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> And again, but this time...
>>
>> [1174045973] HOST ALERT: host1;UP;HARD;1;CRITICAL - Plugin timed out
>> after 10 seconds
>>
>> Nagios has marked the host as UP, even though the
>> PROCESS_HOST_CHECK_RESULT reported state 1 (DOWN).
>>
>>
>> The complete nagios.log around this period is attached. I'm at a loss
>> to understand why this has happened. Does anyone have any clues, or has
>> anyone seen something similar?
>>
>> We haven't been able to reproduce this consistently yet.
>>
>> This is on Nagios 2.5 (with some local patches).
>
>
> We think we've found the root problem.
>
> In checks.c, if a host does not have a check_command, there is a debug
> line that says: "No host check command specified, so no check will be
> done (host state assumed to be unchanged)". However, it then returns
> HOST_UP. We have amended this to return hst->current_state instead.
>
> In our distributed setup, we define a host without a check_command and
> rely instead on the passive host results sent by the slave. However,
> on the master, if a service on this host exceeds its freshness threshold,
> a host check is scheduled with the force flag set. The forced check then
> reaches this portion of the code and returns HOST_UP rather than the
> current state, so the host is shown in the wrong state.
>
> Our patch is below, made against Nagios 2.8.
>
> Ton
>
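
The patch itself is not reproduced in this archived post. As a rough illustration only, the change Ton describes might look like the following C sketch. The struct `host_sketch`, the helper `run_plugin_sketch`, and the function names `check_host_before`, `check_host_after`, and `sketch_demo` are hypothetical stand-ins, not the real code in checks.c. Only `HOST_UP`, `check_command`, and `hst->current_state` come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Host states, numbered as in the passive results quoted above
 * (0 = UP, 1 = DOWN). */
enum { HOST_UP = 0, HOST_DOWN = 1, HOST_UNREACHABLE = 2 };

/* Hypothetical, stripped-down host record for illustration only. */
typedef struct {
    const char *name;
    const char *check_command; /* NULL when the host relies on passive results */
    int current_state;         /* last known state, e.g. from a passive result */
} host_sketch;

/* Stand-in for running an active check plugin; always reports UP here. */
int run_plugin_sketch(const host_sketch *hst) {
    (void)hst;
    return HOST_UP;
}

/* Behaviour described as the bug: with no check command, the host is
 * reported as UP regardless of its last known state. */
int check_host_before(const host_sketch *hst) {
    if (hst->check_command == NULL)
        return HOST_UP;
    return run_plugin_sketch(hst);
}

/* Behaviour described as the fix: with no check command, the state is
 * left unchanged by returning the current state. */
int check_host_after(const host_sketch *hst) {
    if (hst->check_command == NULL)
        return hst->current_state;
    return run_plugin_sketch(hst);
}

/* A forced check on a passive-only host that is currently DOWN. */
void sketch_demo(void) {
    host_sketch h = { "host1", NULL, HOST_DOWN };
    assert(check_host_before(&h) == HOST_UP);   /* the incorrect flip to UP */
    assert(check_host_after(&h) == HOST_DOWN);  /* state is preserved */
}
```

In this sketch, a forced check on a passive-only host no longer overrides the DOWN state set by the slave's last passive result, which matches the sequence in the master's log above.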

Good catch! I'll get this into CVS pronto.


Ethan Galstad,
Nagios Developer
---
Email: [email protected]
Website: http://www.nagios.org





This post was automatically imported from historical nagios-devel mailing list archives