Re: [Nagios-devel] Race condition in freshness checking

Ton Voon wrote:
> Hi!
>
> We found a bug in the calculation of the latency for a passive check. This has
> highlighted a possible race condition re: freshness checking. We wanted to get
> some ideas on what is the best approach to fix this.
>
> Background:
>
> We have a master/slave arrangement, with freshness checking
> (freshness_threshold=0) of slave services on the master.
>
> Looking in the NDO db, we realised that the latency values for passive results
> were incorrectly calculated - sometimes latency values could be 1000x out. The
> patch is attached. However, since using this patch, we've seen occasional race
> conditions.
>
> Problem:
>
> Within checks.c:check_service_result_freshness, if a service has passed its
> expiration_time, it is marked as is_being_freshened and a forced service check
> is scheduled. However, if a passive result for this service is processed before
> this forced check is run, then the service is marked as stale and the state is
> inconsistent between master and slave.
>
> Possible solutions:
>
> - If a check result is processed with is_being_freshened set for the service,
> then remove forced check from schedule if it exists.

Sounds like a good solution, since the service will be marked as 'is_being_checked'
when the check actually runs, in which case it's pointless to update the status
as it will be overwritten by the master's own active check anyways.
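
A minimal sketch of that first option, assuming a simplified service record. The struct, the forced_check_pending flag (a stand-in for an entry in the scheduling queue) and handle_passive_result() are hypothetical names for illustration, not the real Nagios structures or functions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a service record. */
typedef struct cancel_sketch_service {
    bool is_being_freshened;   /* a freshness check has been scheduled */
    bool forced_check_pending; /* stand-in for the queued forced check */
    time_t last_check;         /* time of the most recent result */
} cancel_sketch_service;

/*
 * Process a passive result. If the service was waiting on a forced
 * freshness check, drop that check and clear the flag.
 * Returns true if a pending forced check was cancelled.
 */
bool handle_passive_result(cancel_sketch_service *svc, time_t result_time)
{
    bool cancelled = false;

    assert(svc != NULL);
    svc->last_check = result_time;

    if (svc->is_being_freshened) {
        if (svc->forced_check_pending) {
            svc->forced_check_pending = false; /* remove from schedule */
            cancelled = true;
        }
        svc->is_being_freshened = false;
    }
    return cancelled;
}
```

In real code the cancellation would have to locate and remove the event from the scheduler queue, which this sketch reduces to a single flag.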

> - Change is_being_freshened to stale_time (0 if not stale). On running the
> forced check, if stale_time is less than last_check_time (+ latency?), break out
> of running the forced check.
>

This I didn't quite get. You mean the passive check should alter the figure passed
in is_being_freshened? If so, what if stale_time is exactly 1? How can Nagios then
determine that it's actually received a result rather than just being updated by
the passive check-result coming in?
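
For reference, one possible reading of the stale_time proposal, sketched under assumptions. The struct and both functions are hypothetical, and the comparison is the strict "less than" from the quoted text, so a result arriving in the same second as the stale mark would still let the forced check run (presumably the reason for the "+ latency?" question):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a service record. */
typedef struct stale_sketch_service {
    time_t stale_time; /* when the service was found stale; 0 if not stale */
    time_t last_check; /* time of the most recent result */
} stale_sketch_service;

/* Freshness checker: record when the service was found stale. */
void mark_stale(stale_sketch_service *svc, time_t now)
{
    assert(svc != NULL);
    svc->stale_time = now;
}

/*
 * Forced check about to run: skip it if the service is not marked
 * stale, or if a result arrived after it was marked stale.
 */
bool should_run_forced_check(const stale_sketch_service *svc)
{
    assert(svc != NULL);
    if (svc->stale_time == 0)
        return false; /* not stale at all */
    if (svc->stale_time < svc->last_check)
        return false; /* a newer result has already arrived */
    return true;
}
```

Under this reading the passive result never writes to stale_time; it only updates last_check, and the forced check compares the two timestamps.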

I'm sure you thought of it, but the simplest way should be to re-check the time
since the last check result arrived when the forced check is being run, and cancel
it if the result is fresh enough. That way you'll keep the change to a single spot
in the code and it'll be quite maintainable, provided a comment is added that
explains the anomaly in the code path.
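
Roughly what I have in mind, as a sketch only. The struct and function names are made up for illustration, and the expiration rule (last check plus threshold) is an assumed simplification of whatever the freshness code really computes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a service record. */
typedef struct recheck_sketch_service {
    time_t last_check;        /* time of the most recent result */
    int freshness_threshold;  /* seconds a result stays fresh */
    bool is_being_freshened;  /* a freshness check has been scheduled */
} recheck_sketch_service;

/* Assumed rule: stale once last_check + threshold lies in the past. */
bool result_is_stale(const recheck_sketch_service *svc, time_t now)
{
    assert(svc != NULL);
    time_t expiration_time = svc->last_check + svc->freshness_threshold;
    return expiration_time < now;
}

/*
 * Called when the forced check is about to run. A passive result may
 * have arrived between scheduling and execution, so re-test freshness
 * here and skip the check if the service is fresh again.
 * Returns true if the forced check should go ahead.
 */
bool begin_forced_freshness_check(recheck_sketch_service *svc, time_t now)
{
    assert(svc != NULL);
    if (svc->is_being_freshened && !result_is_stale(svc, now)) {
        svc->is_being_freshened = false; /* result arrived in time */
        return false;
    }
    return true;
}
```

The passive-result path stays untouched; only the point where the forced check starts needs the extra test and the explanatory comment.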

--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]